[CI][AMD] Add AMD GPU CI and fix some related bugs #694
Conversation
- …nd clarity (tile-ai#668): Enhanced buffer index handling to address precision issues by removing redundant operations. Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. Updated related documentation to reflect changes in buffer management practices.
- …ed flexibility: Introduced a new input.txt file for configurable parameters. Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. Streamlined the main function for better clarity and organization. Added a new test script to facilitate running the example with specified parameters.
- … example with swizzle layout annotations: Deleted input.txt and test.sh files as they are no longer needed. Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. Reintroduced swizzle usage in the kernel for better performance.
- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. Removed outdated comments and improved code organization for better readability.
- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. Streamlined the `main` function parameter formatting for consistency. Removed unnecessary blank lines to enhance overall code organization.
- Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support. Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads. Updated the TVM submodule to the latest commit. Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
- Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests. The workflow includes steps for setting up a Python environment, running format checks, and executing tests. Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
- …orchaudio packages to the virtual environment for improved dependency management.
- …quirements-test.txt
- …nstall dependencies from requirements-test.txt
- …from requirements-test.txt before installation
- …dependency management.
- …e the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
- Included a step to execute the `install_rocm.sh` script for improved setup. Removed an unnecessary blank line for better readability in the workflow script.
- …instead of all tests.
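The `amd_ci.yml` workflow described in the commits above could be sketched roughly as follows. This is an illustrative assumption, not the actual file from this PR: job names, runner labels, Python version, and step details are all hypothetical.

```yaml
# Hypothetical sketch of an AMD GPU CI workflow; runner labels, versions,
# and paths are illustrative assumptions, not the PR's actual amd_ci.yml.
name: AMD CI
on: [pull_request]

jobs:
  format-check:
    runs-on: [self-hosted, amd-gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install test requirements
        run: pip install -r requirements-test.txt
      - name: Run format check
        run: bash format.sh

  build-test-amd:
    needs: format-check
    runs-on: [self-hosted, amd-gpu]
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: Build and install with ROCm enabled
        run: USE_ROCM=True pip install -e .
      - name: Run AMD tests only
        run: pytest testing/python/amd/test_tilelang_test_amd.py
```

Gating the GPU build on the format check, and running only the AMD test file rather than the whole suite, matches the intent of the commit messages above.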
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the format check before pushing. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes
Hello @Alex4210987, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on improving the AMD Flash Attention example by expanding its autotuning configuration space and correcting a variable indexing issue. Additionally, it updates the TVM submodule and refactors the HIP backend code to use the new FFI function registration mechanism, while also introducing new built-in functions for MFMA operations.
Highlights
- Submodule Update: The 3rdparty/tvm submodule has been updated to a newer commit, bringing in upstream changes.
- AMD Flash Attention Example Improvement: The autotuner configuration ranges for block_M, block_N, threads, and num_split_q in example_amd_flash_attn_fwd.py have been expanded to explore a wider parameter space.
- AMD Flash Attention Example Fix: A variable indexing issue (bx[0] to bx) was corrected in the AMD Flash Attention example, ensuring proper iteration logic.
- New Built-in Functions: Two new built-in functions, tvm_mfma and tvm_mfma_store, have been added to src/op/builtin.cc to support MFMA (Matrix Fused Multiply-Add) operations.
- HIP Backend FFI Migration: The HIP backend code (src/target/codegen_hip.cc and src/target/rt_mod_hip.cc) has been refactored to transition from the old tvm/runtime/registry.h API to the new tvm/ffi/function.h API for global function lookups.
- HIP Backend Constant Definition: A new constant kTVMGridConstant has been defined in src/target/rt_mod_hip.cc.
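The expanded autotuner search space mentioned in the highlights is essentially a Cartesian product over parameter ranges. The sketch below is illustrative only: the value ranges and the `get_configs` shape are assumptions, not the exact contents of `example_amd_flash_attn_fwd.py`.

```python
# Illustrative sketch of an autotune config grid like the expanded
# get_configs; the concrete value ranges here are assumptions, not
# the ones actually used in the PR.
from itertools import product

def get_configs():
    block_M = [32, 64, 128]
    block_N = [32, 64, 128]
    threads = [128, 256]
    num_split_q = [1, 2, 4]
    # One candidate kernel configuration per combination.
    return [
        dict(block_M=m, block_N=n, threads=t, num_split_q=s)
        for m, n, t, s in product(block_M, block_N, threads, num_split_q)
    ]

configs = get_configs()
print(len(configs))  # 3 * 3 * 2 * 3 = 54 candidates
```

Widening any one of these lists multiplies the number of candidates, which is why expanding the ranges meaningfully enlarges the space the autotuner explores.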
Code Review

This pull request updates a submodule, adds support for AMD MFMA instructions, and fixes some AMD-related code. My review found a critical issue in `example_amd_flash_attn_fwd.py` where a change in handling a loop variable will likely lead to an infinite loop; I've provided suggestions to fix this. Additionally, I've pointed out some code duplication in `src/target/rt_mod_hip.cc` and suggested a refactoring to improve maintainability. The other changes related to the new FFI API and built-ins look correct.
```diff
 bx = b_split
-with T.While(bx[0] < num_q_blocks):
+with T.While(bx < num_q_blocks):
     acc_o = T.alloc_fragment([block_M, dim], accum_dtype)
     m_i = T.alloc_fragment([block_M], accum_dtype)
     l_i = T.alloc_fragment([block_M], accum_dtype)
     T.fill(acc_o, 0)
     T.fill(m_i, -T.infinity(accum_dtype))
     T.fill(l_i, 0)
-    current_bx = bx[0]
+    current_bx = bx
```
These changes appear to introduce a critical bug. The original implementation correctly used `bx` as a mutable scalar variable (a buffer of size 1) to manage the state of the `while` loop. By changing `bx[0] = b_split` to `bx = b_split`, the Python variable `bx` is rebound to the immutable outer loop variable `b_split`. Consequently, the update on line 190 becomes a local Python reassignment that does not affect the generated TIR code. This will likely result in an infinite loop. The `T.alloc_var` on line 107 also becomes dead code. Please revert these changes to use `bx[0]` for accessing and updating the loop counter.
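The reviewer's point can be illustrated with a plain-Python analogy (hypothetical helper name, not actual TVMScript): a one-element buffer gives the loop a counter that in-place stores can advance, whereas rebinding a plain name in staged code would emit no store at all.

```python
# Pure-Python analogy of the persistent-block loop: a size-1 list stands in
# for the T.alloc_var buffer, so `bx[0] = ...` mutates state that the loop
# condition can observe. `persistent_blocks` is an illustrative name only.

def persistent_blocks(b_split, num_split_q, num_q_blocks):
    """Return the q-block indices processed by work unit `b_split`."""
    bx = [b_split]                        # mutable scalar, like bx[0] in TIR
    processed = []
    while bx[0] < num_q_blocks:           # mirrors T.While(bx[0] < num_q_blocks)
        current_bx = bx[0]
        processed.append(current_bx)
        bx[0] = current_bx + num_split_q  # in-place store advances the loop
    return processed

print(persistent_blocks(1, 4, 10))  # [1, 5, 9]
```

In staged TIR, writing `bx = current_bx + num_split_q` instead would rebind only the Python name and emit no store into the buffer, which is why the reviewer expects the generated loop counter never to advance.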
Suggested change:

```python
bx[0] = b_split
with T.While(bx[0] < num_q_blocks):
    acc_o = T.alloc_fragment([block_M, dim], accum_dtype)
    m_i = T.alloc_fragment([block_M], accum_dtype)
    l_i = T.alloc_fragment([block_M], accum_dtype)
    T.fill(acc_o, 0)
    T.fill(m_i, -T.infinity(accum_dtype))
    T.fill(l_i, 0)
    current_bx = bx[0]
```
```diff
 Output[bz, q_block_offset + i, by, j] = acc_o[i, j] * l_inv[i]
-bx[0] = current_bx + num_split_q
+bx = current_bx + num_split_q
```
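The `m_i` / `l_i` state and the final `l_inv` rescale of `Output` seen in these hunks are the standard online-softmax bookkeeping. Below is a minimal pure-Python sketch of that recurrence, for orientation only; it is not the kernel's actual code.

```python
# Illustrative one-pass (online) softmax: a running max m and running
# denominator l mirror m_i / l_i in the kernel, and the final 1/l factor
# plays the role of l_inv in the Output rescale.
import math

def online_softmax(scores):
    m = -math.inf            # like T.fill(m_i, -T.infinity(accum_dtype))
    l = 0.0                  # like T.fill(l_i, 0)
    for s in scores:
        m_new = max(m, s)
        # Rescale the old partial sum when the running max changes.
        l = l * math.exp(m - m_new) + math.exp(s - m_new)
        m = m_new
    return [math.exp(s - m) / l for s in scores]

weights = online_softmax([1.0, 2.0, 3.0])
print(sum(weights))  # ~1.0
```

The same rescaling trick lets the kernel fold in one key block at a time without ever materializing the full score row.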
```diff
 std::string code = cg.Finish();
-if (const auto *f = Registry::Get("tilelang_callback_hip_postproc")) {
-  code = (*f)(code, target).operator std::string();
+// Use the new FFI API to get registered functions
+using ffi::Function;
+if (auto f = Function::GetGlobal("tilelang_callback_hip_postproc")) {
+  code = (*f)(code, target).cast<std::string>();
 }
```
This block of code for post-processing is duplicated from lines 66-72 in `BuildTileLangHIP`. To improve maintainability and avoid code duplication, consider extracting this logic into a separate helper function. For example:

```cpp
namespace {
std::string PostProcessHipCode(std::string code, const Target& target) {
  using tvm::ffi::Function;
  if (auto f = Function::GetGlobal("tilelang_callback_hip_postproc")) {
    code = (*f)(code, target).cast<std::string>();
  }
  return code;
}
}  // namespace
```

You can then call this helper from both `BuildTileLangHIP` and `BuildTileLangHIPWithoutCompile`.
- Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang. Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
- Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning. Modified `T.gemm` calls in `fast_flashattn` to use `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
- Refined block sizes and thread counts, and added new parameters in `get_configs` to optimize autotuning. Adjusted the `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
- Reformatted the `get_configs` function for improved readability by aligning parameters. Adjusted the `fast_flashattn` function to clarify the conditional logic for accumulation, ensuring better handling of causal conditions.
- Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements. Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
- …ssues during environment setup
- …ove obsolete package copying steps
- …mmands from AMD CI workflow
- …ronment setup and automatic commit of lint changes
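The causal accumulation condition mentioned in the commits above can be sketched in plain Python. The helper below is hypothetical (not the kernel's actual code): it captures only the idea that, under a causal mask, a query block never needs key blocks that lie entirely after its last query row.

```python
# Illustrative causal-blocking helper (hypothetical name): under a causal
# mask, query row i may attend only to key positions <= i, so key blocks
# past the query block's last row can be skipped entirely.

def causal_k_block_count(q_block, block_M, block_N, seq_len):
    """Number of key blocks a causal query block must visit."""
    q_end = min((q_block + 1) * block_M, seq_len)  # one past last query row
    return (q_end + block_N - 1) // block_N        # ceil-div over key blocks

print(causal_k_block_count(2, 64, 64, 256))  # 3: key blocks 0, 1, 2
```

Later query blocks visit more key blocks, which is why causal FlashAttention kernels do roughly half the work of the non-causal case on average.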
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Caution: Review failed. The pull request is closed.

Walkthrough

Introduces an AMD-focused CI workflow, adds ROCm-specific requirements, updates an AMD FlashAttention example with new config parameters and kernel flow, adds new TL builtins for MFMA/RDNA WMMA, switches HIP backend callback plumbing to TVM FFI with new FFI exports, updates the HIP codegen include/namespace, and bumps the TVM submodule.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant PR as Pull Request
    participant GH as GitHub Actions
    participant Fmt as format-check (AMD)
    participant Build as build-test-amd (AMD)
    PR->>GH: Open/Update PR
    GH->>Fmt: Run job
    Fmt->>Fmt: Setup Python + venv cache
    Fmt->>Fmt: Install requirements-test + flash_attn
    Fmt->>Fmt: Run format.sh (auto-commit if changes)
    Fmt-->>GH: Status
    GH->>Build: Run (needs: format-check)
    Build->>Build: Setup Python + venv cache
    Build->>Build: Install ROCm torch via index + requirements-rocm
    Build->>Build: Build/install wheel (USE_ROCM=True) and repo
    Build->>Build: pytest testing/python/amd/test_tilelang_test_amd.py
    Build-->>GH: Status
```

```mermaid
sequenceDiagram
    autonumber
    participant TL as BuildTileLangHIP(*)
    participant FFI as TVM FFI
    participant PP as hip_postproc callback
    participant CC as hip_compile callback
    participant MOD as runtime::Module
    TL->>FFI: Function::GetGlobal("tilelang_callback_hip_postproc")
    FFI-->>TL: PP
    TL->>PP: Post-process(source) -> string
    TL->>FFI: Function::GetGlobal("tilelang_callback_hip_compile")
    FFI-->>TL: CC
    TL->>CC: Compile(source, target) -> string/binary
    TL->>TL: Create runtime::Module
    TL-->>MOD: Return module
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes