[Bugfix]:Fix atomic add auto vectorize memory access out of bound error #742
Walkthrough: Refactors innermost-loop vectorization in src/transform/atomicadd_vectorize.cc.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant IR as Input IR (loops)
participant Pass as AtomicAddVectorize
participant Map as Substitution Map (vmap)
participant IRp as Transformed IR
IR->>Pass: Visit innermost loop
Pass->>Map: Build vmap
Note right of Map: tx_var = tx_var * vector_size_<br/>loop_var = outer_var * vector_size_<br/>(replaces truncmod/truncdiv)
Pass->>IRp: Substitute loop body using vmap
Pass-->>IRp: Preserve outer bounds and vector-size checks
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Summary of Changes
Hello @yyttt6, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses and fixes an out-of-bounds memory access error that could occur during the atomic add auto-vectorization pass. The core of the fix involves simplifying the index calculation logic within the AtomicAddVectorizeRewriter
to ensure that memory addresses are correctly computed during vectorized operations, thereby preventing erroneous access.
Highlights
- Bug Fix: Corrected an out-of-bounds memory access error occurring during the atomic add auto-vectorization process.
- Vectorization Logic Refinement: Simplified the variable mapping within the AtomicAddVectorizeRewriter by removing complex truncmod and truncdiv operations, ensuring accurate memory addressing for vectorized atomic additions.
Code Review
This pull request provides a crucial fix for an out-of-bounds memory access bug within the atomic add auto-vectorization logic. The previous variable remapping was complex and could generate invalid indices. The updated mapping is simpler and more robust, ensuring that loop variables remain within their expected bounds. This is a solid improvement for correctness and stability.
vmap.Set(tx_var, tx_var * vector_size_);
vmap.Set(fnode->loop_var, outer_var * vector_size_);
This simplified variable mapping is a great improvement. The previous logic for remapping fnode->loop_var
was prone to out-of-bounds errors when the number of threads exceeded the loop's extent, because truncdiv(tx_var, extent / vector_size_)
could grow too large.
By removing the dependency on tx_var
for the loop variable and using a simpler mapping, you've eliminated this bug and made the transformation more robust.
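The reviewer's point can be checked with plain integer arithmetic. The following sketch uses hypothetical values (extent = 8, vector_size = 4, 128 threads) to contrast the old truncdiv-based remapping with the new contiguous one:

```python
# Model of the two index remappings, with hypothetical values:
# innermost loop extent = 8, vector_size_ = 4, 128 threads per block.
extent = 8
vector_size = 4
num_threads = 128

# Old mapping: the loop variable depended on tx via
# truncdiv(tx, extent / vector_size_), which grows with the thread id.
old_indices = [tx // (extent // vector_size) for tx in range(num_threads)]

# New mapping: loop_var -> outer_var * vector_size_, where outer_var
# ranges over the reduced loop extent and never depends on tx.
new_indices = [outer * vector_size for outer in range(extent // vector_size)]

# With 128 threads the old mapping runs far past the extent of 8 ...
assert max(old_indices) == 63   # 127 // 2 -> well out of bounds
# ... while the new mapping stays inside [0, extent).
assert max(new_indices) < extent
```

This is only a model of the arithmetic, not the TVM pass itself, but it shows why decoupling the loop index from the thread id removes the out-of-bounds growth.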
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/transform/atomicadd_vectorize.cc (1)
Lines 173-176: Use the known thread_var instead of scanning for "tx"; optionally fold constants post-subst.

Two small polish items:
- Relying on name_hint == "tx" is brittle. You already have thread_var in VectorizeAtomicAdd; pass it into the rewriter and use it directly in vmap.
- After substitution, folding trivial arithmetic (e.g., outer_var * 2 inside indices) would help, though later passes may handle it.
You can pass the thread var into the rewriter and avoid the name-based lookup:
// Constructor
class AtomicAddVectorizeRewriter : public StmtExprMutator {
public:
  AtomicAddVectorizeRewriter(AtomicAddVectorizePlanResult plan, Var thread_var)
      : vector_size_(plan.vector_size), condition_(plan.condition),
        dynamic_(plan.dynamic), thread_var_(std::move(thread_var)) {}

private:
  Var thread_var_;
  ...
};

// Call site
auto rewriter = AtomicAddVectorizeRewriter(res, thread_var);

// In VisitStmt_ where vmap is built
// Replace the tx discovery block with:
Var tx_var = thread_var_;
// And keep:
vmap.Set(tx_var, tx_var * vector_size_);
vmap.Set(fnode->loop_var, outer_var * vector_size_);

Optionally, if you want immediate simplification after substitution:
Stmt body = Substitute(fnode->body, vmap);
// (Optional) If you have an Analyzer at hand:
// arith::Analyzer analyzer;
// body = analyzer.Simplify(body);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
src/transform/atomicadd_vectorize.cc (1 hunks)
🔇 Additional comments (1)
src/transform/atomicadd_vectorize.cc (1)
Lines 173-175: Extern definitions for AtomicAddx2 and AtomicAddx4 verified

Search results confirm that both intrinsics are defined and linked:
- AtomicAddx2 (TL_DEVICE overloads for half_t, bfloat16_t, and float) in src/tl_templates/cuda/common.h (lines 164–190)
- AtomicAddx4 (TL_DEVICE for float4) in src/tl_templates/cuda/common.h (lines 191–195)
- Python-side wrappers in tilelang/language/customize.py invoke these externs via T.call_extern
Vector index mapping fix remains correct and there are no unresolved symbol risks at runtime. Ready to merge.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/transform/atomicadd_vectorize.cc (1)
Lines 119-124: Guard against vector_size_ dropping to 0 in refinement loop

If IndiceCanVectorize(...) returns false even for 1, the current loop halves to 0 and can keep spinning or pass 0 into the predicate. Add a lower-bound guard and early-exit fallback to "no vectorization."
Apply this diff:
- while (!IndiceCanVectorize(elem_offset, thread_var, thread_extent,
-                            vector_size_, &analyzer_)) {
-   vector_size_ /= 2;
- }
+ while (vector_size_ > 1 &&
+        !IndiceCanVectorize(elem_offset, thread_var, thread_extent,
+                            vector_size_, &analyzer_)) {
+   vector_size_ /= 2;
+ }
+ // If still not vectorizable at width 1, keep it scalar (no vectorization).
+ if (vector_size_ == 1 &&
+     !IndiceCanVectorize(elem_offset, thread_var, thread_extent,
+                         vector_size_, &analyzer_)) {
+   return;
+ }
🧹 Nitpick comments (4)
src/transform/atomicadd_vectorize.cc (4)
Lines 162-171: Avoid hard-coding the thread var name "tx"

Relying on a name-hint match is brittle. Prefer the bound thread axis from the For's thread_binding (if present), and only fall back to scanning. This makes the transform robust to different naming (e.g., "threadIdx.x", "tx0", etc.).
Apply this diff:
- Var tx_var;
- PostOrderVisit(fnode->body, [&tx_var](const ObjectRef &node) {
-   if (const VarNode *var = node.as<VarNode>()) {
-     if (var->name_hint == "tx") {
-       tx_var = GetRef<Var>(var);
-     }
-   }
- });
- ICHECK(tx_var.defined()) << "Failed to find tx var";
+ Var tx_var;
+ if (fnode->thread_binding.defined()) {
+   tx_var = fnode->thread_binding.value()->var;
+ }
+ if (!tx_var.defined()) {
+   PostOrderVisit(fnode->body, [&tx_var](const ObjectRef &node) {
+     if (const VarNode *var = node.as<VarNode>()) {
+       if (var->name_hint == "tx") {
+         tx_var = GetRef<Var>(var);
+       }
+     }
+   });
+ }
+ ICHECK(tx_var.defined()) << "Failed to locate thread var (thread_binding or 'tx')";
Lines 90-93: Comment contradicts behavior; clarify intent

The comment says we "disable this GCD optimization," but the code still applies GCD with the loop extent. Update the comment to reflect the conservative GCD intersection actually used.
Apply this diff:
- // i // 2, i % 8 can also be vectorized as factor 16
- // so we should disable this GCD optimization
+ // Patterns like i // 2 and i % 8 could allow larger vector factors,
+ // but we conservatively intersect with the loop extent via GCD.
Lines 125-129: Dynamic path computes condition_ but it's unused in rewriting

Planner records condition_ for dynamic shapes, but the rewriter returns the original loop when dynamic_ is true and never consults condition_. Either:
- plumb condition_ into the rewriter and guard with IfThenElse(condition_, vectorized body, scalar fallback), or
- drop condition_ from the plan to avoid dead state.
I can provide a guarded rewrite sketch if you want to support conditional (tail) vectorization for dynamic shapes.
Also applies to: 221-223
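The first option above amounts to an IfThenElse guard around the vectorized body. The sketch below models it in plain Python (all names are hypothetical; this is not the pass's API), falling back to the scalar loop when the planner condition does not hold:

```python
def guarded_atomic_add(dst, src, condition, vector_size):
    """Model of IfThenElse(condition_, vectorized body, scalar fallback)."""
    n = len(src)
    if condition:  # stands in for the planner-recorded condition_
        # Vectorized path: process vector_size contiguous lanes per step.
        for base in range(0, n, vector_size):
            for lane in range(vector_size):
                dst[base + lane] += src[base + lane]
    else:
        # Scalar fallback: always safe for dynamic/tail shapes.
        for i in range(n):
            dst[i] += src[i]
    return dst

src = [1.0] * 8
# Both paths accumulate the same result; only the access pattern differs.
assert guarded_atomic_add([0.0] * 8, src, True, 4) == src
assert guarded_atomic_add([0.0] * 8, src, False, 4) == src
```

The design point is that the guard makes the vectorized path opt-in per launch, so dynamic shapes that fail the divisibility condition still get a correct (if slower) scalar loop.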
Lines 247-279: Add a regression test for the original OOB scenario

Given the fix specifically targets OOB with atomic add auto-vectorization when thread count exceeds loop extent, consider adding a test that:
- constructs such a schedule,
- triggers vectorization to 2/4,
- verifies no OOB and correct results.
I can help draft a minimal TIR or Tile schedule reproducer if you share your preferred test harness.
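As a starting point, the OOB scenario itself can be modeled in pure Python (no TileLang API assumed; all values are hypothetical): many threads, a small loop extent, and the new contiguous mapping should touch only valid offsets and accumulate each element exactly once:

```python
# Model the fixed mapping: each active thread writes vector_size
# contiguous lanes starting at base = tx * vector_size.
extent = 8          # hypothetical innermost extent
vector_size = 4
num_threads = 128   # deliberately larger than the loop extent

buffer = [0] * extent
oob = 0
active_threads = extent // vector_size  # only these have work to do
for tx in range(num_threads):
    if tx >= active_threads:
        continue  # threads beyond the extent are masked off
    base = tx * vector_size  # tx_var * vector_size_ from the new vmap
    for lane in range(vector_size):
        idx = base + lane
        if 0 <= idx < extent:
            buffer[idx] += 1
        else:
            oob += 1

assert oob == 0                  # no out-of-bounds accesses
assert buffer == [1] * extent    # every element updated exactly once
```

A real regression test would build the corresponding TIR schedule and run the pass, but the same two assertions (no OOB, correct totals) are what it should check.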
🧰 Additional context used
🪛 GitHub Actions: CI
src/transform/atomicadd_vectorize.cc
[error] 1-1: Tile-lang formatting check failed. Command: './format.sh'. Reformatted files were detected and not staged for commit. Please review and stage the changes (git add src/transform/atomicadd_vectorize.cc) before committing.
🪛 GitHub Actions: CI Test on AMD
src/transform/atomicadd_vectorize.cc
[error] 1-1: Reformatted file 'src/transform/atomicadd_vectorize.cc' by formatting tools (via ./format.sh). Changes not staged for commit; please review and stage the changes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: bot-task
🔇 Additional comments (2)
src/transform/atomicadd_vectorize.cc (2)
Lines 173-176: Simplified vmap for tx/loop-var scaling looks correct and safer

Mapping tx and the loop var to contiguous vector-sized chunks is a clear improvement and should prevent the previous out-of-bounds scenarios tied to modulo/division remapping. The substitution feeds a contiguous base to AtomicAddx2/x4, aligning with the intended vector lanes.
Line 1: Formatting script failure: missing pip

It looks like ./format.sh is failing at line 23 due to a missing pip command, so the file wasn't reformatted or staged. Please:

- Install Python's package installer (e.g. python3-pip) so that pip is available.
- Re-run ./format.sh and confirm it completes without errors.
- Verify no unstaged changes remain for src/transform/atomicadd_vectorize.cc:

git status --porcelain
git diff -- src/transform/atomicadd_vectorize.cc || true

- Stage and commit the reformatted file:

git add src/transform/atomicadd_vectorize.cc
git commit -m "Format src/transform/atomicadd_vectorize.cc"
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/transform/atomicadd_vectorize.cc (1)
Lines 173-176: The simplified vmap removes the truncdiv/truncmod pitfalls and fixes the OOB risk.

Mapping loop_var → outer_var * vector_size_ and scaling tx by vector_size_ yields aligned bases for AtomicAddxN and eliminates the prior growth of the loop index with thread count. Nice cleanup.
🧹 Nitpick comments (1)
src/transform/atomicadd_vectorize.cc (1)
Lines 173-176: Confirm that globally scaling tx won't change non-index semantics; consider narrowing the substitution.

Substituting tx_var -> tx_var * vector_size_ is applied to the entire body. If tx participates in non-indexing logic (e.g., guards, shared mem indexing, or sync patterns), scaling could alter semantics. Safer alternative: restrict scaling to the address computations for the AtomicAdd's src/dst indices in this innermost loop, or gate the scaling behind a planner-derived check that tx contributes only to the last-dim linear offset.

If this is already guaranteed by IndiceCanVectorize/your scheduling constraints, please confirm.
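The concern can be illustrated with a toy guard: under a body-wide substitution tx -> tx * vector_size, a predicate that is not an address computation changes meaning (values below are hypothetical):

```python
vector_size = 4
num_threads = 32

# A non-index guard such as `if tx < 8: ...` selects 8 threads.
before = [tx for tx in range(num_threads) if tx < 8]
# After a body-wide substitution tx -> tx * vector_size, the same
# guard becomes `tx * 4 < 8` and selects only 2 threads.
after = [tx for tx in range(num_threads) if tx * vector_size < 8]

assert len(before) == 8
assert len(after) == 2  # the guard is no longer equivalent
```

This is why restricting the substitution to the atomic-add address expressions (or proving tx only feeds the last-dim offset) is the safer design.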
Docstrings generation was requested by @LeiWang1999. * #742 (comment) The following files were modified: * `src/transform/atomicadd_vectorize.cc`
Note: Generated docstrings for this pull request at #745
* 📝 Add docstrings to `main` Docstrings generation was requested by @LeiWang1999. * #742 (comment) The following files were modified: * `src/transform/atomicadd_vectorize.cc` * lint fix --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <[email protected]>
Summary by CodeRabbit
Refactor
Performance
Bug Fixes