Conversation

Contributor

@yyttt6 commented Aug 20, 2025

Summary by CodeRabbit

  • Refactor

    • Simplified innermost loop vectorization mapping by replacing complex alignment math with a direct scaling approach; internal substitutions updated while public APIs remain unchanged.
  • Performance

    • Potential runtime improvements in vectorized kernels due to reduced index-computation overhead.
  • Bug Fixes

    • Eliminates edge-case misalignments that could occur under the previous alignment logic.

Contributor

coderabbitai bot commented Aug 20, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Refactors innermost-loop vectorization in src/transform/atomicadd_vectorize.cc by replacing truncmod/truncdiv-based index substitutions with direct scaling for tx_var and loop_var. Body substitution now uses the updated vmap. No public APIs changed; outer loop bounds and vector-size checks remain unchanged.

Changes

Cohort: Vectorization rewrite
File(s): src/transform/atomicadd_vectorize.cc
Summary: Replaced modulo/division-based index mappings with scaled mappings (tx_var → tx_var * vector_size_, fnode->loop_var → outer_var * vector_size_). Updated the body substitution to use the new vmap. Retained outer bounds and dynamic vector-size checks. No API/signature changes.
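
For orientation, the core of the change boils down to the following shape (a sketch reconstructed from the diff hunk; the surrounding rewriter code is abbreviated):

// Inside AtomicAddVectorizeRewriter's innermost-loop rewrite (sketch, context abbreviated).
// Each thread now addresses a contiguous chunk of vector_size_ elements,
// so its base index is obtained by direct scaling:
vmap.Set(tx_var, tx_var * vector_size_);
// The old inner loop variable is rebased onto the new outer loop variable,
// again by scaling; this replaces the previous truncmod/truncdiv remapping:
vmap.Set(fnode->loop_var, outer_var * vector_size_);
// The loop body is then substituted with the updated map:
Stmt body = Substitute(fnode->body, vmap);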

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant IR as Input IR (loops)
  participant Pass as AtomicAddVectorize
  participant Map as Substitution Map (vmap)
  participant IRp as Transformed IR

  IR->>Pass: Visit innermost loop
  Pass->>Map: Build vmap
  Note right of Map #DDEFEF: tx_var = tx_var * vector_size_\nloop_var = outer_var * vector_size_\n(replaces truncmod/truncdiv)
  Pass->>IRp: Substitute loop body using vmap
  Pass-->>IRp: Preserve outer bounds and vector-size checks

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I twitch my whiskers, code aligned,
Mod and div now left behind.
Scaled steps drum a vector beat,
Loops march tidy on their feet.
I nibble bits and hop along—🥕


Contributor

@gemini-code-assist bot left a comment

Summary of Changes

Hello @yyttt6, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and fixes an out-of-bounds memory access error that could occur during the atomic add auto-vectorization pass. The core of the fix involves simplifying the index calculation logic within the AtomicAddVectorizeRewriter to ensure that memory addresses are correctly computed during vectorized operations, thereby preventing erroneous access.

Highlights

  • Bug Fix: Corrected an out-of-bounds memory access error occurring during the atomic add auto-vectorization process.
  • Vectorization Logic Refinement: Simplified the variable mapping within the AtomicAddVectorizeRewriter by removing complex truncmod and truncdiv operations, ensuring accurate memory addressing for vectorized atomic additions.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request provides a crucial fix for an out-of-bounds memory access bug within the atomic add auto-vectorization logic. The previous variable remapping was complex and could generate invalid indices. The updated mapping is simpler and more robust, ensuring that loop variables remain within their expected bounds. This is a solid improvement for correctness and stability.

Comment on lines +173 to +174
vmap.Set(tx_var, tx_var * vector_size_);
vmap.Set(fnode->loop_var, outer_var * vector_size_);

Severity: high

This simplified variable mapping is a great improvement. The previous logic for remapping fnode->loop_var was prone to out-of-bounds errors when the number of threads exceeded the loop's extent, because truncdiv(tx_var, extent / vector_size_) could grow too large.

By removing the dependency on tx_var for the loop variable and using a simpler mapping, you've eliminated this bug and made the transformation more robust.
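
To make the failure mode concrete, here is a quick numeric illustration (numbers chosen for illustration only, not taken from a specific kernel):

// Illustrative numbers: extent = 8, vector_size_ = 4, 128 threads.
// Old mapping:  loop_var -> truncdiv(tx_var, extent / vector_size_) = truncdiv(tx_var, 2)
//               for tx_var = 127 this evaluates to 63, far beyond the loop extent of 8.
// New mapping:  loop_var -> outer_var * vector_size_
//               outer_var ranges over [0, extent / vector_size_) = [0, 2), so the
//               substituted index stays within [0, 8) by construction.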

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/transform/atomicadd_vectorize.cc (1)

173-176: Use the known thread_var instead of scanning for "tx"; optionally fold constants post-substitution.

Two small polish items:

  • Relying on name_hint == "tx" is brittle. You already have thread_var in VectorizeAtomicAdd; pass it into the rewriter and use it directly in vmap.
  • After substitution, folding trivial arithmetic (e.g., outer_var * 2 inside indices) would help, though later passes may handle it.

You can pass the thread var into the rewriter and avoid the name-based lookup:

// Constructor
class AtomicAddVectorizeRewriter : public StmtExprMutator {
 public:
  AtomicAddVectorizeRewriter(AtomicAddVectorizePlanResult plan, Var thread_var)
      : vector_size_(plan.vector_size),
        condition_(plan.condition),
        dynamic_(plan.dynamic),
        thread_var_(std::move(thread_var)) {}
 private:
  Var thread_var_;
  ...
};

// Call site
auto rewriter = AtomicAddVectorizeRewriter(res, thread_var);

// In VisitStmt_ where vmap is built
// Replace the tx discovery block with:
Var tx_var = thread_var_;
// And keep:
vmap.Set(tx_var, tx_var * vector_size_);
vmap.Set(fnode->loop_var, outer_var * vector_size_);

Optionally, if you want immediate simplification after substitution:

Stmt body = Substitute(fnode->body, vmap);
// (Optional) If you have an Analyzer at hand:
// arith::Analyzer analyzer;
// body = analyzer.Simplify(body);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 72be490 and 1d8ddb7.

📒 Files selected for processing (1)
  • src/transform/atomicadd_vectorize.cc (1 hunks)
🔇 Additional comments (1)
src/transform/atomicadd_vectorize.cc (1)

173-175: Extern definitions for AtomicAddx2 and AtomicAddx4 verified

Search results confirm that both intrinsics are defined and linked:

  • AtomicAddx2 (TL_DEVICE overloads for half_t, bfloat16_t, and float) in src/tl_templates/cuda/common.h (lines 164–190)
  • AtomicAddx4 (TL_DEVICE for float4) in src/tl_templates/cuda/common.h (lines 191–195)
  • Python-side wrappers in tilelang/language/customize.py invoke these externs via T.call_extern
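
For readers unfamiliar with these intrinsics, the float overload has roughly the following shape (an illustrative sketch only; the actual signature and implementation in common.h may differ, e.g. by using packed vector types):

// Illustrative sketch, not the actual definition from src/tl_templates/cuda/common.h.
// Assumes a (dst, src) pointer-pair signature.
TL_DEVICE void AtomicAddx2(float *dst, float *src) {
  atomicAdd(dst + 0, src[0]);  // CUDA's built-in float atomicAdd
  atomicAdd(dst + 1, src[1]);
}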

Vector index mapping fix remains correct and there are no unresolved symbol risks at runtime. Ready to merge.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/transform/atomicadd_vectorize.cc (1)

119-124: Guard against vector_size_ dropping to 0 in refinement loop

If IndiceCanVectorize(...) returns false even for 1, the current loop halves to 0 and can keep spinning or pass 0 into the predicate. Add a lower-bound guard and early-exit fallback to “no vectorization.”

Apply this diff:

-      while (!IndiceCanVectorize(elem_offset, thread_var, thread_extent,
-                                 vector_size_, &analyzer_)) {
-        vector_size_ /= 2;
-      }
+      while (vector_size_ > 1 &&
+             !IndiceCanVectorize(elem_offset, thread_var, thread_extent,
+                                 vector_size_, &analyzer_)) {
+        vector_size_ /= 2;
+      }
+      // If still not vectorizable at width 1, keep it scalar (no vectorization).
+      if (vector_size_ == 1 &&
+          !IndiceCanVectorize(elem_offset, thread_var, thread_extent,
+                              vector_size_, &analyzer_)) {
+        return;
+      }
🧹 Nitpick comments (4)
src/transform/atomicadd_vectorize.cc (4)

162-171: Avoid hard-coding the thread var name "tx"

Relying on a name-hint match is brittle. Prefer the bound thread axis from the For’s thread_binding (if present), and only fall back to scanning. This makes the transform robust to different naming (e.g., “threadIdx.x”, “tx0”, etc.).

Apply this diff:

-        Var tx_var;
-        PostOrderVisit(fnode->body, [&tx_var](const ObjectRef &node) {
-          if (const VarNode *var = node.as<VarNode>()) {
-            if (var->name_hint == "tx") {
-              tx_var = GetRef<Var>(var);
-            }
-          }
-        });
-        ICHECK(tx_var.defined()) << "Failed to find tx var";
+        Var tx_var;
+        if (fnode->thread_binding.defined()) {
+          tx_var = fnode->thread_binding.value()->var;
+        }
+        if (!tx_var.defined()) {
+          PostOrderVisit(fnode->body, [&tx_var](const ObjectRef &node) {
+            if (const VarNode *var = node.as<VarNode>()) {
+              if (var->name_hint == "tx") {
+                tx_var = GetRef<Var>(var);
+              }
+            }
+          });
+        }
+        ICHECK(tx_var.defined()) << "Failed to locate thread var (thread_binding or 'tx')";

90-93: Comment contradicts behavior; clarify intent

The comment says we “disable this GCD optimization,” but the code still applies GCD with the loop extent. Update the comment to reflect the conservative GCD intersection actually used.

Apply this diff:

-    // i // 2, i % 8 can also be vectorized as factor 16
-    // so we should disable this GCD optimization
+    // Patterns like i // 2 and i % 8 could allow larger vector factors,
+    // but we conservatively intersect with the loop extent via GCD.

125-129: Dynamic path computes condition_ but it’s unused in rewriting

Planner records condition_ for dynamic shapes, but the rewriter returns the original loop when dynamic_ is true and never consults condition_. Either:

  • plumb condition_ into the rewriter and guard with IfThenElse(condition_, vectorized body, scalar fallback), or
  • drop condition_ from the plan to avoid dead state.

I can provide a guarded rewrite sketch if you want to support conditional (tail) vectorization for dynamic shapes.
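
For illustration, such a guard could take roughly this shape (a sketch only; it assumes condition_ is the planner's PrimExpr and that the new outer loop variable and vmap are already in scope):

// Sketch: take the vectorized body only when condition_ holds, and keep the
// original scalar loop as the fallback for the dynamic tail.
Stmt vectorized_body = Substitute(fnode->body, vmap);
Stmt vectorized_loop =
    For(outer_var, 0, truncdiv(fnode->extent, vector_size_), ForKind::kSerial,
        vectorized_body);
Stmt scalar_fallback = GetRef<Stmt>(fnode);  // original, unvectorized loop
return IfThenElse(condition_, vectorized_loop, scalar_fallback);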

Also applies to: 221-223


247-279: Add a regression test for the original OOB scenario

Given the fix specifically targets OOB with atomic add auto-vectorization when thread count exceeds loop extent, consider adding a test that:

  • constructs such a schedule,
  • triggers vectorization to 2/4,
  • verifies no OOB and correct results.

I can help draft a minimal TIR or Tile schedule reproducer if you share your preferred test harness.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1d8ddb7 and 562c6f2.

📒 Files selected for processing (1)
  • src/transform/atomicadd_vectorize.cc (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: CI
src/transform/atomicadd_vectorize.cc

[error] 1-1: Tile-lang formatting check failed. Command: './format.sh'. Reformatted files were detected and not staged for commit. Please review and stage the changes (git add src/transform/atomicadd_vectorize.cc) before committing.

🪛 GitHub Actions: CI Test on AMD
src/transform/atomicadd_vectorize.cc

[error] 1-1: Reformatted file 'src/transform/atomicadd_vectorize.cc' by formatting tools (via ./format.sh). Changes not staged for commit; please review and stage the changes.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: bot-task
🔇 Additional comments (2)
src/transform/atomicadd_vectorize.cc (2)

173-176: Simplified vmap for tx/loop-var scaling looks correct and safer

Mapping tx and the loop var to contiguous vector-sized chunks is a clear improvement and should prevent the previous out-of-bounds scenarios tied to modulo/division remapping. The substitution feeds a contiguous base to AtomicAddx2/x4, aligning with the intended vector lanes.


1-1: Formatting script failure: missing pip

It looks like ./format.sh is failing at line 23 due to a missing pip command, so the file wasn’t reformatted or staged. Please:

  • Install Python’s package installer (e.g. python3-pip) so that pip is available.
  • Re-run ./format.sh and confirm it completes without errors.
  • Verify no unstaged changes remain for src/transform/atomicadd_vectorize.cc:
    git status --porcelain
    git diff -- src/transform/atomicadd_vectorize.cc || true
  • Stage and commit the reformatted file:
    git add src/transform/atomicadd_vectorize.cc
    git commit -m "Format src/transform/atomicadd_vectorize.cc"

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/transform/atomicadd_vectorize.cc (1)

173-176: The simplified vmap removes the truncdiv/truncmod pitfalls and fixes the OOB risk.

Mapping loop_var → outer_var * vector_size_ and scaling tx by vector_size_ yields aligned bases for AtomicAddxN and eliminates the prior growth of the loop index with thread count. Nice cleanup.

🧹 Nitpick comments (1)
src/transform/atomicadd_vectorize.cc (1)

173-176: Confirm that globally scaling tx won’t change non-index semantics; consider narrowing the substitution.

Substituting tx_var -> tx_var * vector_size_ is applied to the entire body. If tx participates in non-indexing logic (e.g., guards, shared mem indexing, or sync patterns), scaling could alter semantics. Safer alternative: restrict scaling to the address computations for the AtomicAdd’s src/dst indices in this innermost loop, or gate the scaling behind a planner-derived check that tx contributes only to the last-dim linear offset.

If this is already guaranteed by IndiceCanVectorize/your scheduling constraints, please confirm.
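
If narrowing ever becomes necessary, one way to scope the substitution to the extern call's arguments would be a small mutator along these lines (a sketch with a hypothetical helper name; the current pass keeps a single whole-body Substitute):

// Sketch: substitute only inside call_extern argument expressions (the
// AtomicAddx2/x4 address computations), leaving other uses of tx untouched.
// AtomicAddArgRewriter is a hypothetical name; a real version would also
// check that the extern being called is AtomicAddx2/AtomicAddx4.
// Needs <tvm/tir/stmt_functor.h> and <tvm/tir/builtin.h>.
class AtomicAddArgRewriter : public StmtExprMutator {
 public:
  explicit AtomicAddArgRewriter(Map<Var, PrimExpr> vmap) : vmap_(std::move(vmap)) {}

 private:
  PrimExpr VisitExpr_(const CallNode *op) final {
    if (op->op.same_as(builtin::call_extern())) {
      Array<PrimExpr> new_args;
      for (const PrimExpr &arg : op->args) {
        new_args.push_back(Substitute(arg, vmap_));
      }
      return Call(op->dtype, op->op, new_args);
    }
    return StmtExprMutator::VisitExpr_(op);
  }

  Map<Var, PrimExpr> vmap_;
};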

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 562c6f2 and a1c7663.

📒 Files selected for processing (1)
  • src/transform/atomicadd_vectorize.cc (1 hunks)

@LeiWang1999 merged commit ce7b932 into tile-ai:main Aug 20, 2025
5 of 6 checks passed
coderabbitai bot added a commit that referenced this pull request Aug 20, 2025
Docstrings generation was requested by @LeiWang1999.

* #742 (comment)

The following files were modified:

* `src/transform/atomicadd_vectorize.cc`
Contributor

coderabbitai bot commented Aug 20, 2025

Note

Generated docstrings for this pull request at #745

LeiWang1999 added a commit that referenced this pull request Aug 21, 2025
* 📝 Add docstrings to `main`

Docstrings generation was requested by @LeiWang1999.

* #742 (comment)

The following files were modified:

* `src/transform/atomicadd_vectorize.cc`

* lint fix

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <[email protected]>