Conversation

csarofeen (Owner)

Fixed instances where our vectorized support wasn't actually generating vectorized SASS. Tried to make the usage of Array in allocation more explicit, rather than dynamically casting to it.

@csarofeen requested review from naoyam and shmsong, February 10, 2022 19:36
@naoyam (Collaborator) left a comment:


LGTM. Added some questions and comments.

loadGeneric<scalar_t, vec_size>(to, from);
break;
case 8: {
uint2 const data = *reinterpret_cast<uint2 const*>(from);
Collaborator:

Just out of curiosity, is this copy necessary for using the inline assembly?

csarofeen (Owner, Author):

I think we might be able to use a const&, but the reinterpret_cast is necessary.

auto out_tv = uop->out()->as<kir::TensorIndex>()->view();
if (uop->in()->isScalar()) {
if (out_tv->getMemoryType() == MemoryType::Local) {
// Vectorized intiialization
Collaborator:

typo: initialization

}

template <typename scalar_t, int vec_size>
__device__ void loadLocalToGlobal(scalar_t* to, scalar_t* from) {
Collaborator:

Do we also need to have variations for shared mem? Or does the compiler properly take care of those cases?

csarofeen (Owner, Author):

I think we'll probably want shared memory versions, maybe ldg.sts versions; it's open to expansion for sure. I don't have a case where smem doesn't do the right thing, but we're really not using that path much at the moment.

@shmsong (Feb 10, 2022):

I have seen auto-vectorization on smem accesses but haven't yet seen any de-vectorization, so I guess loadGeneric should be good for a while.

Also, a smem pointer would require a cvta instruction and an extra register for its output, which I thought might affect register allocation if we use asm to do it.

Comment on lines 1307 to 1314
if (alias_tv->getMemoryType() == MemoryType::Local &&
va.find(alias_tv) != va.end()) {
indent() << "auto& " << varName(tv) << " = " << varName(alias_tv)
<< ";\n";
} else {
indent() << buffer_dtype << "* " << varName(tv) << " = "
<< varName(alias_tv) << ";\n";
}
Collaborator:

Some comment describing why the local and vectorized case needs the special handling would be great.

I assume we could always generate auto& .... Is there any reason to prefer the pointer style of code?

csarofeen (Owner, Author):

I don't think there's any explicit benefit; it helps me look through the code quickly just because I know the difference, but agreed, we could just do auto&.

@shmsong left a comment:

Overall looks good to me. Some minor discussions.


// Used for vectorized allocations that are not in registers
template <typename scalar_t, int vec_size>
void arraySet(scalar_t* buff, scalar_t val) {
shmsong:

Would we need specializations for vectorized initialization? We could also rely on the compiler's auto-vectorization pass.

csarofeen (Owner, Author):

I didn't explicitly check, as it's unlikely to be a perf bottleneck for memory-bound ops. If it's necessary we could definitely do it.

shmsong:

Sure. I guess it would be necessary if we have bank conflicts in initialization or are limited by the instruction cache. I will add them if I see anything limited by that.

return false;
}

// Shared memory is all aligned to 128 bits, local memory might not be
shmsong:

Just curious, do we already have 128b alignment for all smem, or is that in a different PR?

csarofeen (Owner, Author):

I thought we already align it; is that not the case?

shmsong:

Looks like TOT aligns to data type size:

https://github.com/csarofeen/pytorch/blob/devel/torch/csrc/jit/codegen/cuda/codegen.cpp#L1293-L1308

I modified the static version in #1439.

}

// Shared memory is all aligned to 128 bits, local memory might not be
if (this_tv->getMemoryType() == MemoryType::Local &&
shmsong:

We might want to move this to line 922. This seems to apply to outer sharing as well.

csarofeen (Owner, Author):

@shmsong can you double check I did this right?

@csarofeen merged commit 44e8c15 into devel, Feb 11, 2022
@csarofeen deleted the vectorize_rework branch, May 7, 2022 23:52