Add intrinsic for dynamic shared memory #146181
```diff
@@ -1716,6 +1716,8 @@ pub struct AddressSpace(pub u32);
 impl AddressSpace {
     /// LLVM's `0` address space.
     pub const ZERO: Self = AddressSpace(0);
+    /// The address space for shared memory on nvptx and amdgpu.
+    pub const SHARED: Self = AddressSpace(3);
 }

 /// The way we represent values to the backend
```

Review thread on the new `SHARED` constant:

- This should have a name that makes it more clear that it is GPU-specific.
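For readers outside the compiler, the constant above is a plain newtype over `u32` naming an LLVM address-space number. The following standalone sketch (a copy for illustration only, not the real `rustc_abi` crate) shows the values involved:

```rust
// Standalone model of the AddressSpace newtype from the diff above.
// LLVM identifies address spaces by number; 3 is the shared (nvptx) /
// local (amdgpu) memory space on both GPU backends.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
pub struct AddressSpace(pub u32);

impl AddressSpace {
    /// LLVM's `0` (generic) address space.
    pub const ZERO: Self = AddressSpace(0);
    /// The address space for shared memory on nvptx and amdgpu.
    pub const SHARED: Self = AddressSpace(3);
}

fn main() {
    assert_eq!(AddressSpace::ZERO.0, 0);
    assert_eq!(AddressSpace::SHARED.0, 3);
    println!("ok");
}
```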
```diff
@@ -1,7 +1,9 @@
 use std::assert_matches::assert_matches;
 use std::cmp::Ordering;

-use rustc_abi::{Align, BackendRepr, ExternAbi, Float, HasDataLayout, Primitive, Size};
+use rustc_abi::{
+    AddressSpace, Align, BackendRepr, ExternAbi, Float, HasDataLayout, Primitive, Size,
+};
 use rustc_codegen_ssa::base::{compare_simd_types, wants_msvc_seh, wants_wasm_eh};
 use rustc_codegen_ssa::codegen_attrs::autodiff_attrs;
 use rustc_codegen_ssa::common::{IntPredicate, TypeKind};
```
```diff
@@ -532,6 +534,22 @@ impl<'ll, 'tcx> IntrinsicCallBuilderMethods<'tcx> for Builder<'_, 'll, 'tcx> {
                 return Ok(());
             }

+            sym::dynamic_shared_memory => {
+                let global = self.declare_global_in_addrspace(
+                    "dynamic_shared_memory",
+                    self.type_array(self.type_i8(), 0),
+                    AddressSpace::SHARED,
+                );
```

Review thread on the hard-coded `"dynamic_shared_memory"` name:

- Is there a special magic meaning to this hard-coded name for the global? When linking together Rust code and code in other languages, is this the name everyone must use consistently to ensure things behave correctly?
- The name has no meaning.
- That seems very unusual, so if the intrinsic impl relies on this there should be a comment referring to the place where this is documented.
Review thread on lines +537 to +542:

- Hm. I'm not sure this is the correct design for this, because it makes the calls to an intrinsic alter global program state during compilation, which feels very dicey and assumes that the calls to that intrinsic will experience codegen. I'm not sure how substantial this concern is, however, so I'm willing to be persuaded.
- Ah, good point... if you call this function in dead code, it may or may not have any effect. How do other languages handle controlling the alignment of this magic global?
- To be fair, even if it is called, it does not have any effect. It's just a getter for a pointer that exists anyway. It does not change anything if it's dead-code-eliminated.
- We're talking about the funky behavior where DCE'ing the intrinsic removes its compile-time effect on the alignment.
- Ah right, I missed that. I think it would still do the correct thing as documented. All calls that are not eliminated get a pointer that's sufficiently aligned for their type.
- I think I can answer this now. You define an extern global of some type, and the alignment of that type is used. Unused extern globals can of course be removed/ignored. (This matters more when some of the globals are only used by some of the kernels defined in a program. A kernel only gets the alignment specified by globals it uses.)
- Hmm. I guess I'm worrying about someone calling it with
- That'd be wrong, yeah, but it's an unsafe operation after all, so it's not surprising that you can use it wrong. Do you have another API in mind that would make this kind of mistake less likely?
- I'm sort of wondering if this should resemble thread-local statics in how it is declared, instead? So that program elements collaborating on accesses to this memory can reuse the same type.
- It seems like there's only a single dynamic groupshared allocation, though, that's aliased by everything in a group, which is not at all how statics work.
```diff
+
+                let ty::RawPtr(inner_ty, _) = result.layout.ty.kind() else { unreachable!() };
+                let alignment = self.align_of(*inner_ty).bytes() as u32;
+                unsafe {
+                    if alignment > llvm::LLVMGetAlignment(global) {
+                        llvm::LLVMSetAlignment(global, alignment);
+                    }
+                }
```
Review thread on updating the global's alignment:

- What does this do when there are multiple translation units? It seems like each would declare its own version of the global, potentially with a different alignment?
- All of them return the same pointer.
- By which magic does that happen? That's not usually how LLVM globals work, AFAIK.
- That would be rather surprising to me; usually a global can only be declared in one translation unit and then must be imported in all the others. Your current implementation might end up with a separate symbol for each TU, I am not sure. EDIT: Ah, I didn't realize these are
- How does that work in rustc, where each crate is a separate TU?
- Yes, the key is that they are extern. I agree it would not make sense for non-extern globals. We do (fat/full) LTO: all crates compile to bitcode, then that bitcode is linked and compiled as one LLVM IR module.
- Usually an alignment attribute on an extern global encodes an assumption: it's a promise to LLVM that someone else has aligned the global sufficiently. It looks like this relies on some sort of magic handshake where whatever component actually creates the global will honor whatever alignment you can find in the bitcode? That's pretty fragile. Usually in LLVM it's always correct to reduce the alignment attribute on such declarations; it just means we know less about what the real alignment is. (This is in contrast to alignment annotations on definitions, which define the real alignment, so it can be safely increased but not reduced.) Seems like some system here is (ab)using LLVM attributes in ways that break some of their core properties, or am I misunderstanding something? Is this magic handshake documented anywhere?
- (As mentioned elsewhere in the PR.) This is where some of that handshake is implemented.
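To make the "largest surviving alignment wins" behavior discussed above concrete, here is a small host-side model (illustration only, not compiler code; the default alignment and the specific call sites are assumptions):

```rust
use std::mem::align_of;

fn main() {
    // The global's declaration starts with some default alignment.
    let mut global_align: u32 = 1;

    // Hypothetical surviving call sites: dynamic_shared_memory::<i32>()
    // followed by dynamic_shared_memory::<f64>().
    for requested in [align_of::<i32>() as u32, align_of::<f64>() as u32] {
        // Mirrors the LLVMGetAlignment/LLVMSetAlignment logic in the diff:
        // the recorded alignment is only ever raised, never lowered.
        if requested > global_align {
            global_align = requested;
        }
    }

    // The f64 call raised the alignment to 8, matching the `align 8`
    // the codegen test at the end of this PR expects.
    assert_eq!(global_align, 8);
    println!("final align = {global_align}");
}
```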
```diff
+
+                self.cx().const_pointercast(global, self.type_ptr())
+            }

             _ if name.as_str().starts_with("simd_") => {
                 // Unpack non-power-of-2 #[repr(packed, simd)] arguments.
                 // This gives them the expected layout of a regular #[repr(simd)] vector.
```
```diff
@@ -3238,6 +3238,23 @@ pub(crate) const fn miri_promise_symbolic_alignment(ptr: *const (), align: usize
     )
 }
```
```diff
+
+/// Returns a pointer to dynamic shared memory.
```

Review thread on the doc comment:

- I know this is a short description, but it is just repeating the function signature:

  ```rust
  pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
  /*     ^^^^^^^^^^^^^^^^^^^^               ^^ ^^^^^^
         |                                  |  |
         -- "dynamic shared memory"         |  - "a pointer"
                                            - "returns"
  */
  ```

  I'm basically repeating the same comment I already made about "shared with what?", except adding that "dynamic" is also pretty vague, since in computer programming it's always relative to some notion of "static". Yet the "static" is not obviously present in this description.
```diff
+///
+/// The returned pointer is the start of the dynamic shared memory region.
+/// All pointers returned by `dynamic_shared_memory` point to the same address,
+/// so alias the same memory.
```

Review thread on lines +3244 to +3245:

- Does this mean that all
- All of them alias, independent of the type. Maybe this makes it clearer:
- It's probably worth using a more specific term ("GPU shared memory" or so), since many people reading this will think "shared memory" refers to its more standard meaning of memory shared across different processes (often set up via
- This sort of "translation guide" is not actually useful if you are not familiar with any of these things, so I would just leave it out, as it is a distraction from the actual description. Especially since it's very easy to go look up a definition of, say, OpenCL's local memory, see it referred to as "this is GLSL's shared memory", look up a definition of that and see it referred to as basically the same idea as groupshared memory in DirectX, then look up a definition of that and... you get the idea. Our definition should stand on its own.
- The translation guide might be useful to people who are familiar with these things and wondering why we are making up our own terms.
- Then I will merely continue to insist that the description should make sense without reference, to prevent infinite regress. Exceedingly fine details, of course, can be handled elsewhere by other sources, but the concepts should be usefully clear here.
- For sure, we need something reasonably concise such that a Rust compiler dev with zero GPU knowledge has a rough idea of what this does after reading the description. I don't think that's that hard; GPUs aren't that special, but they use a lot of "weird" (read: grown-over-time) terminology that presents a big barrier to entry.
```diff
+/// The returned pointer is aligned by at least the alignment of `T`.
```

Review thread on the alignment guarantee:

- Speaking of safety requirements... how does one use this pointer? I get that it is aligned, but does it point to enough memory to store a
- Typically, intrinsic documentation should be detailed enough that I can read and write code using the intrinsic and know exactly whether the code is correct and what it will do in all circumstances. I don't know if there's any hope of achieving that with GPU intrinsics, but if not, then we need to have a bit of a wider discussion: we have had bad experience with just importing "externally defined" semantics into Rust without considering all the interactions (in general, it is not logically coherent to have semantics externally defined). The current docs would let me implement this intrinsic by just always returning 1024, and emitting a compile error if
- Is there some prior discussion of the design decision to determine the alignment by giving a type parameter? It could also be a const generic parameter, for instance. I don't have an opinion on the matter, since I am an outsider to the GPU world, but as a compiler team member it'd be good to know if this is something you thought about for 5 minutes or whether there's some sort of larger design by a team that has a vision of how all these things will fit together.
- There is some discussion in #135516. I don't mind either way; I thought (for 5 minutes ;)) that specifying the type of the returned pointer makes sense. For just a struct, static shared memory would make more sense, though we don't support that yet (there's some discussion in the tracking issue, but I think that's more complicated to design and implement).
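Since the guaranteed alignment is derived from the intrinsic's type parameter via the language's alignment rules (the codegen above calls `self.align_of(*inner_ty)`), a few concrete values illustrate what different instantiations would request. This is ordinary host Rust, just to show the numbers:

```rust
use std::mem::align_of;

// Alignments that hypothetical calls such as dynamic_shared_memory::<T>()
// would request for the shared-memory global, per type parameter T.
fn main() {
    assert_eq!(align_of::<u8>(), 1);
    assert_eq!(align_of::<i32>(), 4);
    assert_eq!(align_of::<f64>(), 8);
    // Arrays align like their element type, so a [f64; N] view requests 8.
    assert_eq!(align_of::<[f64; 10]>(), 8);
    println!("ok");
}
```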
```diff
+///
+/// # Other APIs
+///
+/// CUDA and HIP call this shared memory.
+/// OpenCL and SYCL call this local memory.
+#[rustc_intrinsic]
+#[rustc_nounwind]
+#[unstable(feature = "dynamic_shared_memory", issue = "135513")]
+#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
+pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
```

Review thread on the intrinsic's name:

- Note that outside the GPU world, "shared memory" typically refers to memory shared between processes. So I would suggest using a name that's less likely to be confused, like something that explicitly involves "GPU" or so. This sounds like a form of "global" memory (similar to a
- Does it make sense to add a
- Or should it be in
- Rust intrinsic names are not namespaced. They are exposed in a module, but inside the compiler they are identified entirely by their name. So moving them into a different module doesn't alleviate the need for a clear name that will be understandable to non-GPU people working in the compiler (which is the vast majority of compiler devs). If there's more GPU intrinsics to come, moving them into a
- I don't have a strong opinion on how the eventually stable public API is organized; I am commenting entirely as someone who has an interest in keeping the set of intrinsics the Rust compiler offers understandable and well-defined (the ones in this folder, not the ones in
- The
```diff
+
 /// Copies the current location of arglist `src` to the arglist `dst`.
 ///
 /// FIXME: document safety requirements
```
```diff
@@ -0,0 +1,27 @@
+// Checks that dynamic_shared_memory works.
+
+//@ revisions: amdgpu nvptx x86
+//@ compile-flags: --crate-type=rlib
+//
+//@ [amdgpu] compile-flags: --target amdgcn-amd-amdhsa -Ctarget-cpu=gfx900
+//@ [amdgpu] only-amdgpu
+//@ [amdgpu] needs-llvm-components: amdgpu
+//@ [nvptx] compile-flags: --target nvptx64-nvidia-cuda
+//@ [nvptx] only-nvptx64
+//@ [nvptx] needs-llvm-components: nvptx
+//@ [x86] compile-flags: --target x86_64-unknown-linux-gnu
+//@ [x86] only-x86_64
```

Review thread on the x86 revision:

- This is not enough when running on x86_64 Windows. You did need
```diff
+//@ [x86] needs-llvm-components: x86
+//@ [x86] should-fail
+#![feature(core_intrinsics, dynamic_shared_memory)]
+#![no_std]
```

Review thread on the test setup:

- Would using mini-core instead of the precompiled core in the sysroot work instead of hard-coding
- I don't think hardcoding gfx900 as the target will work at all; different uarches can have wildly different codegen'd binaries and are usually not backwards compatible. I think this is a case where we want the user to specify the target architecture that they are compiling for. This would probably work for smoke testing, but in the long term it's probably going to cause more pains than it's worth.
- Given that this is just about local testing that the compiler works at all, none of the binary compatibility concerns really matter to us. This code is not executed. But I agree that we should probably just use the minicore for this instead of adding code to the build system that we might have to rip out later. It's easy to update tests... unless they are built on a fragile assumption in the build system.
- I could probably duplicate the intrinsic in minicore, but wouldn't that defeat the purpose of this test? Sure, that's what the target enforces :) Unfortunately it's the exact thing that throws a wrench into writing tests for the standard library.
- Duplicating the intrinsic declaration is fine; what you are testing is the implementation, and that is not being duplicated.
```diff
+
+use core::intrinsics::dynamic_shared_memory;
+
+// CHECK: @dynamic_shared_memory = external addrspace(3) global [0 x i8], align 8
+// CHECK: ret ptr addrspacecast (ptr addrspace(3) @dynamic_shared_memory to ptr)
+pub fn fun() -> *mut i32 {
+    let res = dynamic_shared_memory::<i32>();
+    dynamic_shared_memory::<f64>(); // Increase alignment to 8
+    res
+}
```
Review thread on the phrase "shared memory":

- "shared" between whomst?
- The concept is called "shared memory", so I meant it as a reference to that concept, not as an explanation ;) I'll make this more descriptive (though as it stands, we'll first need to decide on nomenclature to use throughout the Rust codebase).