
Add more SIMD platform-intrinsics #117953

Merged (1 commit) on Dec 9, 2023
52 changes: 51 additions & 1 deletion compiler/rustc_codegen_cranelift/src/intrinsics/simd.rs
@@ -1,5 +1,6 @@
//! Codegen `extern "platform-intrinsic"` intrinsics.

use cranelift_codegen::ir::immediates::Offset32;
use rustc_middle::ty::GenericArgsRef;
use rustc_span::Symbol;
use rustc_target::abi::Endian;
@@ -1008,8 +1009,57 @@ pub(super) fn codegen_simd_intrinsic_call<'tcx>(
            }
        }

        sym::simd_masked_load => {
            intrinsic_args!(fx, args => (mask, ptr, val); intrinsic);

            let (val_lane_count, val_lane_ty) = val.layout().ty.simd_size_and_type(fx.tcx);
            let (mask_lane_count, _mask_lane_ty) = mask.layout().ty.simd_size_and_type(fx.tcx);
            let (ret_lane_count, ret_lane_ty) = ret.layout().ty.simd_size_and_type(fx.tcx);
            assert_eq!(val_lane_count, mask_lane_count);
            assert_eq!(val_lane_count, ret_lane_count);

            let lane_clif_ty = fx.clif_type(val_lane_ty).unwrap();
            let ret_lane_layout = fx.layout_of(ret_lane_ty);
            let ptr_val = ptr.load_scalar(fx);

            for lane_idx in 0..ret_lane_count {
                let val_lane = val.value_lane(fx, lane_idx).load_scalar(fx);
                let mask_lane = mask.value_lane(fx, lane_idx).load_scalar(fx);

                let if_enabled = fx.bcx.create_block();
                let if_disabled = fx.bcx.create_block();
                let next = fx.bcx.create_block();
                let res_lane = fx.bcx.append_block_param(next, lane_clif_ty);

                fx.bcx.ins().brif(mask_lane, if_enabled, &[], if_disabled, &[]);
                fx.bcx.seal_block(if_enabled);
                fx.bcx.seal_block(if_disabled);

                fx.bcx.switch_to_block(if_enabled);
                let offset = lane_idx as i32 * lane_clif_ty.bytes() as i32;
                let res = fx.bcx.ins().load(
                    lane_clif_ty,
                    MemFlags::trusted(),
                    ptr_val,
                    Offset32::new(offset),
                );
                fx.bcx.ins().jump(next, &[res]);

                fx.bcx.switch_to_block(if_disabled);
                fx.bcx.ins().jump(next, &[val_lane]);

                fx.bcx.seal_block(next);
                fx.bcx.switch_to_block(next);

                fx.bcx.ins().nop();

                ret.place_lane(fx, lane_idx)
                    .write_cvalue(fx, CValue::by_val(res_lane, ret_lane_layout));
            }
        }

        sym::simd_scatter => {
-           intrinsic_args!(fx, args => (val, ptr, mask); intrinsic);
+           intrinsic_args!(fx, args => (mask, ptr, val); intrinsic);

            let (val_lane_count, _val_lane_ty) = val.layout().ty.simd_size_and_type(fx.tcx);
            let (ptr_lane_count, _ptr_lane_ty) = ptr.layout().ty.simd_size_and_type(fx.tcx);
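As a reading aid, here is a scalar sketch of what the cranelift lowering above computes per lane (my paraphrase, not code from this PR; the 4-lane f32 shape and the function name are just for illustration). Each lane gets its own if/else, which is why the lowering creates an if_enabled/if_disabled pair plus a join block per lane:

fn masked_load_scalar_semantics(mask: [i32; 4], ptr: *const f32, val: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for lane in 0..4 {
        out[lane] = if mask[lane] != 0 {
            // Enabled lane: load from `ptr` at this lane's offset.
            unsafe { *ptr.add(lane) }
        } else {
            // Disabled lane: keep the passthrough value.
            val[lane]
        };
    }
    out
}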
192 changes: 192 additions & 0 deletions compiler/rustc_codegen_llvm/src/intrinsic.rs
@@ -1492,6 +1492,198 @@ fn generic_simd_intrinsic<'ll, 'tcx>(
        return Ok(v);
    }

    if name == sym::simd_masked_load {
        // simd_masked_load(mask: <N x i{M}>, pointer: *const T, values: <N x T>) -> <N x T>
        // * N: number of elements in the input vectors
        // * T: type of the element to load
        // * M: any integer width is supported, will be truncated to i1
        // Loads contiguous elements from memory behind `pointer`, but only for
        // those lanes whose `mask` bit is enabled.
        // The memory addresses corresponding to the "off" lanes are not accessed.

        // The element type of the "mask" argument must be a signed integer type of any width
        let mask_ty = in_ty;
        let (mask_len, mask_elem) = (in_len, in_elem);

        // The second argument must be a pointer matching the element type
        let pointer_ty = arg_tys[1];

        // The last argument is a passthrough vector providing values for disabled lanes
        let values_ty = arg_tys[2];
        let (values_len, values_elem) = require_simd!(values_ty, SimdThird);

        require_simd!(ret_ty, SimdReturn);

        // Of the same length:
        require!(
            values_len == mask_len,
            InvalidMonomorphization::ThirdArgumentLength {
                span,
                name,
                in_len: mask_len,
                in_ty: mask_ty,
                arg_ty: values_ty,
                out_len: values_len
            }
        );

        // The return type must match the last argument type
        require!(
            ret_ty == values_ty,
            InvalidMonomorphization::ExpectedReturnType { span, name, in_ty: values_ty, ret_ty }
        );

        require!(
            matches!(
                pointer_ty.kind(),
                ty::RawPtr(p) if p.ty == values_elem && p.ty.kind() == values_elem.kind()
            ),
            InvalidMonomorphization::ExpectedElementType {
                span,
                name,
                expected_element: values_elem,
                second_arg: pointer_ty,
                in_elem: values_elem,
                in_ty: values_ty,
                mutability: ExpectedPointerMutability::Not,
            }
        );

        require!(
            matches!(mask_elem.kind(), ty::Int(_)),
            InvalidMonomorphization::ThirdArgElementType {
                span,
                name,
                expected_element: values_elem,
                third_arg: mask_ty,
            }
        );

        // Alignment of T, must be a constant integer value:
        let alignment_ty = bx.type_i32();
        let alignment = bx.const_i32(bx.align_of(values_ty).bytes() as i32);
Contributor Author:

I think this should be bx.align_of(values_elem).bytes() not values_ty. Technically, the LLVM intrinsic accepts any power of two as alignment, so we could relax this down to 1.

https://rust-lang.github.io/unsafe-code-guidelines/layout/packed-simd-vectors.html#packed-simd-vector-types

Am I understanding this correctly or does it not make a difference @workingjubilee?

Member:

Having checked the Arm ARM, RISCV V spec, and Intel SDM, we should continue to require element alignment at least. We can expect any masked-load-supporting implementation to support masked loads and stores on element boundaries. However, they may reject accesses on byte boundaries (except where the element is u8, of course), or they may just implement the load/store inefficiently if unaligned, which is honestly about as bad.

Member:

There are definitely reasons to do unaligned loads (deserialization, packet framing, etc) that benefit from vectorization, so I think we should at least make writing {}_unaligned versions possible. Though I can't remember, what's the requirement for scatter/gather?

Member:

Same, element alignment.

workingjubilee (Member), Dec 11, 2023:

We can adjust them to take a const parameter for alignment, if we want, I guess?

RalfJung (Member), Dec 11, 2023:

I can only guess the intent, but I am fairly sure about what Simd::load currently does. :)

If it is intended to allow entirely unaligned pointers, it should be implemented as ptr.cast::<Self>().read_unaligned(). Though I guess that would run into issues with non-power-of-2 vectors as the comment indicates, but those aren't properly supported currently anyway.

Member:

Opened rust-lang/portable-simd#382 for the load question.

Contributor Author:
The reason will be displayed to describe this comment to others. Learn more.

I will open a PR to fix the alignment passed down to LLVM IR. The optimizer converts the masked load to an unmasked, aligned load because the alignment is too high:

https://rust.godbolt.org/z/KEeGbevbb
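A sketch of the failure mode from that link (the 3-element buffer and the helper name are hypothetical; Vec4 and the intrinsic declaration are the ones from the tests further down): with vector alignment asserted, the optimizer may widen the masked load into a plain vector load that touches the very lane the mask was supposed to protect.

pub unsafe fn tail_load(data: &[f32; 3]) -> Vec4<f32> {
    // Lane 3 is masked off because there is no fourth element.
    let mask = Vec4(-1i32, -1, -1, 0);
    // If the emitted llvm.masked.load claims 16-byte alignment, LLVM may
    // legally rewrite it as a full 16-byte load that reads past `data`.
    simd_masked_load(mask, data.as_ptr(), Vec4(0.0f32, 0.0, 0.0, 0.0))
}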

Member:

> If we go with element alignment for now, does this require a code change? The documentation makes it seem like bx.align_of(values_ty) is not guaranteed to equal bx.align_of(values_elem) (implementation-defined), so this line should probably be corrected?

Correct.

I don't have opinions either way here. It's just important that we pick a rule and then apply it consistently everywhere. (Which is also IMO why we shouldn't land new intrinsics without documentation of their exact safety requirements. Just because portable SIMD intrinsics got a pass on this in the past doesn't mean we should continue doing that.)

I certainly wasn't intending on being freewheeling here, or I would not have asked a zillion small questions myself. :^) The problem is slightly of the "remembering all the questions I should be asking" nature, especially when there has already been multiple rounds of review.

I guess the main list of questions to keep in mind is:

  1. For memory (i.e. pointer) operations: What is the alignment asserted by the op? (And for SIMD types, you should almost always be picking the alignment of the element or the alignment of the vector.)
  2. For memory operations, is there a condition in which that alignment is not asserted by the intrinsic? (Also, please answer no, especially if there is any doubt about how LLVM will interpret it.)
  3. For memory operations, what happens if you operate on uninit memory? It's UB, right?
  4. For masked operations, what happens for masks derived from alternating [true, false, ..]? (i.e. "what is the normal behavior?"; see the sketch after this list)
  5. For masked operations, what happens on all-1s? Is there any special event, or does it strengthen any assertions?
  6. For masked operations, what happens on all-0s? Is there any special event, or does it weaken any assertions?
  7. For indexing operations, what happens when your index is "out-of-bounds"?
  8. Sooo... have you checked the LLVMIR for the answers to all of the above questions? How about after optimizations?
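Taking question 4 concretely, a sketch of the expected answer for the load, reusing the Vec4 and simd_masked_load declarations that appear in the tests further down (the helper name is mine):

pub unsafe fn alternating(mem: &[f32; 4]) -> Vec4<f32> {
    // Mask [true, false, true, false]: lanes 0 and 2 come from memory,
    // lanes 1 and 3 from the passthrough vector.
    let mask = Vec4(-1i32, 0, -1, 0);
    simd_masked_load(mask, mem.as_ptr(), Vec4(0.0f32, 0.0, 0.0, 0.0))
    // Result: Vec4(mem[0], 0.0, mem[2], 0.0)
}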

Member:

> For memory operations, what happens if you operate on uninit memory? It's UB, right?

Note that we do not ever treat uninit memory specially for memory operations in Rust. The UB on uninit derives entirely from the fact that uninit memory violates the validity invariant of most types. So the question that should be asked is: is this an untyped operation, working on raw abstract machine bytes, or is it a typed operation, and at which type?
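To illustrate the distinction with ordinary Rust (a generic sketch, not tied to these intrinsics):

// Untyped: copies raw abstract-machine bytes, uninit ones included.
unsafe fn untyped_copy(src: *const u8, dst: *mut u8, n: usize) {
    core::ptr::copy_nonoverlapping(src, dst, n);
}

// Typed at f32: asserts the bytes form a valid (initialized) f32, so
// reading uninit memory this way is UB.
unsafe fn typed_read(src: *const f32) -> f32 {
    *src
}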


        // Truncate the mask vector to a vector of i1s:
        let (mask, mask_ty) = {
            let i1 = bx.type_i1();
            let i1xn = bx.type_vector(i1, mask_len);
            (bx.trunc(args[0].immediate(), i1xn), i1xn)
        };

        let llvm_pointer = bx.type_ptr();

        // Type of the vector of elements:
        let llvm_elem_vec_ty = llvm_vector_ty(bx, values_elem, values_len);
        let llvm_elem_vec_str = llvm_vector_str(bx, values_elem, values_len);

        let llvm_intrinsic = format!("llvm.masked.load.{llvm_elem_vec_str}.p0");
        let fn_ty = bx
            .type_func(&[llvm_pointer, alignment_ty, mask_ty, llvm_elem_vec_ty], llvm_elem_vec_ty);
        let f = bx.declare_cfn(&llvm_intrinsic, llvm::UnnamedAddr::No, fn_ty);
        let v = bx.call(
            fn_ty,
            None,
            None,
            f,
            &[args[1].immediate(), alignment, mask, args[2].immediate()],
            None,
        );
        return Ok(v);
    }

    if name == sym::simd_masked_store {
        // simd_masked_store(mask: <N x i{M}>, pointer: *mut T, values: <N x T>) -> ()
        // * N: number of elements in the input vectors
        // * T: type of the element to store
        // * M: any integer width is supported, will be truncated to i1
        // Stores contiguous elements to memory behind `pointer`, but only for
        // those lanes whose `mask` bit is enabled.
        // The memory addresses corresponding to the "off" lanes are not accessed.

        // The element type of the "mask" argument must be a signed integer type of any width
        let mask_ty = in_ty;
        let (mask_len, mask_elem) = (in_len, in_elem);

        // The second argument must be a pointer matching the element type
        let pointer_ty = arg_tys[1];

        // The last argument specifies the values to store to memory
        let values_ty = arg_tys[2];
        let (values_len, values_elem) = require_simd!(values_ty, SimdThird);

        // Of the same length:
        require!(
            values_len == mask_len,
            InvalidMonomorphization::ThirdArgumentLength {
                span,
                name,
                in_len: mask_len,
                in_ty: mask_ty,
                arg_ty: values_ty,
                out_len: values_len
            }
        );

        // The second argument must be a mutable pointer type matching the element type
        require!(
            matches!(
                pointer_ty.kind(),
                ty::RawPtr(p) if p.ty == values_elem && p.ty.kind() == values_elem.kind() && p.mutbl.is_mut()
            ),
            InvalidMonomorphization::ExpectedElementType {
                span,
                name,
                expected_element: values_elem,
                second_arg: pointer_ty,
                in_elem: values_elem,
                in_ty: values_ty,
                mutability: ExpectedPointerMutability::Mut,
            }
        );

        require!(
            matches!(mask_elem.kind(), ty::Int(_)),
            InvalidMonomorphization::ThirdArgElementType {
                span,
                name,
                expected_element: values_elem,
                third_arg: mask_ty,
            }
        );

        // Alignment of T, must be a constant integer value:
        let alignment_ty = bx.type_i32();
        let alignment = bx.const_i32(bx.align_of(values_elem).bytes() as i32);

        // Truncate the mask vector to a vector of i1s:
        let (mask, mask_ty) = {
            let i1 = bx.type_i1();
            let i1xn = bx.type_vector(i1, in_len);
            (bx.trunc(args[0].immediate(), i1xn), i1xn)
        };

        let ret_t = bx.type_void();

        let llvm_pointer = bx.type_ptr();

        // Type of the vector of elements:
        let llvm_elem_vec_ty = llvm_vector_ty(bx, values_elem, values_len);
        let llvm_elem_vec_str = llvm_vector_str(bx, values_elem, values_len);

        let llvm_intrinsic = format!("llvm.masked.store.{llvm_elem_vec_str}.p0");
        let fn_ty = bx.type_func(&[llvm_elem_vec_ty, llvm_pointer, alignment_ty, mask_ty], ret_t);
        let f = bx.declare_cfn(&llvm_intrinsic, llvm::UnnamedAddr::No, fn_ty);
        let v = bx.call(
            fn_ty,
            None,
            None,
            f,
            &[args[2].immediate(), args[1].immediate(), alignment, mask],
            None,
        );
        return Ok(v);
    }

    if name == sym::simd_scatter {
        // simd_scatter(values: <N x T>, pointers: <N x *mut T>,
        //              mask: <N x i{M}>) -> ()
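For reference, a scalar sketch of the store's per-lane behavior (my paraphrase, not code from this PR; the shape and name are for illustration). Note also that the LLVM lowering above truncates each mask element to i1, keeping only its low bit, while the cranelift masked-load lowering branches on the whole element being non-zero, so mask elements should stick to 0 or -1:

fn masked_store_scalar_semantics(mask: [i32; 4], ptr: *mut f32, values: [f32; 4]) {
    for lane in 0..4 {
        // Enabled lane: write the value at this lane's offset.
        if mask[lane] != 0 {
            unsafe { *ptr.add(lane) = values[lane] };
        }
        // Disabled lane: the corresponding address is not accessed at all.
    }
}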
2 changes: 2 additions & 0 deletions compiler/rustc_hir_analysis/src/check/intrinsic.rs
@@ -521,6 +521,8 @@ pub fn check_platform_intrinsic_type(tcx: TyCtxt<'_>, it: &hir::ForeignItem<'_>)
        sym::simd_fpowi => (1, 0, vec![param(0), tcx.types.i32], param(0)),
        sym::simd_fma => (1, 0, vec![param(0), param(0), param(0)], param(0)),
        sym::simd_gather => (3, 0, vec![param(0), param(1), param(2)], param(0)),
        sym::simd_masked_load => (3, 0, vec![param(0), param(1), param(2)], param(2)),
        sym::simd_masked_store => (3, 0, vec![param(0), param(1), param(2)], Ty::new_unit(tcx)),
        sym::simd_scatter => (3, 0, vec![param(0), param(1), param(2)], Ty::new_unit(tcx)),
        sym::simd_insert => (2, 0, vec![param(0), tcx.types.u32, param(1)], param(0)),
        sym::simd_extract => (2, 0, vec![param(0), tcx.types.u32], param(1)),
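For readers unfamiliar with this table, my reading of the tuple convention, inferred from the surrounding entries, is:

// sym::simd_masked_load => (3, 0, vec![param(0), param(1), param(2)], param(2))
//   3         - number of generic type parameters (M, P, T in the tests below)
//   0         - number of const generic parameters
//   vec![...] - argument types: mask = param(0), pointer = param(1), values = param(2)
//   param(2)  - the return type equals the type of `values`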
2 changes: 2 additions & 0 deletions compiler/rustc_span/src/symbol.rs
@@ -1516,6 +1516,8 @@ symbols! {
        simd_insert,
        simd_le,
        simd_lt,
        simd_masked_load,
        simd_masked_store,
        simd_mul,
        simd_ne,
        simd_neg,
34 changes: 34 additions & 0 deletions tests/codegen/simd-intrinsic/simd-intrinsic-generic-masked-load.rs
@@ -0,0 +1,34 @@
// compile-flags: -C no-prepopulate-passes

#![crate_type = "lib"]

#![feature(repr_simd, platform_intrinsics)]
#![allow(non_camel_case_types)]

#[repr(simd)]
#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Vec2<T>(pub T, pub T);

#[repr(simd)]
#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Vec4<T>(pub T, pub T, pub T, pub T);

extern "platform-intrinsic" {
    fn simd_masked_load<M, P, T>(mask: M, pointer: P, values: T) -> T;
}

// CHECK-LABEL: @load_f32x2
#[no_mangle]
pub unsafe fn load_f32x2(mask: Vec2<i32>, pointer: *const f32,
                         values: Vec2<f32>) -> Vec2<f32> {
Member:

What is the meaning of "values" here? It seems strange that a load is told the values to load...?

Contributor Author:

These are the fallback/passthrough values used for lanes that are masked off, i.e. lanes whose corresponding mask element is all zeros.

Member:

Ah, the name values doesn't really convey that. Thanks!

    // CHECK: call <2 x float> @llvm.masked.load.v2f32.p0(ptr {{.*}}, i32 {{.*}}, <2 x i1> {{.*}}, <2 x float> {{.*}})
    simd_masked_load(mask, pointer, values)
}

// CHECK-LABEL: @load_pf32x4
#[no_mangle]
pub unsafe fn load_pf32x4(mask: Vec4<i32>, pointer: *const *const f32,
                          values: Vec4<*const f32>) -> Vec4<*const f32> {
    // CHECK: call <4 x ptr> @llvm.masked.load.v4p0.p0(ptr {{.*}}, i32 {{.*}}, <4 x i1> {{.*}}, <4 x ptr> {{.*}})
    simd_masked_load(mask, pointer, values)
}
32 changes: 32 additions & 0 deletions tests/codegen/simd-intrinsic/simd-intrinsic-generic-masked-store.rs
@@ -0,0 +1,32 @@
// compile-flags: -C no-prepopulate-passes

#![crate_type = "lib"]

#![feature(repr_simd, platform_intrinsics)]
#![allow(non_camel_case_types)]

#[repr(simd)]
#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Vec2<T>(pub T, pub T);

#[repr(simd)]
#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Vec4<T>(pub T, pub T, pub T, pub T);

extern "platform-intrinsic" {
    fn simd_masked_store<M, P, T>(mask: M, pointer: P, values: T) -> ();
}

// CHECK-LABEL: @store_f32x2
#[no_mangle]
pub unsafe fn store_f32x2(mask: Vec2<i32>, pointer: *mut f32, values: Vec2<f32>) {
    // CHECK: call void @llvm.masked.store.v2f32.p0(<2 x float> {{.*}}, ptr {{.*}}, i32 {{.*}}, <2 x i1> {{.*}})
    simd_masked_store(mask, pointer, values)
Member:

What are the alignment requirements on the pointer for the store to be valid? (And same question for the load.)

Contributor Author:

The pointer should be aligned to the individual element, not the whole vector. So when loading f32x2, size_of is 8, but the alignment of f32 is 4, therefore the pointer should be aligned to 4. This applies to both load and store.
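A sketch of what element alignment buys in practice, reusing Vec2 and the simd_masked_load declaration from the load test above (the helper name is mine, and this assumes the element-alignment fix discussed earlier is in place):

pub unsafe fn shifted_load(data: &[f32; 3]) -> Vec2<f32> {
    // `data.as_ptr().add(1)` is 4-byte (f32) aligned but in general not
    // 8-byte (f32x2) aligned; element alignment makes this load fine.
    let mask = Vec2(-1i32, -1);
    simd_masked_load(mask, data.as_ptr().add(1), Vec2(0.0f32, 0.0))
}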

RalfJung (Member), Dec 10, 2023:

What if the mask is all-0, i.e., no load/store actually happens. Is the pointer still required to be aligned?

Member:

Even with alignment checking, masked elements never generate an access for the load, thus no hardware implementation requires it. If you have strong reasoning for arguing otherwise, we could make it UB regardless?

Otherwise: no, it's a non-event, like unsafe { invalid_ptr.add(0) }.
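A sketch of the non-event reading argued for here; whether this is actually guaranteed was still being settled in this thread, so treat it as illustrative only (Vec2 and the intrinsic declaration come from the tests, the helper name is mine):

pub unsafe fn all_false_mask() -> Vec2<f32> {
    let dangling = core::ptr::NonNull::<f32>::dangling().as_ptr() as *const f32;
    // No lane is enabled, so no memory is accessed and the passthrough
    // vector comes back unchanged.
    simd_masked_load(Vec2(0i32, 0), dangling, Vec2(7.0f32, 8.0))
}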

Member:

FWIW, x86-64 never generates an alignment-check exception for these even if you turn alignment checking on, while the other architectures are explicit that predication prevents the access, and therefore the alignment check.

Member:

Yes they do.

Contributor Author:

I believe everything that's been raised so far has been addressed. Thanks @RalfJung, and apologies if my answers earlier were unhelpful. I wasn't aware that the alignment requirement is only relaxed through the from_array/to_array path, relying on the optimizer to elide those later.

RalfJung (Member), Dec 13, 2023:

@farnoy can you link to the PR(s) that address this?

Member:

Mostly I've been asking for documentation. #118853 addresses that, please double-check there that the docs for these new intrinsics match what you implemented now. :)

}

// CHECK-LABEL: @store_pf32x4
#[no_mangle]
pub unsafe fn store_pf32x4(mask: Vec4<i32>, pointer: *mut *const f32, values: Vec4<*const f32>) {
    // CHECK: call void @llvm.masked.store.v4p0.p0(<4 x ptr> {{.*}}, ptr {{.*}}, i32 {{.*}}, <4 x i1> {{.*}})
    simd_masked_store(mask, pointer, values)
}
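Putting the two tests together, a sketch of a masked copy; the helper is hypothetical and assumes Vec4 plus both intrinsic declarations from the files above are in scope:

pub unsafe fn masked_copy(mask: Vec4<i32>, src: *const f32, dst: *mut f32) {
    // Disabled lanes pick up the 0.0 passthrough on load, but the store
    // skips them too, so those elements of `dst` are left untouched.
    let v = simd_masked_load(mask, src, Vec4(0.0f32, 0.0, 0.0, 0.0));
    simd_masked_store(mask, dst, v);
}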