Skip to content

Conversation

JDPailleux
Copy link
Member

Hello,

This PR is a discussion around a dialect to define “coarrray” operations in Flang. This is a draft and a few operations have been proposed to start with and see if they might be suitable.

Copy link

@bonachea bonachea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JDPailleux I'm not sure why this was moved from PR #3 without explanation, but the only difference seems to be the branch names. I've copied over my initial high-level comments from #3.

I have plenty of other feedback on the actual content of the dialect, but before diving into that I was hoping for some clarity on the intended scope and name of the dialect.

Is this dialect intended to capture all of Fortran's multi-image functionality (i.e. everything that gets lowered into a PRIF call)? Or is it really intended to just capture coarrays (in which case large sections of the current draft seem out-of-scope)?

}

//===----------------------------------------------------------------------===//
// Collective Operations

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do Collective Subroutines really belong in this dialect?

These are often conflated during informal conversations, but Collective Subroutines have absolutely nothing to do with coarrays. They are a completely orthogonal multi-image feature. In fact the argument to collective subroutines is even prohibited from being a coindexed object.

I'm not sure these belong in ANY dialect - for most static purposes, collective subroutines are just a regular intrinsic subroutine. IMHO they don't belong lumped into a coarray dialect any more than say, the SQRT intrinsic.

}

//===----------------------------------------------------------------------===//
// Image Queries

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The STOP and FAIL IMAGE statements and intrinsics NUM_IMAGES, STOPPED_IMAGES, FAILED_IMAGES and IMAGE_STATUS also have nothing to do with coarrays. Neither does SYNC ALL, nor (arguably) all the TEAM support routines.

Perhaps this dialect should be renamed to the multi-image dialect? That name could encompass all the features related to multi-image parallel execution, of which coarrays are an important subset.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 3fef235

@JDPailleux
Copy link
Member Author

Hi @bonachea , following your feedback and what was mentioned during the committee meeting, I renamed the dialect to Multi-Image (MIF) dialect because it make more sense.

Copy link

@bonachea bonachea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great start on this dialect!

I made many comments/requests. These include may simple suggestions that can probably be applied immediately (I highly recommend using the "commit suggestion" button in the GitHub UI for this purpose, it saves alot of typing and is less error-prone than manual application). There are also a few comments that may require more sweeping changes and possibly further discussion.

I will also likely have more feedback once we start to address some of these issues I've raised.

Copy link

@bonachea bonachea Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several operations seem to be notably missing from this dialect:

  1. non-coarray ALLOCATE
  2. MOVE_ALLOC with coarray arguments

I assume these pre-existing serial features already exists as operations elsewhere in HLFIR/MLIR, but they will need to be lowered to PRIF calls when compiling in multi-image mode.

How do we propose to handle such operations?

}

//===----------------------------------------------------------------------===//
// Image Queries

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 3fef235

@JDPailleux JDPailleux changed the title [DRAFT][flang][Coarray] Dialect Coarray operations [DRAFT][flang] Multi-Image Fortran (MIF) Dialect Aug 1, 2025
@JDPailleux
Copy link
Member Author

Hi @bonachea, sorry for the delay.
Thank you for all feedback and sorry for the many careless mistakes and confusions. I've updated the code with some of your comments.

Copy link

@bonachea bonachea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @JDPailleux - Thanks for all the improvements!

I've made a few additional comments on open threads and added a few more suggestions.

I again highly recommend using the "commit suggestion" button in the GitHub UI for suggestions you agree with: this wonderful button saves alot of typing and is less error-prone than manual application. There is also an "Add suggestion to batch" button that can be used to batch suggestions into a single commit.

Also a few comments from the first review remain unresolved.

Comment on lines +684 to +695
- `atom`: Shall be a coarray of type integer with kind
ATOMIC_INT_KIND or ATOMIC_LOGICAL_KIND from module ISO_FORTRAN_ENV
- `old`, `new` and `compared`: Shall be a scalar integer with kind ATOMIC_LOGICAL_KIND.
They need to have the same type and kind as `atom`.
}];

let arguments = (ins AnyIntegerType:$image_num,
AnyIntegerType:$atom,
Arg<AnyIntegerType, "", [MemWrite]>:$old,
Arg<AnyIntegerType, "", [MemRead]>:$compare,
Arg<AnyIntegerType, "", [MemRead]>:$new_val,
Arg<Optional<AnyRefOrBoxType>, "", [MemWrite]>:$errmsg);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As previously mentioned, ATOMIC_LOGICAL_KIND is not an integer kind. Strictly speaking the two atomic types in Fortran are:

  • integer(ATOMIC_INT_KIND) and
  • logical(ATOMIC_LOGICAL_KIND)

It may be that flang internally conflates integer and logical (not sure?), but I think at least in descriptions we should try to express the types correctly.

In the suggestion below I'm unsure whether AnyIntegerType will encompass logical(ATOMIC_LOGICAL_KIND), so this might have to be something more generic.

Suggested change
- `atom`: Shall be a coarray of type integer with kind
ATOMIC_INT_KIND or ATOMIC_LOGICAL_KIND from module ISO_FORTRAN_ENV
- `old`, `new` and `compared`: Shall be a scalar integer with kind ATOMIC_LOGICAL_KIND.
They need to have the same type and kind as `atom`.
}];
let arguments = (ins AnyIntegerType:$image_num,
AnyIntegerType:$atom,
Arg<AnyIntegerType, "", [MemWrite]>:$old,
Arg<AnyIntegerType, "", [MemRead]>:$compare,
Arg<AnyIntegerType, "", [MemRead]>:$new_val,
Arg<Optional<AnyRefOrBoxType>, "", [MemWrite]>:$errmsg);
- `atom`: Shall be a coarray of type integer with kind
ATOMIC_INT_KIND or of type logical with kind ATOMIC_LOGICAL_KIND
from module ISO_FORTRAN_ENV
- `old`, `new` and `compared`: Shall be a scalars with the same type and kind as `atom`.
}];
let arguments = (ins AnyIntegerType:$image_num,
AnyIntegerType:$atom,
Arg<AnyIntegerType, "", [MemWrite]>:$old,
Arg<AnyIntegerType, "", [MemRead]>:$compare,
Arg<AnyIntegerType, "", [MemRead]>:$new_val,
Arg<Optional<AnyRefOrBoxType>, "", [MemWrite]>:$errmsg);

The same comment also applies to mif_AtomicDefineOp and mif_AtomicRefOp, the two other operations that accept both integer or logical arguments.

Comment on lines +742 to +743
The co_broadcast operation performs the computation of the sum
across images.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

co_broadcast is a data movement collective, not a computational collective. It does not perform a "sum" (or any other ALU computation).

Suggested change
The co_broadcast operation performs the computation of the sum
across images.
The co_broadcast operation broadcasts a value across images.

def mif_CoReduceOp : mif_Op<"co_reduce", [AttrSizedOperandSegments]> {
let summary = "Generalized reduction across images.";
let description = [{
The co_sum operation performs the reduction across images.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The co_sum operation performs the reduction across images.
The co_reduce operation performs the reduction across images.

"Performs a collective synchronization of all images in the current team";

let arguments = (ins Arg<Optional<AnyRefOrBoxType>, "", [MemWrite]>:$errmsg);
let results = (outs AnyIntegerType:$stat);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand your answer.

My main concern is to ensure we faithfully represent the presence of absence of the optional STAT= argument from the user's original code in the dialect, so that when it comes time to lower this operation to a PRIF call you can match the user's choice to provide STAT or not in the corresponding PRIF call.

I should note this detail only matters when an error occurs, and maybe we initially don't care about error behavior. But if we expect to some day provide spec-compliant error behavior, then we probably need to preserve that information somehow in the IR.

let summary = "Performs a synchronization of image with each of the other "
"images in the `image_set`";

let arguments = (ins fir_SequenceType:$image_set,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The F23 standard does not actually require the image_set to be non-empty:

image

but of course you can pass an empty rank-one array of integers, e.g.:

PROGRAM SYNCIMGS
  integer, dimension(1:0) :: empty_array

  SYNC IMAGES(*)
  SYNC IMAGES(THIS_IMAGE())
  SYNC IMAGES([INTEGER ::])
  SYNC IMAGES(empty_array)
  print *, THIS_IMAGE(), size(empty_array)
END

I've just verified this program compiles and works with Cray Fortran. Both flang/trunk and the SiPearl fork accept this program without compile-time errors (and in general one cannot statically prove an array variable is non-empty).

However an empty input array (SYNC IMAGES([INTEGER ::])) means "synchronize with no images" whereas SYNC IMAGES(*) means the opposite "synchronize with all the images". For this reason PRIF makes the image_set optional and uses the absence of the argument to mean *, a state which is syntactically and semantically distinct from a provided empty array.

Comment on lines +385 to +386
- `image_num`: Shall be an integer that correspond to the image index of
the lock variable to be locked.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so in that case for completeness I think we need to introduce a dialect operation to perform that image translation.

The semantics are identical to prif_initial_team_index, but of course we don't have to call it that.

Comment on lines +387 to +388
- `lock_var` : Shall be a coarray of type LOCK_TYPE from the intrinsic
module ISO_FORTRAN_ENV.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely understand we are trying to match Fortran and not PRIF here, but in most cases the design of PRIF was determined by what's possible to express in Fortran.

My specific concern here is that the user code might not always match the simple case:

LOCK(coarray[image_num])

This is definitely a common/important case, but in general I believe Fortran also allows more complicated statements like:

LOCK(some_var%some_component(2)%coarray[image_num]%some_component(1,4,7)%lock_var(2))

In other words, the argument to LOCK can be any valid coindexed expression that leads to a LOCK_TYPE variable. The same is true for UNLOCK, also EVENT POST (with EVENT_TYPE) and ATOMIC_* (with appropriate atomic type).

You can even see an example of this in F23 sec C.7.9:

 EVENT POST (worker[1]%free(THIS_IMAGE()))

Does fir_BoxType preserve sufficient information about the full coindexed expression used to reach the appropriate variable? If not then we may eventually have to address the general case more explicitly in the dialect.

let arguments = (ins Optional<BoolLike>:$quiet,
Optional<AnyIntegerType>:$stop_code_int,
Arg<Optional<AnyRefOrBoxType>, "", [MemWrite]>:$stop_code_char,
UnitAttr:$is_error);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I didn't realize there was an LLVM precedent for conflating these two separate statements.

I just wanted to note that while STOP and ERROR STOP have very similar runtime behavior in single-image execution, they have very different runtime behavior in multi-image execution. So we may eventually want this difference to be more visible in the dialect (to control analysis of surrounding code), but I guess it's fine this way for now provided we document it.

JDPailleux pushed a commit that referenced this pull request Aug 14, 2025
## Problem

When the new setting

```
set target.parallel-module-load true
```
was added, lldb began fetching modules from the devices from multiple
threads simultaneously. This caused crashes of lldb when debugging on
android devices.

The top of the stack in the crash look something like this:
```
#0 0x0000555aaf2b27fe llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/opt/llvm/bin/lldb-dap+0xb87fe)
 #1 0x0000555aaf2b0a99 llvm::sys::RunSignalHandlers() (/opt/llvm/bin/lldb-dap+0xb6a99)
 #2 0x0000555aaf2b2fda SignalHandler(int, siginfo_t*, void*) (/opt/llvm/bin/lldb-dap+0xb8fda)
 #3 0x00007f9c02444560 __restore_rt /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:13:0
 #4 0x00007f9c04ea7707 lldb_private::ConnectionFileDescriptor::Disconnect(lldb_private::Status*) (usr/bin/../lib/liblldb.so.15+0x22a7707)
 #5 0x00007f9c04ea5b41 lldb_private::ConnectionFileDescriptor::~ConnectionFileDescriptor() (usr/bin/../lib/liblldb.so.15+0x22a5b41)
 #6 0x00007f9c04ea5c1e lldb_private::ConnectionFileDescriptor::~ConnectionFileDescriptor() (usr/bin/../lib/liblldb.so.15+0x22a5c1e)
 llvm#7 0x00007f9c052916ff lldb_private::platform_android::AdbClient::SyncService::Stat(lldb_private::FileSpec const&, unsigned int&, unsigned int&, unsigned int&) (usr/bin/../lib/liblldb.so.15+0x26916ff)
 llvm#8 0x00007f9c0528b9dc lldb_private::platform_android::PlatformAndroid::GetFile(lldb_private::FileSpec const&, lldb_private::FileSpec const&) (usr/bin/../lib/liblldb.so.15+0x268b9dc)
```
Our workaround was to set `set target.parallel-module-load ` to `false`
to avoid the crash.

## Background

PlatformAndroid creates two different classes with one stateful adb
connection shared between the two -- one through AdbClient and another
through AdbClient::SyncService. The connection management and state is
complex, and seems to be responsible for the segfault we are seeing. The
AdbClient code resets these connections at times, and re-establishes
connections if they are not active. Similarly, PlatformAndroid caches
its SyncService, which uses an AdbClient class, but the SyncService puts
its connection into a different 'sync' state that is incompatible with a
standard connection.

## Changes in this diff

* This diff refactors the code to (hopefully) have clearer ownership of
the connection, clearer separation of AdbClient and SyncService by
making a new class for clearer separations of concerns, called
AdbSyncService.
* New unit tests are added
* Additional logs were added (see
llvm#145382 (comment)
for details)
JDPailleux pushed a commit that referenced this pull request Aug 14, 2025
…namic (llvm#153420)

Canonicalizing the following IR:

```
func.func @mul_zero_dynamic_nofold(%arg0: tensor<?x17xf32>) -> tensor<?x17xf32> {
  %0 = "tosa.const"() <{values = dense<0.000000e+00> : tensor<1x1xf32>}> : () -> tensor<1x1xf32>
  %1 = "tosa.const"() <{values = dense<0> : tensor<1xi8>}> : () -> tensor<1xi8>
  %2 = tosa.mul %arg0, %0, %1 : (tensor<?x17xf32>, tensor<1x1xf32>, tensor<1xi8>) -> tensor<?x17xf32>
  return %2 : tensor<?x17xf32>
}
```

resulted in a crash

```
#0 0x000056513187e8db backtrace (./build-release/bin/mlir-opt+0x9d698db)                                                                                                                                                                                                                                                                                                                   
 #1 0x0000565131b17737 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /local-ssd/sayans/Softwares/llvm-repo/llvm-project-latest/llvm/lib/Support/Unix/Signals.inc:838:8                                                                                                                                                                                                                
 #2 0x0000565131b187f3 PrintStackTraceSignalHandler(void*) /local-ssd/sayans/Softwares/llvm-repo/llvm-project-latest/llvm/lib/Support/Unix/Signals.inc:918:1                                                                                                                                                                                                                                
 #3 0x0000565131b18c30 llvm::sys::RunSignalHandlers() /local-ssd/sayans/Softwares/llvm-repo/llvm-project-latest/llvm/lib/Support/Signals.cpp:105:18                                                                                                                                                                                                                                         
 #4 0x0000565131b18c30 SignalHandler(int, siginfo_t*, void*) /local-ssd/sayans/Softwares/llvm-repo/llvm-project-latest/llvm/lib/Support/Unix/Signals.inc:409:3                                                                                                                                                                                                                              
 #5 0x00007f2e4165b050 (/lib/x86_64-linux-gnu/libc.so.6+0x3c050)                                                                                                                                                                                                                                                                                                                            
 #6 0x00007f2e416a9eec __pthread_kill_implementation ./nptl/pthread_kill.c:44:76                                                                                                                                                                                                                                                                                                            
 llvm#7 0x00007f2e4165afb2 raise ./signal/../sysdeps/posix/raise.c:27:6                                                                                                                                                                                                                                                                                                                         
 llvm#8 0x00007f2e41645472 abort ./stdlib/abort.c:81:7                                                                                                                                                                                                                                                                                                                                          
 llvm#9 0x00007f2e41645395 _nl_load_domain ./intl/loadmsgcat.c:1177:9                                                                                                                                                                                                                                                                                                                           
llvm#10 0x00007f2e41653ec2 (/lib/x86_64-linux-gnu/libc.so.6+0x34ec2)                                                                                                                                                                                                                                                                                                                            
llvm#11 0x00005651443ec4ba mlir::DenseIntOrFPElementsAttr::getRaw(mlir::ShapedType, llvm::ArrayRef<char>) /local-ssd/sayans/Softwares/llvm-repo/llvm-project-latest/mlir/lib/IR/BuiltinAttributes.cpp:1361:3                                                                                                                                                                                    
llvm#12 0x00005651443f1209 mlir::DenseElementsAttr::resizeSplat(mlir::ShapedType) /local-ssd/sayans/Softwares/llvm-repo/llvm-project-latest/mlir/lib/IR/BuiltinAttributes.cpp:0:10                                                                                                                                                                                                              
llvm#13 0x000056513f76f2b6 mlir::tosa::MulOp::fold(mlir::tosa::MulOpGenericAdaptor<llvm::ArrayRef<mlir::Attribute>>) /local-ssd/sayans/Softwares/llvm-repo/llvm-project-latest/mlir/lib/Dialect/Tosa/IR/TosaCanonicalizations.cpp:0:0
```

from the folder for `tosa::mul` since the zero value was being reshaped
to `?x17` size which isn't supported. AFAIK, `tosa.const` requires all
dimensions to be static. So in this case, the fix is to not to fold the
op.
JDPailleux pushed a commit that referenced this pull request Aug 20, 2025
This can happen when JIT code is run, and we can't symbolize those
frames, but they should remain numbered in the stack. An example
spidermonkey trace:

```
    #0 0x564ac90fb80f  (/builds/worker/dist/bin/js+0x240e80f) (BuildId: 5d053c76aad4cfbd08259f8832e7ac78bbeeab58)
    #1 0x564ac9223a64  (/builds/worker/dist/bin/js+0x2536a64) (BuildId: 5d053c76aad4cfbd08259f8832e7ac78bbeeab58)
    #2 0x564ac922316f  (/builds/worker/dist/bin/js+0x253616f) (BuildId: 5d053c76aad4cfbd08259f8832e7ac78bbeeab58)
    #3 0x564ac9eac032  (/builds/worker/dist/bin/js+0x31bf032) (BuildId: 5d053c76aad4cfbd08259f8832e7ac78bbeeab58)
    #4 0x0dec477ca22e  (<unknown module>)
```

Without this change, the following symbolization is output:

```
    #0 0x55a6d72f980f in MOZ_CrashSequence /builds/worker/workspace/obj-build/dist/include/mozilla/Assertions.h:248:3
    #1 0x55a6d72f980f in Crash(JSContext*, unsigned int, JS::Value*) /builds/worker/checkouts/gecko/js/src/shell/js.cpp:4223:5
    #2 0x55a6d7421a64 in CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:501:13
    #3 0x55a6d742116f in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:597:12
    #4 0x55a6d80aa032 in js::jit::DoCallFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICFallbackStub*, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) /builds/worker/checkouts/gecko/js/src/jit/BaselineIC.cpp:1705:10
    #4 0x2c803bd8f22e  (<unknown module>)
```

The last frame has a duplicate number. With this change the numbering is
correct:

```
    #0 0x5620c58ec80f in MOZ_CrashSequence /builds/worker/workspace/obj-build/dist/include/mozilla/Assertions.h:248:3
    #1 0x5620c58ec80f in Crash(JSContext*, unsigned int, JS::Value*) /builds/worker/checkouts/gecko/js/src/shell/js.cpp:4223:5
    #2 0x5620c5a14a64 in CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:501:13
    #3 0x5620c5a1416f in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) /builds/worker/checkouts/gecko/js/src/vm/Interpreter.cpp:597:12
    #4 0x5620c669d032 in js::jit::DoCallFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICFallbackStub*, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) /builds/worker/checkouts/gecko/js/src/jit/BaselineIC.cpp:1705:10
    #5 0x349f24c7022e  (<unknown module>)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants