Skip to content

[OpenMP][OMPT] initial_device_num on initialize callback set incorrectly #134451

Open
@Thyre

Description

@Thyre
Contributor

We've recently released Score-P v9.0, having support for the OpenMP target via the OpenMP Tools Interface for the first time.

However, a user quickly noticed an issue, which boils down to a bug in LLVM.
When the OpenMP runtime initializes a connected tool, the following callback will be dispatched:

typedef int (*ompt_initialize_t) (ompt_function_lookup_t lookup, int initial_device_num, ompt_data_t *tool_data);

initial_device_num is defined as "the value that a call to omp_get_initial_device would return". This itself is defined as being "the value of the device number is the value of omp_initial_device or the value returned by the omp_get_num_devices routine".
Looking at the actual returned values however, this is not the case. Instead, LLVM always reports initial_device_num = 0. Since LLVM also does not implement ompt_get_num_devices correctly

return 1; // only one device (the current device) is available

a tool cannot safely determine if a device identifier in some target callback is actually the host.

LLVM typically uses -1 for this, but tools should not have to rely on guessing this, as it might or might not change in later versions. This also makes supporting different runtimes more complicated.


To reproduce the issue:

#include <omp.h>

#include <omp-tools.h>
#include <stdlib.h>
#include <assert.h>
#include <inttypes.h>
#include <string.h>

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdatomic.h>

/* MAIN */

int main( int argc, char** argv )
{
#pragma omp target
    {
        printf( "Hello from target region\n" );
    }
}

/* OMPT INTERFACE */

#if ( defined( __ppc64__ ) || defined( __powerpc64__ ) || defined( __PPC64__ ) )
#define OMPT_TOOL_CPU_RELAX ( ( void )0 )
#elif ( defined( __x86_64 ) || defined( __x86_64__ ) || defined( __amd64 ) || defined( _M_X64 ) )
#define OMPT_TOOL_CPU_RELAX __asm__ volatile ( "pause" )
#elif ( defined( __aarch64__ ) || defined( __ARM64__ ) || defined( _M_ARM64 ) )
#define OMPT_TOOL_CPU_RELAX __asm__ volatile ( "yield" )
#else
#define OMPT_TOOL_CPU_RELAX ( ( void )0 )
#endif


/* Test-and-test-and-set lock. Mutexes are of type bool */
#define OMPT_TOOL_LOCK( MUTEX ) \
    while ( true ) \
    { \
        if ( atomic_flag_test_and_set_explicit( &( MUTEX ), memory_order_acquire ) != true ) \
        { \
            break; \
        } \
        OMPT_TOOL_CPU_RELAX; \
    }
#define OMPT_TOOL_UNLOCK( MUTEX ) atomic_flag_clear_explicit( &( MUTEX ), memory_order_release );

#define OMPT_TOOL_GUARDED_PRINTF( ... ) \
    OMPT_TOOL_LOCK( ompt_tool_printf_mutex ) \
    printf( __VA_ARGS__ ); \
    OMPT_TOOL_UNLOCK( ompt_tool_printf_mutex )


_Thread_local int32_t ompt_tool_tid          = -1; /* thread counter. >= 1 after thread_begin  */
atomic_flag           ompt_tool_printf_mutex = ATOMIC_FLAG_INIT;


static int
my_initialize_tool( ompt_function_lookup_t lookup,
                    int                    initial_device_num,
                    ompt_data_t*           tool_data )
{
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 " | initial_device_num %d\n",
                              __FUNCTION__,
                              ompt_tool_tid,
                              initial_device_num );
    return 1; /* non-zero indicates success */
}

static void
my_finalize_tool( ompt_data_t* tool_data )
{
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 "\n",
                              __FUNCTION__,
                              ompt_tool_tid );
}

ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version,
                 const char*  runtime_version )
{
    setbuf( stdout, NULL );
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 " | omp_version %d | runtime_version = \'%s\'\n",
                              __FUNCTION__,
                              ompt_tool_tid,
                              omp_version,
                              runtime_version );
    static ompt_start_tool_result_t tool = { &my_initialize_tool,
                                             &my_finalize_tool,
                                             ompt_data_none };
    return &tool;
}

Building the reproducer on a system with at least one GPU (maybe -fopenmp-targets=x86_64 works too), one will see the following result:

$ clang -fopenmp --offload-arch=sm_75 test.c
$ ./a.out
[ompt_start_tool] tid = -1 | omp_version 201611 | runtime_version = 'LLVM OMP version: 5.0.20140926'
[my_initialize_tool] tid = -1 | initial_device_num 0
Hello from target region
[my_finalize_tool] tid = -1

As one can see initial_device_num is 0. In my case, it should be either -1 or 1.

Activity

llvmbot

llvmbot commented on Apr 4, 2025

@llvmbot
Member

@llvm/issue-subscribers-openmp

Author: Jan André Reuter (Thyre)

We've recently released Score-P v9.0, having support for the OpenMP target via the OpenMP Tools Interface for the first time.

However, a user quickly noticed an issue, which boils down to a bug in LLVM.
When the OpenMP runtime initializes a connected tool, the following callback will be dispatched:

typedef int (*ompt_initialize_t) (ompt_function_lookup_t lookup, int initial_device_num, ompt_data_t *tool_data);

initial_device_num is defined as "the value that a call to omp_get_initial_device would return". This itself is defined as being "the value of the device number is the value of omp_initial_device or the value returned by the omp_get_num_devices routine".
Looking at the actual returned values however, this is not the case. Instead, LLVM always reports initial_device_num = 0. Since LLVM also does not implement ompt_get_num_devices correctly

return 1; // only one device (the current device) is available

a tool cannot safely determine if a device identifier in some target callback is actually the host.

LLVM typically uses -1 for this, but tools should not have to rely on guessing this, as it might or might not change in later versions. This also makes supporting different runtimes more complicated.


To reproduce the issue:

#include &lt;omp.h&gt;

#include &lt;omp-tools.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;assert.h&gt;
#include &lt;inttypes.h&gt;
#include &lt;string.h&gt;

#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdatomic.h&gt;

/* MAIN */

int main( int argc, char** argv )
{
#pragma omp target
    {
        printf( "Hello from target region\n" );
    }
}

/* OMPT INTERFACE */

#if ( defined( __ppc64__ ) || defined( __powerpc64__ ) || defined( __PPC64__ ) )
#define OMPT_TOOL_CPU_RELAX ( ( void )0 )
#elif ( defined( __x86_64 ) || defined( __x86_64__ ) || defined( __amd64 ) || defined( _M_X64 ) )
#define OMPT_TOOL_CPU_RELAX __asm__ volatile ( "pause" )
#elif ( defined( __aarch64__ ) || defined( __ARM64__ ) || defined( _M_ARM64 ) )
#define OMPT_TOOL_CPU_RELAX __asm__ volatile ( "yield" )
#else
#define OMPT_TOOL_CPU_RELAX ( ( void )0 )
#endif


/* Test-and-test-and-set lock. Mutexes are of type bool */
#define OMPT_TOOL_LOCK( MUTEX ) \
    while ( true ) \
    { \
        if ( atomic_flag_test_and_set_explicit( &amp;( MUTEX ), memory_order_acquire ) != true ) \
        { \
            break; \
        } \
        OMPT_TOOL_CPU_RELAX; \
    }
#define OMPT_TOOL_UNLOCK( MUTEX ) atomic_flag_clear_explicit( &amp;( MUTEX ), memory_order_release );

#define OMPT_TOOL_GUARDED_PRINTF( ... ) \
    OMPT_TOOL_LOCK( ompt_tool_printf_mutex ) \
    printf( __VA_ARGS__ ); \
    OMPT_TOOL_UNLOCK( ompt_tool_printf_mutex )


_Thread_local int32_t ompt_tool_tid          = -1; /* thread counter. &gt;= 1 after thread_begin  */
atomic_flag           ompt_tool_printf_mutex = ATOMIC_FLAG_INIT;


static int
my_initialize_tool( ompt_function_lookup_t lookup,
                    int                    initial_device_num,
                    ompt_data_t*           tool_data )
{
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 " | initial_device_num %d\n",
                              __FUNCTION__,
                              ompt_tool_tid,
                              initial_device_num );
    return 1; /* non-zero indicates success */
}

static void
my_finalize_tool( ompt_data_t* tool_data )
{
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 "\n",
                              __FUNCTION__,
                              ompt_tool_tid );
}

ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version,
                 const char*  runtime_version )
{
    setbuf( stdout, NULL );
    OMPT_TOOL_GUARDED_PRINTF( "[%s] tid = %" PRId32 " | omp_version %d | runtime_version = \'%s\'\n",
                              __FUNCTION__,
                              ompt_tool_tid,
                              omp_version,
                              runtime_version );
    static ompt_start_tool_result_t tool = { &amp;my_initialize_tool,
                                             &amp;my_finalize_tool,
                                             ompt_data_none };
    return &amp;tool;
}

Building the reproducer on a system with at least one GPU (maybe -fopenmp-targets=x86_64 works too), one will see the following result:

$ clang -fopenmp --offload-arch=sm_75 test.c
$ ./a.out
[ompt_start_tool] tid = -1 | omp_version 201611 | runtime_version = 'LLVM OMP version: 5.0.20140926'
[my_initialize_tool] tid = -1 | initial_device_num 0
Hello from target region
[my_finalize_tool] tid = -1

As one can see initial_device_num is 0. In my case, it should be either -1 or 1.

Thyre

Thyre commented on Apr 4, 2025

@Thyre
ContributorAuthor

Looking at the source code a bit, there are two places where OMPT is initialized:

ompt_fn_lookup, omp_get_initial_device(),

/* initial_device_num */ 0, /* tool_data */ nullptr);

The first case looks correct. However, I think we never see it in the case where offload is used, right?
The second case should never use 0, but also omp_get_initial_device. There are two possible values which can then be returned:

int omp_get_initial_device(void) { return -1; }

EXTERN int omp_get_initial_device(void) {

Here, I don't understand when which one is chosen...
Callbacks later on also use omp_get_initial_device, so we should stick with the same value everywhere. For sake of consistency with the non-tool side, having omp_get_num_devices would be nice. From a tool perspective however, I wouldn't really care.

Thyre

Thyre commented on Apr 9, 2025

@Thyre
ContributorAuthor

If I understand this correctly:

This is the actual initialization, using the host side of the OpenMP runtime:

ompt_fn_lookup, omp_get_initial_device(),

This connects the host side to the offload library:

/* initial_device_num */ 0, /* tool_data */ nullptr);

In the second case, initial_device_num and tool_data are unused, hence dummy values are passed.
Wouldn't hurt to pass the correct data though.

The host side initialization calls omp_get_initial_device(), which ends up in

EXTERN int omp_get_initial_device(void) {

This is where my guess work begins:

I think that we do not yet know how many devices exist when calling the initialization in the tool. Therefore, omp_get_initial_device() calls omp_get_num_devices() returning 0.
The same would probably also apply if we would implement ompt_get_num_devices(). This would work after the tool initialization, but not during it.

I'm wondering how much would break if we would just switch to returning -1, which is allowed per OpenMP 6.0 spec. (but would not be conformant with OpenMP 5.0 -- 5.2 ...)

self-assigned this
on Apr 9, 2025
mhalk

mhalk commented on Jun 5, 2025

@mhalk
Contributor

Hi @Thyre , sorry for the latency, I just spent some time to understand the issue at hand.

Unfortunately, I do not see a "good" solution ATM to correct the value during initialization.
AFAICT you are right and the number of devices is only known after initialization of the accelerators / devices has completed.

If returning "-1" as a known value during initialization alleviates the issue, IMHO that sounds reasonable.
Could you please point me to the parts of the spec that disallow "-1" for pre-6.0 and also the ones that allow for 6.x?

Thyre

Thyre commented on Jun 5, 2025

@Thyre
ContributorAuthor

Heya @mhalk, absolutely no problem 😄
Thanks a lot for looking into this at all!

The OpenMP 5.2 and 6.0 spec. state for ompt_start_tool:

5.2 (p.473, l.28):

The initial_device_num argument provides the value of omp_get_initial_device().

6.0 (p.745, l.19):

The initial_device_num argument provides the value that a call to omp_get_initial_device would return.

When then looking at omp_get_initial_device, we can see the following:

5.2 (p. 385, l.11):

. The value of the device number is the value returned by the omp_get_num_devices routine

6.0 (p. 598, l.16):

The value of the device number is the value of omp_initial_device or the value returned by the omp_get_num_devices routine

With omp_get_num_devices finally being defined as:

5.2 (p. 383, l. 20):

The omp_get_num_devices routine returns the number of available non-host devices onto which code or data may be offloaded

6.0 (p. 594,l. 16):

The omp_get_num_devices routine returns the value of the num-devices-var ICV, which is the number of available non-host devices onto which code or data may be offloaded

and omp_initial_device being a fixed constant for -1 (OpenMP 5.2: p. 346, l.14, OpenMP 6.0: p. 534, l. 5).


So while both specs are pretty much aligned in all of the definitions, OpenMP 6.0 allows the usage of omp_initial_device for omp_get_initial_device and therefore initial_device_num in OMPT. Using -1 would not entirely follow the OpenMP 5.2 spec. unfortunately.

Personally, I would be totally fine with LLVM using -1 for the host consistently, as omp_initial_device is defined as -1 even for 5.2, with the spec. even mentioning (OpenMP 5.2, p. 24, l. 2):

Additionally, the constant omp_initial_device can be used as an alias for the host device

which is exactly what we're looking for.

mhalk

mhalk commented on Jun 10, 2025

@mhalk
Contributor

Thanks for the effort you put into this; helps tremendously.

For OpenMP 6.0, I'd argue that the situation is clear and by using the constant omp_initial_device / -1, we adhere to the spec.

For OpenMP 5.2, omp_initial_device / -1 might not be obviously conformant on every path -- but IMO it actually is conformant with this situation.
According to the 5.2 spec: ompt_start_tool uses omp_get_initial_device (for determining initial_device_num)

Taking the following into account:

  1. 5.2 (p. 385, l. 3-4): omp_get_initial_device summary

    The omp_get_initial_device routine returns a device number that represents the host device.

  2. 5.2 (p. 385, l. 11): omp_get_initial_device effect

    The effect of this routine is to return the device number of the host device. [...]

  3. 5.2 (p. 24, l. 2-3): Execution Model

    Additionally, the constant omp_initial_device can be used as an alias for the host device [...]

Since omp_initial_device is an alias for the host device, omp_get_initial_device is IMO OpenMP 5.2 spec conformant when returning -1 while being called during ompt_start_tool.
Especially omp_get_initial_device's summary loosens the constraints: "returns a device number that represents the host device".

As you already stated:

I'm wondering how much would break if we would just switch to returning -1 [...]

I'll try to find that ^ out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

    Development

    No branches or pull requests

      Participants

      @jplehr@EugeneZelenko@Thyre@mhalk@llvmbot

      Issue actions

        [OpenMP][OMPT] `initial_device_num` on `initialize` callback set incorrectly · Issue #134451 · llvm/llvm-project