Description
GGML_HIP_UMA allows using hipMallocManaged, which enables UMA on AMD/HIP GPUs.
I have a Ryzen 7940HS and ran some tests. Using UMA lets the iGPU address much more memory than the reserved VRAM, which is nice, and it already gives some speedup over the CPU. But by default it uses "fine-grained" memory, which is slow. If we can use "coarse-grained" memory instead, we get more speed.
What I did for the test is replace:
```cpp
#ifdef GGML_HIP_UMA
#define cudaMalloc hipMallocManaged
#define cudaMallocHost(ptr, size) hipHostMalloc(ptr, size)
#else
#define cudaMalloc hipMalloc
#define cudaMallocHost(ptr, size) hipHostMalloc(ptr, size, hipHostMallocDefault)
#endif
```
by:
```cpp
#ifdef GGML_HIP_UMA
template<typename T>
static inline auto gpuAlloc(T** adr, size_t size) {
    auto res = hipMallocManaged(adr, size);
    if (res == hipSuccess) {
        // On success, ask for coarse-grained pages on device 0.
        return hipMemAdvise(*adr, size, hipMemAdviseSetCoarseGrain, 0);
    }
    return res;
}
#define cudaMalloc gpuAlloc
#define cudaMallocHost(ptr, size) hipHostMalloc(ptr, size)
#else
#define cudaMalloc hipMalloc
#define cudaMallocHost(ptr, size) hipHostMalloc(ptr, size, hipHostMallocDefault)
#endif
```
I tested with the llamafile 8.4 source code (sorry, I made some other changes to it too, like adding an option to enable UMA...)
With "mistral-7b-instruct-v0.2.Q8_0.gguf" on a Framework 16 with a 7940HS / no dGPU + Fedora 40 (rocm6), I got these results:
```
// - zen4 x 8 => CPU
llama_print_timings:        load time =   1383.08 ms
llama_print_timings:      sample time =      4.02 ms /   534 runs   (  0.01 ms per token, 132835.82 tokens per second)
llama_print_timings: prompt eval time =  29762.54 ms /  1466 tokens ( 20.30 ms per token,    49.26 tokens per second)
llama_print_timings:        eval time =  82369.69 ms /   533 runs   (154.54 ms per token,     6.47 tokens per second)
llama_print_timings:       total time = 112190.80 ms /  1999 tokens

// - gfx1103 / rocblas / HSA_OVERRIDE_GFX_VERSION=11.0.1
llama_print_timings:        load time =   5391.01 ms
llama_print_timings:      sample time =      2.84 ms /   406 runs   (  0.01 ms per token, 143058.49 tokens per second)
llama_print_timings: prompt eval time =  14886.53 ms /  1466 tokens ( 10.15 ms per token,    98.48 tokens per second)
llama_print_timings:        eval time =  67061.92 ms /   405 runs   (165.58 ms per token,     6.04 tokens per second)
llama_print_timings:       total time =  82020.34 ms /  1871 tokens

// - gfx1103 / rocblas / HSA_OVERRIDE_GFX_VERSION=11.0.1 + "hipMemAdviseSetCoarseGrain"
llama_print_timings:        load time =   5470.93 ms
llama_print_timings:      sample time =      2.94 ms /   121 runs   (  0.02 ms per token,  41212.53 tokens per second)
llama_print_timings: prompt eval time =   8093.41 ms /  1466 tokens (  5.52 ms per token,   181.14 tokens per second)
llama_print_timings:        eval time =  12917.88 ms /   120 runs   (107.65 ms per token,     9.29 tokens per second)
llama_print_timings:       total time =  21095.91 ms /  1586 tokens
```
As you can see, with "CoarseGrain" I get ~2x over the default GPU path and ~4x over the CPU for prompt eval.
And even ~+40% for token generation (eval) ...
Note: to be fair, I get crashes from time to time on the GPU... but that's another story, and this gfx1103 is not fully supported (I don't know if the crashes come from Fedora's rocm 6.0 rebuild or also exist on other rocm builds...).
PS: I need to use the 'HSA_OVERRIDE_GFX_VERSION=11.0.1' env var... it is faster than 11.0.0 and 11.0.2, until gfx1103 is supported.
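For reference, the setup described above might look like this as a build-and-run sketch; this is a hedged sketch, not verified: the option names (`LLAMA_HIPBLAS`, `GGML_HIP_UMA` as named in this issue) and the binary path are assumptions that may differ in your checkout.

```shell
# Assumed option names; check your llama.cpp tree before using.
cmake -B build -DLLAMA_HIPBLAS=ON -DGGML_HIP_UMA=ON
cmake --build build --config Release

# Run with the GFX override the author needed for gfx1103.
HSA_OVERRIDE_GFX_VERSION=11.0.1 ./build/bin/main -m mistral-7b-instruct-v0.2.Q8_0.gguf -p "Hello"
```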
PS2: sorry for my bad English ...