[Feature request] Any plans for AMD XDNA AI Engine support on Ryzen 7x40 processors? #1499
Comments
Is there any API or example code of how to use it out in the wild? I can't even find a whitepaper or anything like that. |
If there is a BLAS or even a matrix multiplication API available, it should be easy to add. |
Do the non-mobile Ryzen 7x00 CPUs also have this feature? |
As far as I know: no, only the mobile Phoenix 7040 series, see Wikipedia |
I second this. With the rise of ML accelerators in PCs, starting with Ryzen AI and Meteor Lake VPUs, using them might result in big efficiency gains and speedups. I'm also sure that once memory bottlenecks are reduced, more can be done by using tensor cores and the new RDNA3 AI accelerators more efficiently. Then you also have NPUs in modern smartphones that can be leveraged. Hardware acceleration is the way forward. |
An issue is that inference either has to run entirely on the XPU (excluding the possibility of partial OpenCL/CUDA acceleration), or the XPU has to support zero-copy/unified memory to avoid cost-prohibitive copies. IGP acceleration is similarly problematic. It's theoretically not impossible... For instance, Tencent's ncnn has some kind of Vulkan zero-copy mechanism for "unified memory" devices. I think the library AMD uses for the demo is this: https://github.com/Xilinx/Vitis-AI |
Don't know the specifics of this hardware, but most likely you need to follow the steps in #1642 to make it work |
AMD's Unified Inference Frontend (UIF) might also be useful. |
Here's some more information: |
Second this. I'm willing to look into it and actually write it. But where TF is the API? |
I think it's here: https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html |
@EwoutH Ohh.. thanks. Dang, I used to work with Vitis in college. It'll be quite a task to gain access to the low-level API on that. Update: Waiting for Acer to release the new Swift Edge in my country. |
There is now Ryzen AI Software Platform documentation: GitHub, ReadTheDocs. The Ryzen AI cloud-to-client demo has also been updated over the last few weeks. |
That looks grim. AMD did not expose low level (callable) access to their accelerator. And everything has to go through ONNX. |
@marty1885 I believe ONNX uses this provider to access the accelerator. Does that help in any way? |
@flobeier Unfortunately no. That ONNX provider calls Xilinx's (now AMD's) Vitis AI Library. Vitis also works on a graph-based system, and the provider is basically a translator from an ONNX graph to AMD's XIR graph. I tried searching the entire documentation; Vitis doesn't seem to expose low-level functions like GEMM, convolution, pooling, etc. directly to the user, which we need for GGML to work with it. You can find the documents here |
@marty1885 if my understanding of the hierarchy of the libraries presented here is correct, then the lowest abstraction is the Xilinx Runtime Library (XRT). Here is the documentation: https://xilinx.github.io/XRT/master/html/index.html |
JFYI, https://ryzenai.docs.amd.com/en/latest/modelcompat.html
|
The operation set is OK; it's good enough to accelerate LLaMA. The problem is the graph-based API: it's possible to hack around it, but it's likely to cause huge overhead and it's just too much work for me :( |
There's Vitis BLAS, but no idea if it actually works. It's under the Apache 2 license, so it would be compatible with llama.cpp. This would only require the Xilinx runtime, which from what I see should work. Some of the results of me browsing around:
RyzenAI-SW
The main Ryzen AI GitHub repo https://github.com/amd/RyzenAI-SW:
If you can convert ONNX to Vitis graphs and run them, the runtime and everything used by it should probably work on its own as well:
Xilinx Runtime
Directly coding Vitis
I don't have a newer Ryzen laptop, but will be getting one early next year (waiting for Hawk Point or more Phoenix laptops) and will look deeper into this then, if no one starts working on this before me (if it's even possible, AMD docs on point again...) |
Edit: Found this forum post HLS not supported on V70 so HLS isn't supported and the lowest-level access is just XIR after all... The supported operators also don't seem promising (matmul translates to a conv2d, which I have no idea how that would work, but probably not as efficiently as a normal matmul). Actually supported operators are here: https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Currently-Supported-Operators You can write custom operators, but since there's no HLS, only CPU. It makes sense, since otherwise it wouldn't really be possible to share the NPU between different processes, as HLS just generates an xclbin for flashing. The only lead to actual matmul access I have is the Windows IPU driver. Kinda sad, since the hardware seemed really cool to play around with: hardware used by XDNA/IPU and HLS. I thought we could use it just like an ALU for vectors and matrices, but I guess not .-. |
Thanks for the link, nice to see things moving forward. Since the example only used XRT directly, I decided to dig through the RyzenAI-SW repo again to see whether there are any other examples of matmul with XRT inside it, and found something promising: example/transformers/ops/cpp/qlinear_2/qlinear_2.hpp. If someone is fine with testing around a bunch, it should be doable to accelerate matmuls with int8 x int8 NPU kernels; the code is semi-well documented, and for experimenting you can use the Python API first as well. I'll probably not look into this anymore for a while / can't, thought Hawk Point laptops would come out faster -.- so good luck if anyone else wants to try this! The issue with creating this without an existing API is that you'd need to pack all the dependencies of that project into the .cpp file, which isn't the worst, but if these update or have bugs this is going to be annoying to maintain. The file's doc comment still says experimental as well, and I assume at some point there will be an official API for this anyway, so this would be more experimenting/trying it out rather than something that should be made into a PR imo. It also isn't clear to me whether an application/project should provide the xclbin for supported targets themselves (it uses this one for Phoenix) or whether they will be packed in a common folder like the overlays in the |
The Ryzen 7 8700G APU can have its DDR5 memory overclocked up to DDR5-10600. This would result in a purely theoretical memory bandwidth of around 140 GB/s, which is roughly half of a low-end GPU such as the 7600 XT. In exchange, you could easily get up to 48 GB of system RAM using 2x 24 GB sticks. Populating all slots will cause the frequency to decrease. OK, back to the topic at hand. Ryzen APUs come with the Ryzen AI Engine. From my research based on arlo-phoenix's references, you really only need XRT to launch DPU kernels. You can think of XRT as the equivalent of CUDA or OpenCL for the DPU. A DPU kernel in the repo has a corresponding .txt file (yes, the instructions are hex encoded) such as "a8w8acc32_8_2k_2k.txt", referring to an 8-bit 2048x2048 matrix multiplication. So the two remaining questions are:
XRT uses the concept of buffer objects (BO): https://xilinx.github.io/XRT/2023.2/html/xrt_native.main.html#buffer-apis
This is speculation from my side, but Apache TVM integration already exists for the DPUs in Xilinx's FPGAs. We don't know which exact DPU is in the Ryzen, but here is a list of supported DPUs: https://tvm.apache.org/docs/how_to/deploy/vitis_ai.html
Apache TVM by itself does not actually compile anything to the DPU instructions (https://github.com/apache/tvm/blob/main/python/tvm/relay/op/contrib/vitis_ai.py); it simply delegates the actual work to https://github.com/Xilinx/pyxir
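Just to make the buffer-object flow concrete, here is a minimal host-side sketch against the generic XRT native C++ API. The xclbin name, kernel name and buffer sizes are placeholders I made up, and the real Ryzen AI kernels also take extra arguments such as the hex-encoded instruction stream mentioned above, so treat this purely as an illustration of the BO concept rather than the actual NPU kernel ABI:

#include <cstddef>
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>

int main() {
    // Open the first device and load a (hypothetical) matmul overlay.
    auto device = xrt::device(0);
    auto uuid   = device.load_xclbin("matmul_overlay.xclbin");   // placeholder file name
    auto kernel = xrt::kernel(device, uuid, "matmul");           // placeholder kernel name

    const std::size_t bytes = 2048 * 2048;                       // one int8 2k x 2k operand (4 MiB)
    auto bo_a = xrt::bo(device, bytes, kernel.group_id(0));      // buffer objects (BOs)
    auto bo_b = xrt::bo(device, bytes, kernel.group_id(1));
    auto bo_c = xrt::bo(device, bytes, kernel.group_id(2));

    // ... fill bo_a / bo_b via bo.map<int8_t*>() or bo.write(...) ...
    bo_a.sync(XCL_BO_SYNC_BO_TO_DEVICE);                         // push operands to the device
    bo_b.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    auto run = kernel(bo_a, bo_b, bo_c);                         // launch the kernel
    run.wait();

    bo_c.sync(XCL_BO_SYNC_BO_FROM_DEVICE);                       // pull the result back
    return 0;
}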
Ok, so we are now down to the final question. Does pyxir actually generate the code, or will it call some proprietary library to which we have no access behind the scenes? I haven't dug too deep but here is a list of implemented DPU operations: https://github.com/Xilinx/pyxir/tree/master/python/pyxir/graph/ops Here is a description of the XIR based flow: https://docs.xilinx.com/r/1.1-English/ug1414-vitis-ai/XIR-Based-Flow-for-DPUv3 |
Considering the extreme similarity in the micro-architecture of the Versal AI engine versus the Ryzen AI engine and the extensive documentation of the Versal AI engine, this isn't going to be a ROCm dud. I'm making a bold prediction: The market for AI inference using consumer GPUs will disappear by the end of 2025 and nobody will ever consider using CUDA or ROCm for AI inference on a PC again. |
Ok, pyxir was a red herring, but my reasoning was the following: there are a few compiled .dll files containing a reference to Apache TVM, so that was my starting point, and it was a decent one.

Overview of the hardware architecture of the Ryzen AI NPU
The Ryzen 7040 officially has 10 TOPs and 20 tiles. That is 0.5 TOPs per tile, and both the Xilinx and the Ryzen AI documentation confirm this. The Vitis documentation refers to two types of AI Engine tiles: the old ones, which do not even support bfloat16, and the newer ones called AI Engine-ML, or AIE-ML for short. The bfloat16 performance is one quarter of the advertised integer TOPs, namely 2.5 TFLOPS (4 TFLOPS for the 8700G). Each tile contains a fully C-programmable RISC/VLIW processor with 16 KByte of instruction memory and 64 KByte of scratchpad memory, and it is also capable of accessing the memory of neighboring tiles. The tiles are arranged in columns of 4 tiles plus 1 memory tile that has 512 KB of SRAM with a memory bandwidth of at least 30 GB/s per memory tile. This means the total SRAM in the 8700G NPU will be 2 MB of local tile memory and another 4 MB in the memory tiles.

Resources:
Versal Adaptive SoC AIE-ML Architecture Manual (AM020): https://docs.xilinx.com/r/en-US/am020-versal-aie-ml/Overview
AI Engine-ML Intrinsics User Guide: https://japan.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop.html
AIE Tile vs AIE-ML Tile: https://www.xilinx.com/products/technology/ai-engine.html
Versal ACAP AI Engine: https://www.xilinx.com/content/dam/xilinx/support/documents/architecture-manuals/am009-versal-ai-engine.pdf
Riallto, an exploration framework for the AMD Ryzen AI NPU: https://riallto.ai/index.html
Riallto on GitHub: https://github.com/AMDResearch/Riallto
Design Rationale of Two Generations of AI Engines (PDF slides): https://indico.cern.ch/event/1079717/contributions/4541496/attachments/2324531/3959170/Design%20Rationale%20of%20Two%20Generations%20of%20AI%20Engines%20.pdf

How to program the AIE-ML Tiles
As mentioned above, there are extreme similarities between the AI Engine tiles used by Xilinx and the new AIE-ML tiles developed after the AMD acquisition. Xilinx AI Engine tiles were always C-programmable using Vitis AI, but AMD has decided to go a different route with Ryzen AI. To get an overview I recommend watching this video: https://www.youtube.com/watch?v=pfazqbOODIU. It is highly recommended. AMD has doubled down on building an MLIR (Multi-Level Intermediate Representation) backend for Ryzen AI called mlir-aie, which is available at https://github.com/Xilinx/mlir-aie.
So, first of all, this toolchain still depends on Vitis, but you can get a free license. It is therefore not fully open source, but you no longer need an FPGA license for access, and the proprietary components are kept to a minimum. So, how do you generate the MLIR? For that you need a frontend; just as clang is a C/C++ frontend for LLVM, Polygeist is a C/C++ frontend for MLIR. You can find Polygeist here: https://github.com/llvm/Polygeist. Alternatively, anything that targets MLIR can compile kernels, e.g. OpenAI's Triton, which lets you program kernels in Python. Here is a guide on how to do that: https://riallto.ai/4_2_write_your_kernel.html and here is the example kernel they showed:
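%%kernel
void passthrough(uint8_t *in_buffer, uint8_t *out_buffer, uint32_t nbytes)
{
    for(int i = 0; i < nbytes; i++) {
        out_buffer[i] = in_buffer[i];
    }
}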
There, it is just simple C code plus some vector intrinsics. To make it easier for the kernel developers, they have also developed automatic vectorization. https://xilinx.github.io/mlir-aie/AIEVectorization
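void conv2d(int img_in[17][272], int kernel_coeff[3][3], int img_out[16][256]) {
    for(int r = 0; r < 16; r++)
        for(int c = 0; c < 256; c++) {
            int acc = 0;
            for(int i = 0; i < 3; i++)
                for(int j = 0; j < 3; j++) {
                    acc += img_in[r+i][c+j] * kernel_coeff[i][j];
                }
            img_out[r][c] = acc;
        }
}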
which correctly vectorized into this MLIR:
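mlir-clang --function=conv2d conv2d_i32.c -S --raise-scf-to-affine | aie-opt --affine-loop-unroll="unroll-full unroll-full-threshold=3" --canonicalize -affine-super-vectorize="virtual-vector-size=8 vectorize-reductions" --aie-vectorize | aie-translate --aievec-to-cpp

void conv2d(int32_t * restrict v4, size_t m1, int32_t * restrict v5, size_t m2, int32_t * restrict v6, size_t m3) {
  size_t v7 = 0;
  size_t v8 = 2;
  v8int32 v9 = *(v8int32 *)(v5 + 3*v7+v7);
  v8int32 v10 = *(v8int32 *)(v5 + 3*v8+v8);
  size_t v11 = 0;
  size_t v12 = 16;
  size_t v13 = 1;
  for (size_t v14 = v11; v14 < v12; v14 += v13)
  chess_prepare_for_pipelining
  chess_loop_range(16, 16)
  {
    size_t v15 = 1;
    size_t v16 = v14 + v15;
    size_t v17 = 2;
    size_t v18 = v14 + v17;
    size_t v19 = 0;
    size_t v20 = 256;
    size_t v21 = 8;
    for (size_t v22 = v19; v22 < v20; v22 += v21)
    chess_prepare_for_pipelining
    chess_loop_range(32, 32)
    {
      v16int32 v23;
      int32_t * restrict r_v23_v4 = v4;
      v23 = upd_w(v23, 0, *(v8int32 *)(r_v23_v4 + 272*v14+v22));
      v8acc80 v24 = lmul8(v23, 0, 0x76543210, v9, 0, 0x00000000);
      size_t v25 = 1;
      size_t v26 = v22 + v25;
      v23 = upd_w(v23, 1, *(v8int32 *)(r_v23_v4 + 272*v14+v26 + 7));
      v24 = lmac8(v24, v23, 1, 0x76543210, v9, 1, 0x00000000);
      v24 = lmac8(v24, v23, 2, 0x76543210, v9, 2, 0x00000000);
      v16int32 v27;
      int32_t * restrict r_v27_v4 = v4;
      v27 = upd_w(v27, 0, *(v8int32 *)(r_v27_v4 + 272*v16+v22));
      v24 = lmac8(v24, v27, 0, 0x76543210, v9, 3, 0x00000000);
      v27 = upd_w(v27, 1, *(v8int32 *)(r_v27_v4 + 272*v16+v26 + 7));
      v24 = lmac8(v24, v27, 1, 0x76543210, v9, 4, 0x00000000);
      v24 = lmac8(v24, v27, 2, 0x76543210, v9, 5, 0x00000000);
      v16int32 v28;
      int32_t * restrict r_v28_v4 = v4;
      v28 = upd_w(v28, 0, *(v8int32 *)(r_v28_v4 + 272*v18+v22));
      v24 = lmac8(v24, v28, 0, 0x76543210, v9, 6, 0x00000000);
      v28 = upd_w(v28, 1, *(v8int32 *)(r_v28_v4 + 272*v18+v26 + 7));
      v24 = lmac8(v24, v28, 1, 0x76543210, v9, 7, 0x00000000);
      v24 = lmac8(v24, v28, 2, 0x76543210, v10, 0, 0x00000000);
      v8int32 v29 = srs(v24, 0);
      *(v8int32 *)(v6 + 256*v14+v22) = v29;
    }
  }
  return;
}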
Resources:
AI Engine Kernel Coding Best Practices Guide (UG1079): https://docs.xilinx.com/r/2021.2-English/ug1079-ai-engine-kernel-coding/Overview
T2: Leveraging MLIR to Design for AI Engines: https://www.xilinx.com/content/dam/xilinx/publications/presentations/leveraging-mlir-to-design-for-aie-fpga-2023.pdf
MLIR-based AIEngine toolchain: https://xilinx.github.io/mlir-aie/index.html
AIE Build License: https://riallto.ai/prerequisites-aie-license.html

Recommendations for llama.cpp kernel developers
Linux Setup and Build Instructions
I don't think I have the time to work on this, but I have looked at the intrinsics. These are the most relevant pages that you need to have glanced over at least once.
Vector Operations: Bitwise logical (https://japan.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__logic.html) is essential for dequantization. Each AI Engine tile has 64 vector lanes for the data type int8. You do not get anything smaller or bigger than this, but the good news is that all common bitwise operators are supported, such as bit shifting, AND masking, bitwise OR, etc. This means dequantization is possible. These operators do not exist in the old AI Engine tiles, which only support 8 bits, take it or leave it. So the takeaway is that the hardware is not gimped.
Now onto the more LLM-focused part. You can multiply a 4x8 with an 8x4 matrix, or multiply 16 1x2 matrices with 16 2x1 matrices (equivalent to 16 dot products of two 2-element vectors). This gives you the theoretical 4 TFLOPs performance. Any other operation will give you less, except 4-bit and 8-bit integers of course.
So what should be accelerated? I am not exactly an expert in the llama2 architecture, but you should focus on plain old matrix-matrix and matrix-vector multiplication. You should put the columns of the second matrix (for matrix-matrix multiplication) in the memory tiles and the intermediate results in the SRAM, and stream the quantized matrix weights directly from DRAM and dequantize them without spilling to the local scratchpad SRAM. The actual multiplication itself shouldn't be a big deal, but you might have to accumulate the results. The 16-dot-product strategy might be easy to whip up, and that should already get you 64 flops/cycle, at which point your bottleneck will never be compute ever again, only memory bandwidth. The CPU cores will stay mostly idle, and we will run LLMs like Mixtral 8x7b locally on our Strix Point laptops with 32 to 64 GB for probably under 1500€, instead of absurdly expensive NVIDIA GPUs where just the GPU costs you 2000€ with only 24 GB. I'm sure there will be enthusiasts stacking three to four GPUs so that they can run Goliath 120b at more than 1 token per second, but I would honestly be happy if I could get 8-bit Mixtral working.
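As a rough illustration of the 4x8 times 8x4 mode described above, here is a minimal per-tile sketch using the AIE API's aie::mmul. The function name, the packed data layout and the assumption that the 4x8x4 bfloat16 mmul shape is available on the XDNA tiles are mine, so treat it as a sketch under those assumptions rather than a tested kernel:

#include <aie_api/aie.hpp>

// Hypothetical tile kernel: multiply a 4xK row panel of A by a Kx4 column panel of B,
// one 4x8 * 8x4 bfloat16 block at a time, accumulating across K in the mmul accumulator.
// a and b are assumed to be pre-packed into consecutive 4x8 / 8x4 blocks of 32 elements.
void panel_matmul_4x8x4(const bfloat16 *__restrict a,
                        const bfloat16 *__restrict b,
                        float *__restrict c,
                        unsigned k_blocks)            // K / 8
{
    aie::mmul<4, 8, 4, bfloat16, bfloat16> acc;
    acc.mul(aie::load_v<32>(a), aie::load_v<32>(b));      // first block initializes the accumulator
    for (unsigned k = 1; k < k_blocks; ++k)
        acc.mac(aie::load_v<32>(a + 32 * k),               // accumulate the remaining blocks
                aie::load_v<32>(b + 32 * k));
    aie::store_v(c, acc.to_vector<float>());               // write the 4x4 result block
}

An int8 variant would follow the same mul/mac pattern with the integer mmul shapes and an int32 accumulator, with dequantization done beforehand using the bitwise vector operations mentioned above.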
Personal remarks
By the way, the reason why I am sort of obsessed with this is that I originally had the idea to use Efinix FPGAs, not because of the compute, but rather because you can just keep adding memory channels and connect FPGAs to each other, similar to Groq. This idea is dead in the water for two reasons: 1. Efinix FPGAs run their memory at really low frequencies and only DDR4, so you only get 28 GB/s per FPGA. 2. The total bfloat16 FLOPS are 0.5 TFLOPs. This means a single AI Engine-ML tile will roast the FPGA, and you get 32 of them per chip. So really, the only benefit of the FPGA approach would be that you could stack the memory all the way to 256 GB of RAM while staying under 5000€ per node. That is such a small niche that I recommend everyone to stop bothering with it. If you want to compete with Ryzen AI, build 256-bit memory channels onto your AI chip and put the DRAM over the AI accelerator via Package-on-Package, Multi-Chip Module or System-in-Package, or at least put the DRAM chips on the opposite side of the PCB, so that the vias go straight from the AI chip to the DRAM chip on the other side. Literally nobody is doing this except Apple. The NVIDIA way (HBM) is too expensive, and AMD is too stupid to get their GPU drivers fixed, so their GPUs are a big no-no. The AI Engines were originally developed by Xilinx, so they aren't cursed. |
You just need to be aware that Ryzen AI is not exposed on every laptop. I bought a Lenovo with a 7840HS and 32 GB of RAM crazy cheap, for under 600 GBP, and Ryzen AI is not showing up as a device under Linux.
|
FWIW: I found the source repository of the to-be-upstreamed LLVM backend, documentation included. In case this helps: https://github.com/Xilinx/llvm-aie/wiki/E2E-Linux-Example |
Initial support for Strix Point has arrived in LLVM-AIE. I'll try to test various options for getting XDNA2 working with llama.cpp.
EDIT: The 890M iGPU alone is doing pretty well with Ollama: ollama/ollama#3004 (comment) |
829 files changed lgtm |
Linux 6.14 will have the xdna Ryzen AI NPU Accelerator driver https://lore.kernel.org/lkml/CAPM=9tw+ySbm80B=zHVhodMFoS_fqNw_v4yVURCv3cc9ukvYYg@mail.gmail.com/ |
Finally. |
Do you have the patch thread? I maintain https://aur.archlinux.org/packages/linux-mainline-um5606 and I want to make sure I'm providing up-to-date NPU patches |
Will this work for the NPU in the AI 9 HX 370 CPU series? (Strix Point I believe) |
Yes, it's the same XDNA architecture, just in the second revision |
Yes! I've actually been using that patchset for about a month on my Asus UM506, which has that same processor. |
Awesome!! Do you have any benchmarks? |
Folks, by loading a driver you won't get your LLMs any faster. The AIE (NPU) is a separate compute unit and requires explicit support in software, i.e. adding an AIE backend to ggml (https://www.xilinx.com/htmldocs/xilinx2024_2/aiengine_api/aie_api/doc/index.html). The ELF must be wrapped into an AXLF and loaded onto the NPU. I'm actually looking at this ATM, but it is a far bigger task than I originally expected, especially if one wants to use the full power of the AIE and its tile architecture. |
I don't think many people here think that the driver is the end of the story; I'd say it's just the beginning. How are developers supposed to leverage the NPU without drivers? Now that AMD has finally delivered the bare minimum to support their hardware, it is possible for devs to make their software utilize it. Is there any documentation from Xilinx or AMD that helps with porting software? If so, it would be appreciated if you could share some links |
unstale |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
XDNA support has been released in the 6.14 Kernel, and I still have not heard anything from AMD engineers about enabling it anywhere |
Is that a surprise? AMD doesn't take software very seriously. It has been like that for as long as I have worked with AMD products, so for at least 25+ years. Neglecting the software side of the business is in their DNA. |
Perhaps that's why Linus didn't have much interest in merging it: it doesn't make much difference before and after the merge, after all... |
There seems to be someone from AMD quietly and actively working on enabling ggml to run on AMD NPUs based on HSA in this branch. Right now there's a new HSA backend, a matmul kernel and several others. With ROCm 6.4, the NPU now shows up as an HSA agent:
Agent 3
  Name: aie2
It seems like a lot of NPU-related things have recently been happening behind the scenes, and I believe we'll see llama.cpp running on the NPU sooner or later. Unfortunately it's mostly news for Strix Point/Halo/Krackan, and not for previous generations like Phoenix/Hawk. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Enhancement
Are there any plans to support the AMD XDNA AI Engine (in AMD Ryzen 7x40 (x = 6,8,9) processors)?