
Conversation

@lhez lhez (Collaborator) commented Mar 18, 2025

  • Wait for profiling events and collect profiling data only once model execution is done. This way, the displayed performance numbers are closer to the true performance (a sketch of the idea follows below).
  • Generate a Chrome trace in addition to the csv.
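As an illustration of the two bullets above, here is a minimal sketch, not this PR's actual code, of deferring all profiling queries until the whole graph has executed and then emitting the timings in the Chrome trace event format. The `dump_chrome_trace` helper and the per-kernel `(name, cl_event)` vector are assumptions for illustration; the OpenCL calls (`clWaitForEvents`, `clGetEventProfilingInfo`) and the Chrome trace JSON schema are real.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: the backend is assumed to have recorded one
// (kernel name, cl_event) pair per enqueued kernel during graph execution.
void dump_chrome_trace(const std::vector<std::pair<std::string, cl_event>> & evs,
                       const char * path) {
    // Wait for all profiling events only after the whole graph has run,
    // so the timing queries do not perturb the kernels being measured.
    std::vector<cl_event> raw;
    raw.reserve(evs.size());
    for (const auto & e : evs) raw.push_back(e.second);
    clWaitForEvents((cl_uint) raw.size(), raw.data());

    FILE * f = fopen(path, "w");
    if (!f) return;
    fprintf(f, "{\"traceEvents\":[\n");
    for (size_t i = 0; i < evs.size(); ++i) {
        cl_ulong t0 = 0, t1 = 0; // device timestamps, in nanoseconds
        clGetEventProfilingInfo(evs[i].second, CL_PROFILING_COMMAND_START,
                                sizeof(t0), &t0, nullptr);
        clGetEventProfilingInfo(evs[i].second, CL_PROFILING_COMMAND_END,
                                sizeof(t1), &t1, nullptr);
        // Chrome trace "complete" events ("ph":"X") take ts/dur in microseconds.
        fprintf(f, "%s{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%llu,\"dur\":%llu,"
                   "\"pid\":0,\"tid\":0}\n",
                i ? "," : "", evs[i].first.c_str(),
                (unsigned long long) (t0 / 1000),
                (unsigned long long) ((t1 - t0) / 1000));
    }
    fprintf(f, "]}\n");
    fclose(f);
}
```

The resulting file can be opened in chrome://tracing or Perfetto to view each kernel on a timeline alongside the csv numbers.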

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Mar 18, 2025
@lhez lhez marked this pull request as ready for review March 18, 2025 04:50
@jeffzhou2000 jeffzhou2000 commented Mar 18, 2025

Sorry to bother you, but how can I mark a specified PR as ready for review? Thanks.
One more thing: could a Qualcomm expert help review a PR for a Qualcomm QNN backend for llama.cpp from an independent programmer (that's me; I already bought a Snapdragon 8 Gen 3 Android phone for personal learning/dev activity and plan to buy a Snapdragon 8 Elite phone for further dev activity later)? I know your team already has a senior technical expert who knows everything about ggml-qnn, and I hope to get help from your colleague. Thanks so much.

@max-krasnyansky max-krasnyansky (Collaborator) left a comment

Looks good, and reminds me that I wanted to integrate the graph-profiler branch with the opencl and cpu backends.

@max-krasnyansky max-krasnyansky merged commit d84635b into ggml-org:master Mar 18, 2025
46 of 47 checks passed
@max-krasnyansky max-krasnyansky (Collaborator) commented

> Sorry to bother you, but how can I mark a specified PR as ready for review? Thanks.

If you start a draft PR, there is a way to mark it ready. The qnn-backend PR is not marked as a draft.

> One more thing: could a Qualcomm expert help review a PR for a Qualcomm QNN backend for llama.cpp from an independent programmer (that's me; I already bought a Snapdragon 8 Gen 3 Android phone for personal learning/dev activity and plan to buy a Snapdragon 8 Elite phone for further dev activity later)? I know your team already has a senior technical expert who knows everything about ggml-qnn, and I hope to get help from your colleague. Thanks so much.

I've been keeping an eye on it. In general, I'd say QNN is not the right solution here, but I'll take another look.

@jeffzhou2000 jeffzhou2000 commented Mar 19, 2025

Thanks for your valuable time and attention on ggml-qnn (a Qualcomm backend for llama.cpp via the Qualcomm QNN SDK) and on my third formal ggml-qnn PR.

  1. is "QNN is not the right solution here" means we can't utilize the Hexagon NPU maximally through Qualcomm's QNN SDK in llama.cpp? even through the "second tech approach(mapping the entire ggml cgraph to a single QNN graph)"?

  2. Is that why the GPU performance in that PR (and all similar PRs) is so poor while Intel's ggml-sycl performs so well? And accordingly, is that why Qualcomm provides ggml-opencl as its GPU backend (the right solution for a Qualcomm GPU backend in llama.cpp)?

  3. Is that why the NPU performance in that PR (and all similar PRs) is so poor?

  4. Is that why Qualcomm hasn't provided an official ggml-qnn PR, either via the general approach used in Intel's ggml-sycl or via the second tech approach (mapping the entire ggml cgraph to a single QNN graph) with the QNN SDK?

  5. From the following diagram,
    [diagram: qualcomm-qnn-sdk]

we can see that the key point of "how to utilize the Hexagon NPU maximally through the QNN SDK" is composing an ideal single QNN graph from an entire/complete ggml cgraph. As I understand it, that is exactly the core principle in the QNN SampleApp and the Genie stack (Qualcomm's dedicated binary tool converts a specified LLM model into a prepared QNN graph).

There are some technical questions I don't understand:

  • why "QNN is not the right solution here"? is there any other AI software stack or SDK provided by Qualcomm for llama.cpp's Hexagon NPU backend?
  • or the first tech approach(general approach in Intel's ggml-sycl) and the second tech approach(mapping a complete/entire ggml cgraph to a single QNN graph through QNN SDK) in that PR are both incorrect?
  • or the second tech approach(mapping the entire/complete ggml cgraph to a single QNN graph through QNN SDK) is correct path but the practice in that PR is not exactly correct? I understand the shared-buffer or memory-pool also should be used in the second tech approach.
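To make the "second tech approach" concrete, here is a hypothetical sketch of composing one QNN graph from a whole ggml cgraph. The `qnn_graph_*` functions are invented placeholders, not real QNN SDK entry points; only the ggml cgraph traversal (`ggml_graph_n_nodes`, `ggml_graph_node`) is real API.

```cpp
#include "ggml.h"

// Invented placeholder API standing in for the real QNN graph-composition calls.
struct qnn_graph;
qnn_graph * qnn_graph_begin(const char * name);
bool        qnn_graph_add_op(qnn_graph * g, const struct ggml_tensor * node);
bool        qnn_graph_finalize_and_execute(qnn_graph * g);

// Walk the whole ggml cgraph once and build a single QNN graph from it,
// so the NPU toolchain can fuse and schedule the entire graph instead of
// receiving one op at a time.
static bool run_cgraph_on_qnn(struct ggml_cgraph * cgraph) {
    qnn_graph * g = qnn_graph_begin("llama_cgraph");
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        // One unsupported op breaks the whole-graph mapping; the backend
        // would then have to split the graph or fall back to per-op offload.
        if (!qnn_graph_add_op(g, node)) {
            return false;
        }
    }
    // Finalizing once over the entire graph is where the NPU compiler gets
    // its chance to fuse ops and plan memory; that is the point of this
    // approach versus per-op offload.
    return qnn_graph_finalize_and_execute(g);
}
```

Whether this is enough to get good NPU performance (versus a lower-level path such as the Hexagon SDK) is exactly the question being asked here.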

I, and the wider llama.cpp community, would welcome your further guidance and help.

Thanks so much again.

@jeffzhou2000 jeffzhou2000 commented Mar 20, 2025

@max-krasnyansky, thanks so much for your valuable guidance/correction on direction.

I think I know something about the third tech approach to "utilizing the Hexagon NPU maximally": the Hexagon DSP SDK should be used in that approach. That is similar in spirit to what your excellent engineering team did with ggml-opencl, and to what I did with hardware-accelerated video decoding many years ago (also on a DSP chip).

My guess might not be correct; accordingly, it would be greatly appreciated if you could give me/the llama.cpp community a clear explanation, or a rough confirmation, of the third tech approach.

