Skip to content

context : restore preemptive sched reset when LLAMA_SET_ROWS=0 #14870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 25, 2025

Conversation

ggerganov
Copy link
Member

fix #14863

@slaren
Copy link
Member

slaren commented Jul 25, 2025

Model Test t/s master t/s gg/sched-reset-preemt Speedup
llama 7B Q4_0 pp512 5676.37 5614.36 0.99
llama 7B Q4_0 tg128 123.60 148.16 1.20

@ggerganov ggerganov merged commit c1dbea7 into master Jul 25, 2025
54 of 55 checks passed
@ggerganov ggerganov deleted the gg/sched-reset-preemt branch July 25, 2025 11:28
taronaeo pushed a commit to taronaeo/llama.cpp-s390x that referenced this pull request Jul 25, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 25, 2025
* origin/master:
docs : update HOWTO‑add‑model.md for ModelBase and new model classes (ggml-org#14874)
ggml : remove invalid portPos specifiers from dot files (ggml-org#14838)
context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (ggml-org#14870)
mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (ggml-org#14503)
rpc : check for null buffers in get/set/copy tensor endpoints (ggml-org#14868)
sched : fix multiple evaluations of the same graph with pipeline parallelism (ggml-org#14855)
musa: upgrade musa sdk to rc4.2.0 (ggml-org#14498)
sync : ggml
cmake : fix usage issues (ggml/1257)
ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
context : perform output reorder lazily upon access after sync (ggml-org#14853)
chat : fix kimi-k2 chat template (ggml-org#14852)
sycl: fixed semantics of block offset calculation (ggml-org#14814)
llama : fix MiniCPM inference after Granite Four changes (ggml-org#14850)
docs: add libcurl-dev install hint for Linux distros (ggml-org#14801)
metal : fix fusion across different encoders (ggml-org#14849)
sycl: fix undefined variable in work group size check (ggml-org#14843)
convert : text-only support for GLM-4.1V-9B-Thinking (ggml-org#14823)
CUDA: fix overflow in FA, tune performance (ggml-org#14840)
CUDA: fix compilation with GGML_CUDA_F16 (ggml-org#14837)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance regression with multiple GPUs in commit 01612b7
2 participants