Misc. bug: The model's reasoning performance has significantly decreased across different llama.cpp versions, despite using the same model, identical parameters, and the same set of questions. #12816
Comments
I am not 100% certain whether the flash attention update in b4759 has any bugs. After multiple tests, I've become confused myself. I hope someone can conduct some comparative testing on this FA update as well. I look forward to seeing the results of your tests.
Is this with the CUDA backend? What hardware?
Yes, the hardware is an RTX 4090 GPU with CUDA 12.8 on Debian 12. After disabling the -fa option, all versions exhibit consistent performance. I tested several inference scenarios and noticed some "abnormal" behavior in b4759 and later versions. Reviewing the code, the only difference between b4756 and b4759 is the CUDA implementation of the FA feature. The QwQ reasoning model clearly shows distinguishable reasoning capability between these two versions.
I can confirm results similar to your findings. In my case I ran a QwQ IQ4_XS quant on two 4070s (1 RPC worker and 1 master), using greedy sampling with no speculation, 1 server slot, and effectively no prompt cache for the tests, so results are fully deterministic. If you don't set up a deterministic run you can get different results every time, making it very hard to track down problems.

Flash attention on: inconsistent across versions. Flash attention off: consistent across versions.

Failure to solve the problem with FA off might not be a problem in itself, as the model could just be unstable with greedy sampling at the quant I'm running. Inconsistent results across versions with FA on, however, suggest a possible issue with the FA-related changes.

Note that in other tests unique to my downstream server version I was able to get the model to solve the cipher very efficiently, in about 1500 tokens, by speculating it with an R1 distill, so token processing was being done in batches > 1 by the target QwQ model. It might be a coincidence that it worked, or it might point to some issue with batch size 1 in the new FA code.
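If anyone wants to reproduce this kind of deterministic A/B comparison, a minimal sketch against the llama-server HTTP API is below. It assumes two builds (e.g. b4756 and b4759) are already running locally with the same model and launch flags; the ports, prompt, and token budget are placeholders, not the exact setup described above.

```python
# Minimal sketch: send the same prompt to two llama-server builds with greedy
# sampling and prompt caching disabled, then compare how many tokens each
# build generates. Ports, seed and the token budget are assumptions.
import requests

PROMPT = 'Can you help me decrypt this cipher I received?\n"K nkmg rncakpi hqqvdcnn."'

def run(base_url: str) -> dict:
    payload = {
        "prompt": PROMPT,
        "n_predict": 8192,      # generous budget so reasoning is not cut off
        "temperature": 0.0,     # greedy sampling
        "top_k": 1,
        "cache_prompt": False,  # avoid reusing cached KV state between runs
        "seed": 42,
    }
    return requests.post(f"{base_url}/completion", json=payload, timeout=600).json()

for build, url in [("b4756", "http://localhost:8080"), ("b4759", "http://localhost:8081")]:
    result = run(url)
    print(build, "tokens predicted:", result.get("tokens_predicted"))
```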
pinging @JohannesGaessler - This sounds like a possible precision issue introduced in #12014?
Btw, I recently fixed the Metal FA kernels to always use
Generally speaking, there is no guarantee that results will be bit-for-bit identical across versions. I changed the order in which floating point operations are done, and as such the results will inevitably change due to floating point rounding error. Especially at e.g. the beginnings of sentences the token distribution is very flat, and small changes can result in a different token being sampled. A single prompt is unfortunately simply not enough data to tell whether or not there is a statistically significant difference in the average reasoning ability of the model.

I'm currently working on code for evaluating language model benchmarks using the llama.cpp server. So far I have support for MMLU (14k questions) and GSM8K (1300 problems). I'll prioritize adding support for models that need more than 24 GB of memory and investigate whether the FlashAttention changes have made a statistically significant difference.
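To make the "statistically significant" point concrete, here is a rough sketch (my addition, with made-up counts) of a two-proportion z-test one could apply to accuracy numbers from two builds; at 100 questions even a 5-point gap is well within noise.

```python
# Illustrative sketch: two-proportion z-test for "do builds A and B have the
# same accuracy on this benchmark?". The counts below are placeholders, not
# measured results from this thread.
from math import erf, sqrt

def two_proportion_p_value(correct_a: int, correct_b: int, n: int) -> float:
    """Two-sided p-value for H0: both builds have equal accuracy on n questions."""
    pa, pb = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        return 1.0
    z = abs(pa - pb) / se
    # Convert z to a two-sided p-value via the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# e.g. 81/100 vs 76/100 correct -> p ≈ 0.39, nowhere near significant
print(two_proportion_p_value(81, 76, 100))
```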
I tested QwQ 32b q8_0 on 100 questions of MMLU and GSM8K. The two setups I currently have are "instant", where the model is forced to provide an answer immediately, and "normal", where the model is allowed to reason. I'm using greedy sampling with no modifiers to the token probabilities except for a grammar that forces the model to choose between the four answers of MMLU when asked for a final answer at the end.
It seems that there are indeed changes, but they are small and not consistently better or worse.
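For anyone unfamiliar with the grammar-constrained final answer mentioned above, below is a sketch of what such a request could look like against llama-server's /completion endpoint; the GBNF rule and prompt skeleton are my own illustration, not the exact harness used for these runs.

```python
# Illustration only: constrain the final answer to exactly one of the four
# MMLU choices using a GBNF grammar. The prompt skeleton is a placeholder.
import requests

ANSWER_GRAMMAR = 'root ::= "A" | "B" | "C" | "D"'

payload = {
    "prompt": "Question: ...\nA) ...\nB) ...\nC) ...\nD) ...\nFinal answer: ",
    "n_predict": 1,             # a single constrained token is enough here
    "temperature": 0.0,         # greedy, as in the runs above
    "grammar": ANSWER_GRAMMAR,  # forces the output to be A, B, C or D
}
response = requests.post("http://localhost:8080/completion", json=payload).json()
print("model chose:", response["content"])
```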
Some perplexity results to track changes vs. version: my downstream server can compute perplexity at batch size 1 or batch size 128, and the results are also compared against llama-perplexity across builds b4742, b4759, and b5121.
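For reference, the quantity being tracked is just the exponential of the average negative log-likelihood over the evaluated tokens, so any backend change that nudges per-token log-probabilities moves it. A minimal sketch of the definition (not the downstream server's actual code):

```python
# Minimal sketch of the perplexity definition: exp of the mean negative
# log-likelihood of the evaluated tokens (natural-log probabilities).
from math import exp

def perplexity(token_logprobs: list[float]) -> float:
    return exp(-sum(token_logprobs) / len(token_logprobs))

# e.g. three tokens with probabilities 0.5, 0.25 and 0.125 -> PPL ≈ 4.0
print(perplexity([-0.6931, -1.3863, -2.0794]))
```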
I think QwQ 32b is just an unstable checkpoint, at least at 4-bit quants. QwQ Preview and the DeepSeek R1 distill based on Qwen 32b did not show hypersensitive behavior like this at 4-bit quants. QwQ 32b often gets stuck in think mode, and often gets into short-term repeats while thinking. I believe small changes in backend math are exposing model instability.
I can confirm that using … I haven't had time to find the exact details though, as I'm away.
Sorry if this was a confusing post: it was linked from #12801 (comment) and I thought it was specifically about draft models.
I saw that version b5028 was released (llama: add option to override model tensor buffers #11397), and I was excited to compile and use this official version. However, I was surprised to find that the models I normally use now behave like a completely different person: they've lost their previous rationality and conciseness, becoming more emotional, verbose, and even unstable in mood. After multiple comparative tests, I found that this change was caused by the update to the attention calculation in this version.

For now I'm forced to stay on the older branch at https://github.com/ggml-org/llama.cpp/tree/sl/custom-tensor-offload (which has since been deleted). That branch combines the old FA implementation with the ability to offload expert tensors to system memory. I'm hoping that someone with expertise can help restore the main branch to its former rational and concise behavior.
I saved a copy of this branch here if it's any use: https://github.com/jukofyork/llama.cpp/tree/custom-tensor-offload
I spent half an hour trying to fork it but couldn't get it to work. I gave up. I still haven't mastered using GitHub properly. However, I had saved the ZIP source code package earlier, so I can still use that.
Name and Version
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
llama.cpp-b4702
llama.cpp-b4751
llama.cpp-b4756 ************** (last version behaving as expected)
llama.cpp-b4759 ************** (first version with degraded behavior)
llama.cpp-b4761
llama.cpp-b4762
llama.cpp-b4769
llama.cpp-b4775
llama.cpp-b4800
llama.cpp-b4900
llama.cpp-b4940
llama.cpp-b4990
llama.cpp-b5026
llama.cpp-b5030
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server, llama-cli
Command line
Problem description & steps to reproduce
Testing with Different Versions of the llama.cpp Server for the Same Inference Task
Using two versions of the llama.cpp server to address the same problem:
llama.cpp-b4756
llama.cpp-b4759
Both versions employ identical parameters and models, yet exhibit significant performance differences.
Key observations:
Performance degradation:
b4759 is noticeably less capable than b4756 (more than twice as bad in some cases).
Token consumption for the same task:
b4756: ~3,000 tokens
b4759: ~6,000 tokens
Version comparison:
b4702 (an older version) performs better than b4756.
The test problem used:
Can you help me decrypt this cipher I received?
"K nkmg rncakpi hqqvdcnn."
This behavior is reproducible across multiple tests. After extensive testing, b4759 was identified as the version where performance drastically degraded.
If you can reproduce similar findings, please share your test cases!
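For reference, a worked solution to the test prompt: the ciphertext is a Caesar cipher with every letter shifted forward by two, so shifting back by two recovers "I like playing football." A small decryption sketch:

```python
# Decrypt the test cipher: each letter was shifted forward by 2 (Caesar),
# so shifting back by 2 recovers the plaintext.
def caesar_shift(text: str, shift: int) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(caesar_shift("K nkmg rncakpi hqqvdcnn.", -2))  # -> I like playing football.
```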
First Bad Commit
No response
Relevant log output