Description
Hey there,
thank you for your great work on llama.cpp!
I am using it in my bachelor's thesis to build an LLM benchmarking tool. To make full use of large GPUs, I run a llama.cpp server in the background and generate HTTP requests from multiple threads, which gives me much faster execution times.
For my benchmarks it is important to get deterministic results when prompting the model, so I set the temperature to 0 and disable the other samplers (see the request payload in the proof of concept below). However, when closely inspecting the returned completions and logits across multiple runs, I realised that they are not deterministic.
I have created the proof of concept below, which I executed on an H100. It spawns a llama.cpp server with 8 slots and sends the same prompt to each slot from multiple threads. I expected all completions to be identical, but they are not: across multiple runs of the script I get between 5 and 8 unique completion texts with 8 slots. If everything were completely deterministic, there should be only a single unique completion text in this case.
When using a single slot, I always get the same answer, but there are still small variations in the logits. They don't seem to be large enough to cause different tokens to be selected, though.
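For reference, this is roughly how I compare the per-token probabilities between two runs (simplified). It works on the raw JSON responses of two identical requests and assumes the /completion response format with "n_probs": 1, i.e. a completion_probabilities list whose probs entries contain tok_str and prob; the field names are taken from the server version I am running and may differ in other versions.

def extract_top_probs(response: dict):
    # With "n_probs": 1 the server reports, for each generated token, the most
    # likely candidate and its probability. Field names assumed from b2774.
    return [
        (candidate["tok_str"], candidate["prob"])
        for token_info in response.get("completion_probabilities", [])
        for candidate in token_info.get("probs", [])
    ]

def diff_runs(response_a: dict, response_b: dict):
    # Positions where the top token or its probability differs between two runs.
    probs_a, probs_b = extract_top_probs(response_a), extract_top_probs(response_b)
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(probs_a, probs_b))
        if a != b
    ]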
I am currently running llama.cpp version b2774. I see this behavior on my MacBook Pro M1 as well as on an H100 and an A100, and with different models and different quantizations. It seems to me that the output just gets more random the more slots I use.
Can anyone explain what is happening here? Is there a way to force the outputs to be deterministic? Am I missing something?
I would really appreciate any help!
Best
Leon
import json
import subprocess
import threading
import time
from pathlib import Path
from queue import Queue
from typing import List

from requests import Session
from tqdm import tqdm


def create_completion(prompt: str, slot_id: int):
    request = {
        "prompt": prompt,
        "id_slot": slot_id,  # ensure that a thread only uses its own server slot
        "n_predict": 128,
        "n_probs": 1,
        "temperature": 0,
        "samplers": ["temperature"],
        "seed": 1234,
        "repeat_last_n": 0,
        "min_p": 0.0,
        "top_p": 1.0,
        "top_k": 100,
        "repeat_penalty": 1.0,
        "mirostat_eta": 0.0,
        "mirostat_tau": 0.0,
        "cache_prompt": False
    }
    raw_completion_response = session.post(url=completion_url, headers=headers, json=request).json()
    return raw_completion_response["content"]


def run_subset(thread_id: int, prompts: List[str], output_queue: Queue, shared_progressbar: tqdm):
    # Each thread sends its prompts to the slot matching its thread id.
    for prompt in prompts:
        response = create_completion(prompt, thread_id)
        output_queue.put(response)
        shared_progressbar.update(1)


def run_all(prompts: List[str]):
    threads = []
    output_queue = Queue()

    def distribute_chunks(data, num_threads):
        # Split the prompts into num_threads chunks of (almost) equal size.
        n = len(data)
        chunk_size = n // num_threads
        remainder = n % num_threads
        chunks = []
        start = 0
        for thread_id in range(num_threads):
            end = start + chunk_size + (1 if thread_id < remainder else 0)
            chunks.append(data[start:end])
            start = end
        return chunks

    chunks = distribute_chunks(data=prompts, num_threads=n_parallel)
    shared_progressbar = tqdm(total=len(prompts), desc=f"Prompting model on {n_parallel} server slots.")
    for i in range(n_parallel):
        thread = threading.Thread(target=run_subset, args=(i, chunks[i], output_queue, shared_progressbar))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    shared_progressbar.close()

    all_results: List[str] = []
    while not output_queue.empty():
        all_results.append(output_queue.get())
    return all_results


def kill_all_old_servers():
    # Best-effort cleanup of server processes left over from previous runs
    # (sketched here; adjust the pattern to your setup).
    subprocess.run(["pkill", "-f", "build/bin/server"], check=False)


if __name__ == '__main__':
    prompts = [
        "Once upon a time..."
    ]
    n_parallel = 8
    server_binary_path = Path("../llama.cpp/build/bin/server")
    model_path = Path("../models/llama-2-7b-chat.Q4_K_M.gguf")
    completion_url = "http://localhost:8080/completion"
    headers = {'content-type': 'application/json'}
    session: Session = Session()

    kill_all_old_servers()

    # spawn a new server
    server_process_arguments = [
        str(server_binary_path),
        "-m", str(model_path),
        "-b", "1024",
        "-c", "8192",
        "-ngl", "1000",
        "-np", str(n_parallel)
    ]
    process = subprocess.Popen(server_process_arguments, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
    time.sleep(2)  # wait for the server to start

    # send the same prompt 16 times and count how many distinct completions come back
    results = run_all(prompts=prompts * 16)
    unique_results = len(set(results))
    print(json.dumps(unique_results, indent=2))
    process.terminate()