Description
Roughly after b1412, the server no longer answers when using llama-2-70b-chat, while it still answers when using Mistral-0.1.
Rolling back to an earlier version resolves the issue.
I'm happy to test any version, or even to give access to hardware if needed.
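For reference, the request that produces the logs below is just a plain POST to the /completion endpoint on the local server; a minimal sketch along these lines (the prompt and sampling parameters are placeholders, not the exact ones I used):

```python
# Minimal sketch of the request sent to the server (placeholder prompt/params).
import json
import urllib.request

payload = {
    "prompt": "### User: Say hello.\n### Assistant:",  # placeholder prompt
    "n_predict": 64,                                    # cap on generated tokens
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://127.0.0.1:8081/completion",                 # host/port from the server log
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req, timeout=600) as resp:
    print(json.loads(resp.read())["content"])           # generated text from the server
```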
Server output shows no error, but the new build never prints the timings.
Old build (after rolling back) - timings are printed:
Oct 24 14:42:38 llama server[981998]: llama_new_context_with_model: freq_scale = 1
Oct 24 14:42:39 llama server[981998]: llama_kv_cache_init: offloading v cache to GPU
Oct 24 14:42:39 llama server[981998]: llama_kv_cache_init: offloading k cache to GPU
Oct 24 14:42:39 llama server[981998]: llama_kv_cache_init: VRAM kv self = 160.00 MB
Oct 24 14:42:39 llama server[981998]: llama_new_context_with_model: kv self size = 160.00 MB
Oct 24 14:42:39 llama server[981998]: llama_new_context_with_model: compute buffer total size = 151.13 MB
Oct 24 14:42:39 llama server[981998]: llama_new_context_with_model: VRAM scratch buffer: 145.00 MB
Oct 24 14:42:39 llama server[981998]: llama_new_context_with_model: total VRAM used: 46627.61 MB (model: 46322.61 MB, context: 305.00 MB)
Oct 24 14:42:39 llama server[981998]: llama server listening at http://127.0.0.1:8081
Oct 24 14:42:39 llama server[981998]: {"timestamp":1698158559,"level":"INFO","function":"main","line":1746,"message":"HTTP server listening","hostname":"127.0.0.1","port":8081}
Oct 24 14:42:51 llama server[981998]: llama_print_timings: load time = 13864.16 ms
Oct 24 14:42:51 llama server[981998]: llama_print_timings: sample time = 25.96 ms / 60 runs ( 0.43 ms per token, 2310.80 tokens per second)
Oct 24 14:42:51 llama server[981998]: llama_print_timings: prompt eval time = 1022.36 ms / 138 tokens ( 7.41 ms per token, 134.98 tokens per second)
Oct 24 14:42:51 llama server[981998]: llama_print_timings: eval time = 3947.52 ms / 59 runs ( 66.91 ms per token, 14.95 tokens per second)
Oct 24 14:42:51 llama server[981998]: llama_print_timings: total time = 5013.52 ms
Oct 24 14:42:51 llama server[981998]: {"timestamp":1698158571,"level":"INFO","function":"log_server_request","line":1233,"message":"request","remote_addr":"127.0.0.1","remote_port":58996,"status":200,"method":"POST","path":"/completion","params":{}}

New build (after b1412) - no timings printed:
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: freq_scale = 1
Oct 24 14:37:13 llama server[981689]: llama_kv_cache_init: offloading v cache to GPU
Oct 24 14:37:13 llama server[981689]: llama_kv_cache_init: offloading k cache to GPU
Oct 24 14:37:13 llama server[981689]: llama_kv_cache_init: VRAM kv self = 160.00 MB
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: kv self size = 160.00 MB
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: compute buffer total size = 151.13 MB
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: VRAM scratch buffer: 145.00 MB
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: total VRAM used: 46627.61 MB (model: 46322.61 MB, context: 305.00 MB)
Oct 24 14:37:13 llama server[981689]: Available slots:
Oct 24 14:37:13 llama server[981689]: -> Slot 0 - max context: 512
Oct 24 14:37:13 llama server[981689]: llama server listening at http://127.0.0.1:8081
Oct 24 14:37:13 llama server[981689]: {"timestamp":1698158233,"level":"INFO","function":"main","line":2499,"message":"HTTP server listening","hostname":"127.0.0.1","port":8081}
Oct 24 14:37:13 llama server[981689]: all slots are idle and system prompt is empty, clear the KV cache
Oct 24 14:37:22 llama server[981689]: slot 0 is processing [task id: 0]
Oct 24 14:37:22 llama server[981689]: slot 0 : kv cache rm - [0, end)
Oct 24 14:37:22 llama server[981689]: slot 0 released (138 tokens in cache)
Oct 24 14:37:22 llama server[981689]: {"timestamp":1698158242,"level":"INFO","function":"log_server_request","line":2163,"message":"request","remote_addr":"127.0.0.1","remote_port":42486,"status":200,"method":"POST","path":"/completion","params":{}}