
Bug: Server Empty Responses after ~b1412 using CUDA and llama-2-70b #3761

Closed

Description

@adrianliechti

Roughly after b1412, the server no longer returns any output when using llama-2-70b-chat, while it still answers when using Mistral-0.1.
Rolling back to an earlier version resolves the issue.

I'm happy to test any version, or even give access to hardware if needed.
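
To reproduce, a minimal client against the server's `/completion` endpoint (port 8081, as in the logs below) can be used; the prompt and `n_predict` values here are arbitrary. A sketch:

```python
import json
import urllib.request

# Minimal repro against the llama.cpp server (port 8081 as in the logs below).
# Prompt and n_predict are arbitrary; on the affected builds the "content"
# field of the JSON response comes back empty even though the status is 200.
req = urllib.request.Request(
    "http://127.0.0.1:8081/completion",
    data=json.dumps({"prompt": "Building a website can be done in 10 steps:",
                     "n_predict": 64}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=300) as resp:
    body = json.load(resp)

print(repr(body.get("content", "")))  # expected: generated text; observed: ""
```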

The server output shows no error, but the broken build no longer prints timings. Working build (timings printed after the /completion request):

Oct 24 14:42:38 llama server[981998]: llama_new_context_with_model: freq_scale = 1
Oct 24 14:42:39 llama server[981998]: llama_kv_cache_init: offloading v cache to GPU
Oct 24 14:42:39 llama server[981998]: llama_kv_cache_init: offloading k cache to GPU
Oct 24 14:42:39 llama server[981998]: llama_kv_cache_init: VRAM kv self = 160.00 MB
Oct 24 14:42:39 llama server[981998]: llama_new_context_with_model: kv self size  =  160.00 MB
Oct 24 14:42:39 llama server[981998]: llama_new_context_with_model: compute buffer total size = 151.13 MB
Oct 24 14:42:39 llama server[981998]: llama_new_context_with_model: VRAM scratch buffer: 145.00 MB
Oct 24 14:42:39 llama server[981998]: llama_new_context_with_model: total VRAM used: 46627.61 MB (model: 46322.61 MB, context: 305.00 MB)
Oct 24 14:42:39 llama server[981998]: llama server listening at http://127.0.0.1:8081
Oct 24 14:42:39 llama server[981998]: {"timestamp":1698158559,"level":"INFO","function":"main","line":1746,"message":"HTTP server listening","hostname":"127.0.0.1","port":8081}
Oct 24 14:42:51 llama server[981998]: llama_print_timings:        load time =   13864.16 ms
Oct 24 14:42:51 llama server[981998]: llama_print_timings:      sample time =      25.96 ms /    60 runs   (    0.43 ms per token,  2310.80 tokens per second)
Oct 24 14:42:51 llama server[981998]: llama_print_timings: prompt eval time =    1022.36 ms /   138 tokens (    7.41 ms per token,   134.98 tokens per second)
Oct 24 14:42:51 llama server[981998]: llama_print_timings:        eval time =    3947.52 ms /    59 runs   (   66.91 ms per token,    14.95 tokens per second)
Oct 24 14:42:51 llama server[981998]: llama_print_timings:       total time =    5013.52 ms
Oct 24 14:42:51 llama server[981998]: {"timestamp":1698158571,"level":"INFO","function":"log_server_request","line":1233,"message":"request","remote_addr":"127.0.0.1","remote_port":58996,"status":200,"method":"POST","path":"/completion","params":{}}
Broken build (the request returns 200, but no timings are printed):

Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: freq_scale = 1
Oct 24 14:37:13 llama server[981689]: llama_kv_cache_init: offloading v cache to GPU
Oct 24 14:37:13 llama server[981689]: llama_kv_cache_init: offloading k cache to GPU
Oct 24 14:37:13 llama server[981689]: llama_kv_cache_init: VRAM kv self = 160.00 MB
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: kv self size  =  160.00 MB
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: compute buffer total size = 151.13 MB
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: VRAM scratch buffer: 145.00 MB
Oct 24 14:37:13 llama server[981689]: llama_new_context_with_model: total VRAM used: 46627.61 MB (model: 46322.61 MB, context: 305.00 MB)
Oct 24 14:37:13 llama server[981689]: Available slots:
Oct 24 14:37:13 llama server[981689]:  -> Slot 0 - max context: 512
Oct 24 14:37:13 llama server[981689]: llama server listening at http://127.0.0.1:8081
Oct 24 14:37:13 llama server[981689]: {"timestamp":1698158233,"level":"INFO","function":"main","line":2499,"message":"HTTP server listening","hostname":"127.0.0.1","port":8081}
Oct 24 14:37:13 llama server[981689]: all slots are idle and system prompt is empty, clear the KV cache
Oct 24 14:37:22 llama server[981689]: slot 0 is processing [task id: 0]
Oct 24 14:37:22 llama server[981689]: slot 0 : kv cache rm - [0, end)
Oct 24 14:37:22 llama server[981689]: slot 0 released (138 tokens in cache)
Oct 24 14:37:22 llama server[981689]: {"timestamp":1698158242,"level":"INFO","function":"log_server_request","line":2163,"message":"request","remote_addr":"127.0.0.1","remote_port":42486,"status":200,"method":"POST","path":"/completion","params":{}}
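
Since the regression window is roughly b1412 to the current build, `git bisect run` could pinpoint the offending commit, e.g. `git bisect start master b1412 && git bisect run python3 check.py`. A sketch of the helper, assuming a Makefile-based CUDA build (`LLAMA_CUBLAS=1` was the CUDA flag at the time); the model path, startup wait, and generation parameters are placeholders:

```python
#!/usr/bin/env python3
"""Hypothetical `git bisect run` helper: exits 0 if the server returns text,
1 if the response comes back empty, 125 if the commit does not build."""
import json
import subprocess
import sys
import time
import urllib.request

MODEL = "models/llama-2-70b-chat.Q4_K_M.gguf"  # placeholder path, adjust to your setup

try:
    # Rebuild at the current bisect commit with CUDA enabled.
    subprocess.run(["make", "clean"], check=True)
    subprocess.run(["make", "LLAMA_CUBLAS=1", "server"], check=True)
except subprocess.CalledProcessError:
    sys.exit(125)  # tell `git bisect run` to skip commits that fail to build

server = subprocess.Popen(["./server", "-m", MODEL, "--port", "8081", "-ngl", "99"])
try:
    time.sleep(60)  # crude wait for the model to load (~14 s in the logs above)
    req = urllib.request.Request(
        "http://127.0.0.1:8081/completion",
        data=json.dumps({"prompt": "Hello", "n_predict": 16}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        content = json.load(resp).get("content", "")
    sys.exit(0 if content.strip() else 1)  # empty content marks the commit as bad
finally:
    server.terminate()
```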
