regression: output is nonsense with latest commit and CUDA support enabled #7451

@enolan

Description

On commit 201cc11, I get gibberish output when sampling from Llama-3-8B quantized as Q5_K_M (same behavior with Q8_0, F16, F32, and Q4_K_M). This happens when llama.cpp is built with CUDA support, but not without it. I'm building with Nix. Here's an example run:

enolan@chonk ~/j/llama.cpp (master)> ./result-cuda-201cc11a/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64                                                                                                                                         

Log start                                                                                                                                                                                                                                     
main: build = 0 (unknown)                                                                                                                                                                                                                     
main: built with gcc (GCC) 12.3.0 for x86_64-unknown-linux-gnu                                                         
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))                                                                               
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                      
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: n_ff             = 14336                                                                          
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                                                                                                                                                     
llm_load_print_meta: causal attn      = 1                                                                              
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear                                                                         
llm_load_print_meta: freq_base_train  = 500000.0                                                                       
llm_load_print_meta: freq_scale_train = 1                                                                              
llm_load_print_meta: n_yarn_orig_ctx  = 8192                                                                           
llm_load_print_meta: rope_finetuned   = unknown                                                                        
llm_load_print_meta: ssm_d_conv       = 0                                                                              
llm_load_print_meta: ssm_d_inner      = 0                                                                              
llm_load_print_meta: ssm_d_state      = 0                                                                              
llm_load_print_meta: ssm_dt_rank      = 0                                                                              
llm_load_print_meta: model type       = 8B                                                                             
llm_load_print_meta: model ftype      = Q5_K - Medium                                                                  
llm_load_print_meta: model params     = 8.03 B                                                                         
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)                                          
llm_load_print_meta: general.name     = Meta-Llama-3-8B                                                                                                                                                                                       
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'                                                                                                                                                                            
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'                                                                                                                                                                              
llm_load_print_meta: LF token         = 128 'Ä'                                                                        
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'                                       
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no                                                                              
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes   
ggml_cuda_init: found 1 CUDA devices:        
  Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB                                                                          
llm_load_tensors: offloading 0 repeating layers to GPU  
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356
                                                           
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0


<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather. Annapolis, a town in Maryland, has the highest concentration of naval officers in the US. It was once home to the US Naval Academy, the most prominent naval academy in the country. In 2011, the academy was moved to Washington, DC, and the Naval Academy has since been renamed as the Naval
llama_print_timings:        load time =     453.35 ms
llama_print_timings:      sample time =       5.42 ms /    64 runs   (    0.08 ms per token, 11816.84 tokens per second)
llama_print_timings: prompt eval time =     562.33 ms /    48 tokens (   11.72 ms per token,    85.36 tokens per second)
llama_print_timings:        eval time =    9444.97 ms /    63 runs   (  149.92 ms per token,     6.67 tokens per second)
llama_print_timings:       total time =   10052.76 ms /   111 tokens
Log end

It starts talking about Annapolis, Maryland for some reason, instead of fabric. Other seeds also produce nonsense, either gibberish or a nonsensical change of topic. In contrast, the CPU-only build is fine:

enolan@chonk ~/j/llama.cpp (master)> ./result-cpuonly-201cc11a/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64               
Log start                                                                                                                                                                                                                                     
main: build = 0 (unknown)                                                                                                                                                                                                                     
main: built with gcc (GCC) 13.2.0 for x86_64-unknown-linux-gnu                                                                                                                                                                                
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))                                                                               
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.                      
llama_model_loader: - kv   0:                       general.architecture str              = llama                      
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B            
llama_model_loader: - kv   2:                          llama.block_count u32              = 32                         
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192                       
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096                       
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336                      
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32                         
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8                          
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000              
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010                   
llama_model_loader: - kv  10:                          general.file_type u32              = 17                         
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256                     
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128                        
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2                       
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe                  
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...                                                                                                          
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...                                                                                                          
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...                                                                                                                    
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000                     
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00                                                                        
llm_load_print_meta: f_max_alibi_bias = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: f_logit_scale    = 0.0e+00                                                                                                                                                                                               
llm_load_print_meta: n_ff             = 14336                                                                                                                                                                                                 
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                                                                                                                                                     
llm_load_print_meta: causal attn      = 1                                                                              
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear                                                                         
llm_load_print_meta: freq_base_train  = 500000.0                                                                       
llm_load_print_meta: freq_scale_train = 1                                                                              
llm_load_print_meta: n_yarn_orig_ctx  = 8192                                                                           
llm_load_print_meta: rope_finetuned   = unknown                                                                        
llm_load_print_meta: ssm_d_conv       = 0                                                                              
llm_load_print_meta: ssm_d_inner      = 0                                                                              
llm_load_print_meta: ssm_d_state      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_dt_rank      = 0                                                                                                                                                                                                     
llm_load_print_meta: model type       = 8B                                                                                                                                                                                                    
llm_load_print_meta: model ftype      = Q5_K - Medium                                                                  
llm_load_print_meta: model params     = 8.03 B                                                                                                                                                                                                
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)                                                                                                                                                                                   
llm_load_print_meta: general.name     = Meta-Llama-3-8B                                                                                                                                                                                       
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'                                                                                                                                                                            
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'                                                                                                                                                                              
llm_load_print_meta: LF token         = 128 'Ä'                                                                                                                                                                                               
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'                                                            
llm_load_tensors: ggml ctx size =    0.15 MiB                                                                          
llm_load_tensors:        CPU buffer size =  5459.93 MiB                                                                                                                                                                                       
.........................................................................................                                                                                                                                                     
llama_new_context_with_model: n_ctx      = 512       
llama_new_context_with_model: n_batch    = 512                                                                         
llama_new_context_with_model: n_ubatch   = 512                                                                         
llama_new_context_with_model: flash_attn = 0                                                                           
llama_new_context_with_model: freq_base  = 500000.0                                                                                                                                                                                           
llama_new_context_with_model: freq_scale = 1                                                                           
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB                                                          
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB                  
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB                                            
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB                                            
llama_new_context_with_model: graph nodes  = 1030                                                                                                                                                                                             
llama_new_context_with_model: graph splits = 1                                                                                                                                                                                                
                                                                                                                       
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |                                                                                                                                                                                          
sampling:                                                                                                                                                                                                                                     
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000                                                                                                                                       
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800                       
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000                                                        
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature                                       
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0                                                                                                                                                                             
                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                              
<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.  '''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.                                                                           
+ Seersucker fabrics are woven with extra threads of yarn, which are left                                              
llama_print_timings:        load time =     380.38 ms                                                                  
llama_print_timings:      sample time =       4.86 ms /    64 runs   (    0.08 ms per token, 13179.57 tokens per second)
llama_print_timings: prompt eval time =    1803.44 ms /    48 tokens (   37.57 ms per token,    26.62 tokens per second)
llama_print_timings:        eval time =    9420.80 ms /    63 runs   (  149.54 ms per token,     6.69 tokens per second)
llama_print_timings:       total time =   11269.28 ms /   111 tokens                                                   
Log end                                                                                                                
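
Since both runs use seed 1 and the same prompt, the two token streams can be compared position by position to see exactly where the CUDA build goes off the rails. This is a hypothetical helper (not part of llama.cpp); the token IDs in the example are made up for illustration:

```python
def first_divergence(a, b):
    """Return the index of the first position where two token sequences
    differ, or None if they agree up to the length of the shorter one."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None

# Made-up token IDs standing in for dumps from the CUDA and CPU builds.
cuda_tokens = [128000, 12488, 6, 220, 999]
cpu_tokens = [128000, 12488, 6, 256, 999]
print(first_divergence(cuda_tokens, cpu_tokens))  # -> 3
```

If the divergence index lands inside the prompt, the prompt-processing path (which runs on the GPU via the CUDA compute buffer even with 0/33 layers offloaded) is the likely culprit; if it lands at the first generated token, the logits of the final position are already wrong.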

It's repeating itself, but at least it makes sense. 6369bf0 (the previous commit) is fine with CUDA (and on CPU):

enolan@chonk ~/j/llama.cpp (master)> ./result-cuda-6369bf04/bin/llama -m /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf -p "'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather." -s 1 --color -n 64
Log start
main: build = 0 (unknown)                      
main: built with gcc (GCC) 12.3.0 for x86_64-unknown-linux-gnu                                                         
main: seed  = 1                                                                                                        
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /bulk/LLaMA/Meta-Llama-3-8B/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama                      
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B            
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000                                                                                                                                     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010                                                                                                                                          
llama_model_loader: - kv  10:                          general.file_type u32              = 17                                                                                                                                                
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256                     
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2                                                                                                                                              
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336                                                                          
llm_load_print_meta: n_expert         = 0                                                                              
llm_load_print_meta: n_expert_used    = 0                                                                              
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0                                                                              
llm_load_print_meta: rope type        = 0                                                                              
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown           
llm_load_print_meta: ssm_d_conv       = 0            
llm_load_print_meta: ssm_d_inner      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_d_state      = 0                                                                                                                                                                                                     
llm_load_print_meta: ssm_dt_rank      = 0                                                                                                                                                                                                     
llm_load_print_meta: model type       = 8B                                                                             
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 8.03 B                                                                                                                                                                                                
llm_load_print_meta: model size       = 5.33 GiB (5.70 BPW)  
llm_load_print_meta: general.name     = Meta-Llama-3-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>' 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  5459.93 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 0


<|begin_of_text|>'''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.  '''Seersucker''' or '''railroad stripe''' is a thin, puckered, usually [[cotton]] [[textile|fabric]], commonly but not necessarily striped or chequered, used to make clothing for hot weather.
+ Seersucker fabrics are woven with extra "bunching" yarns
llama_print_timings:        load time =     453.15 ms
llama_print_timings:      sample time =       5.00 ms /    64 runs   (    0.08 ms per token, 12812.81 tokens per second)
llama_print_timings: prompt eval time =     560.83 ms /    48 tokens (   11.68 ms per token,    85.59 tokens per second)
llama_print_timings:        eval time =    9432.22 ms /    63 runs   (  149.72 ms per token,     6.68 tokens per second)
llama_print_timings:       total time =   10037.80 ms /   111 tokens
Log end
