Regression: "The first main on the moon was " #693

Closed
simplejackcoder opened this issue Apr 1, 2023 · 14 comments
Labels
question Further information is requested

Comments

@simplejackcoder

I saw a blog post where that prompt was used, and now when I try it myself using LLaMA I don't get the same result. It is quite strange.

It keeps telling me the man is 38 years old and then goes off on a tangent. Could this be a recent regression, @ggerganov?

@simplejackcoder
Author

D:\llama.cpp\bin\Release\main.exe -m D:\llama.cpp\models\7B\ggml-model-q4_0.bin -t 16 -n 2048 -p "Tell me very specifically the answer to the question I ask and no additional information. Only output the name. What was the name of the first man on the moon?"
main: seed = 1680392790
llama_model_load: loading model from 'D:\llama.cpp\models\7B\ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'D:\llama.cpp\models\7B\ggml-model-q4_0.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 256.00 MB

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 2048, n_keep = 0

Tell me very specifically the answer to the question I ask and no additional information. Only output the name. What was the name of the first man on the moon?
We are going to have a quiz in class tomorrow, so please study your textbook (Chapters 5-7).

  1. Write down two things you would want to bring with you if

@jart
Contributor

jart commented Apr 1, 2023

Note the line:

main: seed = 1680392790

A random seed is used by default to inject a bit of variation. If you want deterministic reproduction, then you need to copy that seed and pass it as ./main -s 1680392790 ....

Enjoy!
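For illustration, here is a minimal C++ sketch (not llama.cpp's actual sampling code, and with made-up token probabilities) of why reusing the seed reproduces a run: the same seed drives the same sequence of random draws within a given build, so the same tokens get picked.

// Minimal sketch: reseeding with the same value reproduces the same picks.
// Illustrative only; this is not llama.cpp's real sampler.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const unsigned seed = 1680392790;               // seed reported in the log above
    std::vector<float> probs = {0.1f, 0.6f, 0.3f};  // hypothetical token probabilities

    for (int run = 0; run < 2; ++run) {
        std::mt19937 rng(seed);                     // same seed -> same draw sequence
        std::discrete_distribution<int> dist(probs.begin(), probs.end());
        printf("run %d:", run);
        for (int i = 0; i < 5; ++i) {
            printf(" %d", dist(rng));               // identical picks on both runs
        }
        printf("\n");
    }
    return 0;
}

Both loop iterations print the same five indices; change the seed and the sequence changes, which is exactly what a fresh random seed does on every ./main invocation.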

jart closed this as completed Apr 1, 2023
jart added the question label Apr 1, 2023
@simplejackcoder
Author

Thanks @jart, I figured it was something like that, but take a look and see if you can reproduce the output I get using the same seed. I think there is a lurking bug here; I'm not sure if it is because of the mmap change.

D:\llama.cpp\bin\Release\main.exe -s 1680393600 -m D:\llama.cpp\models\7B\ggml-model-q4_0.bin -t 16 -n 128 -p "The first man on the moon was "
main: seed = 1680393600
llama_model_load: loading model from 'D:\llama.cpp\models\7B\ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'D:\llama.cpp\models\7B\ggml-model-q4_0.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 256.00 MB

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

The first man on the moon was 38-year old Neil Armstrong who was a Lieutenant Colonel in the United States Air Force. He was one of two pilots selected by NASA for this historic spaceflight, with the other pilot being David R R Scott who was an American Air Force Major and a test pilot at Edwards Air Force Base (AFB). They were accompanied in spacecraft Lunar Module (LM) Eagle by Command Pilot Michael Collins who was a Captain in United States Navy. Collins had been selected as Command Pilot for the Gemini IX-A spaceflight, which took place on 18 July 1966
llama_print_timings: load time = 1444.87 ms
llama_print_timings: sample time = 34.50 ms / 128 runs ( 0.27 ms per run)
llama_print_timings: prompt eval time = 994.79 ms / 8 tokens ( 124.35 ms per token)
llama_print_timings: eval time = 39516.76 ms / 128 runs ( 308.72 ms per run)
llama_print_timings: total time = 41007.28 ms

@anzz1
Contributor

anzz1 commented Apr 2, 2023

Sorry @simplejackcoder for getting your valid issue closed as invalid.
Since this breaking PR #613, @jart has taken the approach of suppressing any and all criticism of that change, including but not limited to closing valid issues as invalid.

As mentioned in #647, most of the PR does not make any sense.
Also the claimed "loading weights 10-100x faster" comes with huge asterisks like "not on all platforms" and "only on subsequent loads as the first load is actually slower", "memory mapping means that the model will stay behind and eat your memory even after the process is closed, this isn't explained and is currently forced with no way to turn it off" and "hasn't actually been properly tested to even function correctly on all platforms", all of which are deliberately suppressed to push this change through.

The whole process has been very disingenuous from the beginning, to the point that it was pretty much force-pushed through without any proper input from or polling of the community. I'd say in general, regarding anything, to be wary when only the positives are discussed and where any negative effects are being hushed or even dismissed as invalid. That being said, don't take my word for it and draw your own conclusions instead.

That said, I'm sure there are also people who didn't get such a sour taste in their mouth from this, and that eventually the mess will get cleared up and the bugs sorted out.

However, suppressing bug reports isn't the way to achieve that, as it will just push those who are still on the fence about the whole thing further away from contributing.

anzz1 reopened this Apr 2, 2023
@BadisG

BadisG commented Apr 2, 2023

@anzz1 So basically, we were required to change the format to ggjt (though "jt" stands for the initials of jart, it was @slaren who wrote almost all of the code in this PR – shouldn't his initials be included instead?), which led to versioning issues. All of this for a feature that doesn't function properly?

It's frustrating to see that @ggerganov isn't addressing this problem. This type of ego-driven decision could easily lead to this project's collapse!

@oelmekki

oelmekki commented Apr 2, 2023

It's frustrating to see that @ggerganov isn't addressing this problem. This type of ego-driven decision could easily lead to this project's collapse!

That kind of comment is exactly why FOSS projects collapse. A developer publishes some cool stuff, and then hordes of users demand commercial-level support for it for free and try to shame the developer for not giving it to them. If you don't like the new feature, fork the project and fix your own problem, or shut up.

@ggerganov, llama.cpp is very useful and I have tons of fun with it; thanks for that. The new model format also makes everything faster on my computer; it's awesome.

@jart
Contributor

jart commented Apr 2, 2023

D:\llama.cpp\bin\Release\main.exe -s 1680393600 -m D:\llama.cpp\models\7B\ggml-model-q4_0.bin -t 16 -n 128 -p "The first man on the moon was "
The first man on the moon was 38-year old Neil Armstrong who was a Lieutenant Colonel in the United States Air Force

I can't reproduce it. What Git revision produces this?

@jart
Contributor

jart commented Apr 2, 2023

I also need to know the sha256 checksum of D:\llama.cpp\models\7B\ggml-model-q4_0.bin

@danielzgtg
Contributor

danielzgtg commented Apr 2, 2023

not on all platforms

hasn't actually been properly tested to even function correctly on all platforms

mmap should work better on every single platform.

only on subsequent loads as the first load is actually slower

The Windows problem is just because there is code missing to fault in the pages in the correct (linear) order.

eat your memory even after the process is closed

No operating system allows processes to eat memory after they are killed.

memory mapping means that the model will stay behind

The OS previously kept the model behind in its page cache. The model stayed behind before the PR too. The merged PR only made it so that we don't need an extra copy in anonymous private memory.

only the positives are discussed and where any negative effects are being hushed or even dismissed as invalid

The only negative fact is that the preload code for Windows is missing. That can be fixed by a small PR later. EDIT: and the PR is here: #734
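As a rough sketch of what such preload code could look like (prefault_linear is a hypothetical helper, not the code from #734, and a 4096-byte page size is assumed), the idea is simply to touch every page of the mapping once in ascending order so the first load faults pages in sequentially instead of in the access order of inference:

// Sketch only: fault the mapped file in sequentially by reading one byte per page.
// prefault_linear is a hypothetical helper, not taken from llama.cpp or PR #734.
#include <cstddef>
#include <cstdint>

volatile uint8_t g_sink; // keeps the compiler from optimizing the reads away

void prefault_linear(const void *addr, size_t len, size_t page_size = 4096) {
    const uint8_t *p = static_cast<const uint8_t *>(addr);
    for (size_t off = 0; off < len; off += page_size) {
        g_sink = p[off]; // one read per page, in linear (file) order
    }
}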

@patrakov

patrakov commented Apr 2, 2023

memory mapping means that the model will stay behind and eat your memory even after the process is closed, this isn't explained and is currently forced with no way to turn it off

The statement that the model will stay behind is true. However, it is true not because of the mmap, but because of the OS disk cache that is active no matter what. Operating systems assume that, if some data on the disk was needed once, it will be needed again. This is true both for mmap and for "normal" reads. The OS is ready to drop this cached data as soon as it needs memory for anything else.

What mmap avoids is the need to copy the data from the OS disk cache to the explicitly allocated application buffer, and thus to have it in memory twice (once as a discardable copy in the disk cache, once in the application).
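To make the "extra copy" point concrete, here is a simplified sketch of the two loading strategies (POSIX-style calls with most error handling omitted; not llama.cpp's actual loader):

// read(): the file data ends up in RAM twice - once in the OS page cache and
// once in the malloc'd buffer. mmap(): the mapping is backed by the page cache
// itself, so there is no second copy. Error handling trimmed for brevity.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdlib>

void *load_with_read(const char *path, size_t size) {
    int fd = open(path, O_RDONLY);
    char *buf = (char *)malloc(size);   // anonymous private memory: the second copy
    size_t done = 0;
    while (done < size) {               // read() copies data out of the page cache
        ssize_t n = read(fd, buf + done, size - done);
        if (n <= 0) break;
        done += (size_t)n;
    }
    close(fd);
    return buf;
}

void *load_with_mmap(const char *path, size_t size) {
    int fd = open(path, O_RDONLY);
    void *buf = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0); // no extra copy
    close(fd);                          // the mapping stays valid after close()
    return buf;
}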

@simplejackcoder
Author

simplejackcoder commented Apr 3, 2023

Everyone, I'm not criticizing ANY change; I'm just asking whether something could have inadvertently been broken.

Commit of llama.cpp -- a717cba

@jart here are the sha256sum output for all 3 files

700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d consolidated.00.pth
0cc0b0a3dc8cd29f005946f8364ac2bbce797e792a40c0fb4114615e4f825976 ggml-model-f16.bin
2dad53e70ca521fedcf9f9be5c26c15df602487a9c008bdafbb2bf8f946b6bf0 ggml-model-q4_0.bin

@xloem
Contributor

xloem commented Apr 3, 2023

The two disparate outputs here use different prompts. The first one treats the model as if it were instruct-tuned; is it? Additionally, there is no apparent reason to relate this to mmap.

@simplejackcoder
Author

Things are getting stranger. Can someone at least validate what I'm seeing?

I tried the 13B model quantized to int4, and it's giving me very weird answers.

D:\llama.cpp>D:\llama.cpp\bin\Release\main.exe -s 1680516479 -m D:\llama.cpp\models\13B\ggml-model-q4_0.bin -t 16 -n 128 -p "The first man on the moon was "
main: seed = 1680516479
llama_model_load: loading model from 'D:\llama.cpp\models\13B\ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: type = 2
llama_model_load: ggml map size = 7759.83 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required = 9807.93 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'D:\llama.cpp\models\13B\ggml-model-q4_0.bin'
llama_model_load: model size = 7759.39 MB / num tensors = 363
llama_init_from_file: kv self size = 400.00 MB

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

The first man on the moon was 17-year old Neil Armstrong.

  1. If you could have a house full of anything, what would it be? I’d have a house full of books.^C

@oelmekki

oelmekki commented Apr 3, 2023

I have slightly different output for that seed and those parameters. I'm not sure if it's of any significance, though, because you seem to be on a Windows environment and I'm on a Linux one. If the seed is used to seed a system random function, it's probably not the same one.

$ ./main -s 1680516479 -m models/13B/ggml-model-q4_0.bin -t 16 -n 128 -p "The first man on the moon was "
main: seed = 1680516479
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.83 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: model size =  7759.39 MB / num tensors = 363
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 16 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0


 The first man on the moon was 17-year old Neil Armstrong.
On July 20, 1969, as commander of Apollo 11, he^C

Using different seeds often gives me fanciful ages for Armstrong as well. That being said, I haven't tried that prompt on previous versions, so I can't tell if there is any behavior change. In any case, you should not expect results similar to ChatGPT: this is a (quantized) 13B-parameter model, while ChatGPT is 175B.
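As a toy illustration of the platform point (not code from llama.cpp): the same seed fed to two different random engines yields different sequences, so identical seeds only guarantee identical output when the engine, arithmetic, and threading match as well.

// Same seed, two different engines, two different sequences.
// Toy example only; llama.cpp's actual RNG choice is not shown here.
#include <cstdio>
#include <random>

int main() {
    const unsigned seed = 1680516479;  // seed from the run above
    std::mt19937 a(seed);              // Mersenne Twister
    std::minstd_rand b(seed);          // minimal-standard LCG
    for (int i = 0; i < 3; ++i) {
        printf("draw %d  mt19937: %lu  minstd_rand: %lu\n",
               i, (unsigned long)a(), (unsigned long)b());
    }
    return 0;
}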
