65B model giving incorrect output #69
|
Same happens with the 30B model:

`ubuntu@ip-x:~/llama.cpp$ ./main -m ./models/30B/ggml-model-q4_0.bin \`

main: prompt: 'The history of humanity starts with the bing bang, then '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

The history of humanity starts with the bing bang, then derivative Óingusch Mes cinemadated UN impactalogftillingosph médecVERSION possessionanampionțiaět disappearedment PK UN forg derivative trouveance gentlemanIABotIABot soortán boxes médeciumblica'} após Squad бан occas grayеньist whitespace савезнојselectedine cavalley vagueembankaely Cardülés ej clas notify hescaught insgesamtaftnm meses soort prep Easterningists derivativeeriaAG Bundes cinema Mes surrillingftanosphVERSIONliershashamp possessioniana disappearedment PKIABot UNuschumably trouvelinewidthadersː notify gentlemanionxpafán Squad Ram splitting succIABot médecеньět após савезнојium banksotenist банanka^Z |
I'm getting the same result. 7B and 13B work fine, but 30B and 65B produce garbage.
|
I get the same 'nonsense' with the 7B model. My md5sum of consolidated.00.pth is 6efc8dab194ab59e49cd24be5574d85e, which matches the value in checklist.chk. |
I can confirm that 7B and 13B work for me. 30B and 65B are the ones not giving correct output. |
fp16 and 4-bit quantized are both working for me with the 30B and 65B models. I haven't run the smaller models:
|
Can you share the sizes of the files as well, and also a successfully executed example with both models? Thanks! |
I just pulled the latest code and will regression-check the output with all 4-bit models:
|
Note that, as per @ggerganov's correction to my observation in issue #95, the number of threads and other subtleties such as different floating-point implementations may prevent us from reproducing exactly the same output, even with the same random seed:
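(As an aside, not from the thread itself: a quick way to see this effect on one machine is to hold the model, prompt, and seed fixed and vary only the thread count. The sketch below assumes this build accepts `-s` to fix the seed and that the informational log lines go to stderr; strip them first if they do not.)

```sh
# Same model, prompt, and seed; only the thread count differs.  Any divergence
# in the generated text then comes from work partitioning and floating-point
# summation order, not from the sampler's RNG.
SEED=1234
PROMPT='The history of humanity starts with the bing bang, then '

./main -m ./models/7B/ggml-model-q4_0.bin -s "$SEED" -t 4  -n 64 -p "$PROMPT" 2>/dev/null > out_t4.txt
./main -m ./models/7B/ggml-model-q4_0.bin -s "$SEED" -t 16 -n 64 -p "$PROMPT" 2>/dev/null > out_t16.txt

diff out_t4.txt out_t16.txt && echo "outputs identical" || echo "outputs differ"
```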
|
@gjmulder, could you please share your md5sum values for the weights you downloaded (e.g. consolidated.00.pth)? Then I can check whether I am starting from the right files. Thanks. |
OK, now I get sensible results with the 7B model:

./main -m ./models/7B/ggml-model-q4_0.bin -t 16 -n 1000000 -p 'The history of humanity starts with the bing bang, then '

system_info: n_threads = 16 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: prompt: 'The history of humanity starts with the bing bang, then '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

The history of humanity starts with the bing bang, then 7 generations later there is agriculture and cities are created. Eventually we get to where we are today; technology being our savior!

main: mem per token = 14565444 bytes |
The conversion and quantization should be deterministic, so if the .bin files don't match, the .pth files won't match either:
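(A sketch of my own, not from the thread: one way to act on that is to checksum both the original weights and the converted files so different machines can compare. This assumes checklist.chk is in plain `md5sum -c` format, which it appears to be.)

```sh
# Verify the downloaded weights against the checklist that ships with them,
# then record checksums of the converted/quantized ggml files so they can be
# compared across machines.
cd models/30B

md5sum -c checklist.chk                        # checks the files listed in the checklist

md5sum ggml-model-*.bin* > ggml-checksums.txt  # f16 and q4_0 parts
cat ggml-checksums.txt                         # paste these into the thread for comparison
```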
|
Does reducing top_p to something like 0.3 or even 0.1 provide better output for these larger models? |
A top_p of 0.3 to 0.5 looks better, especially for the smaller models. The "10 simple steps" prompt looks useful for testing each model's ability to count consecutively at different top_p settings:
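(For anyone who wants to repeat that comparison, a rough sweep could look like the sketch below. The prompt is my guess at the "10 simple steps" test, presumably the README's "Building a website can be done in 10 simple steps:" example, and the `--top_p` flag name is assumed from the usage text of that era.)

```sh
# Hypothetical top_p sweep: same model and prompt, only top_p changes.
PROMPT='Building a website can be done in 10 simple steps:'

for TOP_P in 0.1 0.3 0.5 0.95; do
    echo "=== top_p = $TOP_P ==="
    ./main -m ./models/30B/ggml-model-q4_0.bin -t 16 -n 128 \
           --top_p "$TOP_P" -p "$PROMPT"
done
```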
|
I also explored |
I can confirm that the latest branch (March 15, 2023) works for all models. You will have to redo the quantization to make it work if you had problems.
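(For reference, a sketch of redoing the conversion and quantization with the mid-March 2023 tree, based on the README of that time; the trailing arguments are assumed to mean 1 = f16 output and 2 = q4_0.)

```sh
# Rebuild, regenerate the f16 ggml file from the original .pth weights, then
# re-quantize to 4 bits.  For the multi-part models (30B/65B), repeat the
# quantize step for each ggml-model-f16.bin.N part.
make

python3 convert-pth-to-ggml.py models/30B/ 1                  # 1 = f16 output
./quantize ./models/30B/ggml-model-f16.bin \
           ./models/30B/ggml-model-q4_0.bin 2                 # 2 = q4_0
```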
With that done, the 65B model uses 31% of 128 GB of RAM when performing inference.
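(My arithmetic, not from the thread: 0.31 × 128 GB ≈ 40 GB, and 65 × 10^9 parameters at roughly 5 bits each in q4_0 (4-bit quants plus a per-block scale) comes to about 65e9 × 5 / 8 bytes ≈ 41 GB, so resident memory is dominated by the quantized weights themselves.)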
example output:
|