
Commit 1f2b9aa

helunwencser authored and facebook-github-bot committed
add instructions about getting mmlu score for instruct models (#6175)
Summary: Pull Request resolved: #6175 (imported-using-ghimport)
Test Plan: Imported from OSS
Reviewed By: mergennachin
Differential Revision: D64256005
Pulled By: helunwencser
fbshipit-source-id: b799d311cde065bbbf94f389c1c407c3b59b1da2
1 parent 5512fe0 commit 1f2b9aa

1 file changed (+24, -4)

examples/models/llama2/README.md

Lines changed: 24 additions & 4 deletions
@@ -49,7 +49,7 @@ We employed 4-bit groupwise per token dynamic quantization of all the linear lay
 
 We evaluated WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). Please note that LM Eval reports perplexity normalized by word count instead of token count. You may see different perplexity for WikiText from other sources if they implement it differently. More details can be found [here](https://github.com/EleutherAI/lm-evaluation-harness/issues/2301).
 
-Below are the results for two different groupsizes, with max_seq_len 2048, and 1000 samples.
+Below are the results for two different group sizes, with max_seq_length 2048 and limit 1000.
 
 |Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256)
 |--------|-----------------| ---------------------- | ---------------
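
A note on the normalization remark in the context above: both conventions exponentiate the same total negative log-likelihood, only over different denominators. In rough shorthand (these symbols are ours, not notation from the README): `PPL_word = exp(NLL_total / N_words)` while `PPL_token = exp(NLL_total / N_tokens)`; since subword tokenizers emit more tokens than words, the word-normalized figure is typically the larger of the two for the same run.
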
@@ -280,12 +280,32 @@ tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
 
 > Forewarning: Model evaluation without a GPU may take a long time, especially on larger models.
 
-Using the same arguments from above
+We use [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate model accuracy.
+
+For base models, use the following example command to calculate perplexity on WikiText.
 ```
-python -m examples.models.llama2.eval_llama -c <checkpoint.pth> -p <params.json> -t <tokenizer.model/bin> -d fp32 --max_seq_len <max sequence length> --limit <number of samples>
+python -m examples.models.llama2.eval_llama \
+  -c <checkpoint.pth> \
+  -p <params.json> \
+  -t <tokenizer.model/bin> \
+  -kv \
+  -d <checkpoint dtype> \
+  --max_seq_len <max sequence length> \
+  --limit <number of samples>
 ```
 
-The Wikitext results generated above used: `{max_seq_len: 2048, limit: 1000}`
+For instruct models, use the following example command to calculate the MMLU score.
+```
+python -m examples.models.llama2.eval_llama \
+  -c <checkpoint.pth> \
+  -p <params.json> \
+  -t <tokenizer.model/bin> \
+  -kv \
+  -d <checkpoint dtype> \
+  --tasks mmlu \
+  --num_fewshot 5 \
+  --max_seq_len <max sequence length>
+```
 
 ## Step 4: Run on your computer to validate
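
As a usage sketch of the base-model perplexity command shown above (the checkpoint, params, and tokenizer paths below are hypothetical placeholders; fp32, max_seq_len 2048, and limit 1000 are the WikiText settings the README already mentions):

```
python -m examples.models.llama2.eval_llama \
  -c /path/to/consolidated.00.pth \
  -p /path/to/params.json \
  -t /path/to/tokenizer.model \
  -kv \
  -d fp32 \
  --max_seq_len 2048 \
  --limit 1000
```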
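
And a similar sketch for the instruct-model MMLU command (the paths, dtype, and max_seq_len value are again illustrative placeholders; `--tasks mmlu` and `--num_fewshot 5` come straight from the diff above):

```
python -m examples.models.llama2.eval_llama \
  -c /path/to/consolidated.00.pth \
  -p /path/to/params.json \
  -t /path/to/tokenizer.model \
  -kv \
  -d fp32 \
  --tasks mmlu \
  --num_fewshot 5 \
  --max_seq_len 2048
```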
