add option to run mmlu with 5 shots #6146
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6146
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure: as of commit 03b9346 with merge base df5b2ab, one new job has failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@helunwencser has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This PR does the following changes:

- add a `--num_fewshot` option, which is required for running the MMLU task with 5 shots
- set the default value of `--limit` to `None` so that we can actually run all examples
- update `eval_llama` to call `simple_evaluate`, which is a wrapper around `evaluate` and does some extra work for us, such as getting the task dict

Test Plan:

- Make sure WikiText perplexity for Llama 3.2 1B stays the same before and after the change.

  Before, running eval_llama for Llama 3.2 1B with `--limit` set to `None`:

  ```
  wikitext: {'word_perplexity,none': 12.78246428138387, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.610432252171856, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6874479705552373, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
  ```

  After, running eval_llama for Llama 3.2 1B:

  ```
  wikitext: {'word_perplexity,none': 12.78246428138387, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.610432252171856, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6874479705552373, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
  ```

- Make sure that lm_eval (v0.4.2, which is used by eval_llama) and eval_llama report similar numbers for Llama 3.2 1B and 3B BF16 on the MMLU task with 5 shots.

  Example command for lm_eval:

  ```
  lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct \
    --tasks mmlu \
    --device cuda \
    -f 5 \
    --batch_size auto
  ```

  Example command for eval_llama:

  ```
  python -m examples.models.llama2.eval_llama \
    -c /home/lunwenh/models/1B_Instruct/consolidated.00.pth \
    -p /home/lunwenh/models/1B_Instruct/params.json \
    -t /home/lunwenh/models/1B_Instruct/tokenizer.model \
    -kv \
    -d bf16 \
    --tasks mmlu \
    -f 5 \
    --max_seq_length 2048
  ```

Differential Revision: [D64215268](https://our.internmc.facebook.com/intern/diff/D64215268)
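For illustration only, here is a minimal sketch, not the executorch `eval_llama` code itself, of how a `--num_fewshot` flag and a `--limit` default of `None` can be threaded through to `lm_eval.simple_evaluate` (lm_eval v0.4.x). The argument names and the use of the generic Hugging Face backend below are assumptions made for the example:

```python
# Minimal sketch (assumed flag names, HF backend): wiring --num_fewshot and --limit
# through to lm_eval.simple_evaluate, which builds the task dict and calls evaluate().
import argparse

import lm_eval  # lm-evaluation-harness v0.4.x


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run an lm_eval task, e.g. MMLU with 5 shots")
    parser.add_argument("--tasks", nargs="+", default=["wikitext"], help="lm_eval task names")
    parser.add_argument(
        "-f", "--num_fewshot", type=int, default=None,
        help="Number of few-shot examples; MMLU is conventionally run with 5",
    )
    parser.add_argument(
        "--limit", type=int, default=None,
        help="Number of examples per task; None evaluates the full task",
    )
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    results = lm_eval.simple_evaluate(
        model="hf",  # generic Hugging Face backend, used here purely for illustration
        model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
        tasks=args.tasks,
        num_fewshot=args.num_fewshot,
        limit=args.limit,
        batch_size=1,
    )
    for task, metrics in results["results"].items():
        print(task, metrics)


if __name__ == "__main__":
    main()
```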
@helunwencser merged this pull request in e95aa9d.
Stack from ghstack (oldest at bottom):

This PR does the following changes:

- add a `--num_fewshot` option, which is required for running the MMLU task with 5 shots
- set the default value of `--limit` to `None` so that we can actually run all examples
- update `eval_llama` to call `simple_evaluate`, which is a wrapper around `evaluate` and does some extra work for us, such as getting the task dict

Test Plan:

- Make sure WikiText perplexity for Llama 3.2 1B stays the same before and after the change; the before and after eval_llama outputs for Llama 3.2 1B (with `--limit` set to `None`) are shown in the comment above.
- Make sure that lm_eval (v0.4.2, which is used by eval_llama) and eval_llama report similar numbers for Llama 3.2 1B and 3B BF16 on the MMLU task with 5 shots; example commands for lm_eval and eval_llama are shown in the comment above.

Differential Revision: [D64215268](https://our.internmc.facebook.com/intern/diff/D64215268)
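The last bullet in the change list relies on `simple_evaluate` doing extra bookkeeping on top of `evaluate`, such as building the task dict and applying the few-shot count. Below is a rough, hedged sketch of that lower-level path, assuming the lm_eval v0.4.2 API (`TaskManager`, `get_task_dict`, `set_config`) and using the Hugging Face model wrapper purely for illustration:

```python
# Hedged sketch of the lower-level lm_eval (v0.4.x) path that simple_evaluate wraps.
# The names below reflect my reading of lm_eval 0.4.2 and are not the eval_llama code.
from lm_eval.evaluator import evaluate
from lm_eval.models.huggingface import HFLM
from lm_eval.tasks import TaskManager, get_task_dict

# Model choice is illustrative; eval_llama plugs in its own LM wrapper instead.
lm = HFLM(pretrained="meta-llama/Llama-3.2-1B-Instruct", batch_size=1)

# simple_evaluate does this bookkeeping for us: resolve the task dict and push
# num_fewshot into each task's config before handing everything to evaluate().
task_dict = get_task_dict(["mmlu"], TaskManager())
for task_name, task_obj in task_dict.items():
    if isinstance(task_obj, tuple):  # group entries come back as (group, task) pairs
        _, task_obj = task_obj
        if task_obj is None:
            continue
    task_obj.set_config(key="num_fewshot", value=5)

# limit=None evaluates every example in every MMLU subtask.
results = evaluate(lm=lm, task_dict=task_dict, limit=None)
print(results["results"])
```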