add option to run mmlu with 5 shots #6146


Closed
wants to merge 4 commits into from

Conversation

@helunwencser (Contributor) commented Oct 10, 2024

Stack from ghstack (oldest at bottom):

This PR makes the following changes:

- add the `--num_fewshot` option, which is required for running the MMLU task with 5 shots
- set the default value of `--limit` to `None` so that all examples are actually run
- update `eval_llama` to call `simple_evaluate`, a wrapper around `evaluate` that does some extra work for us, such as building the task dict (see the sketch below)
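
A minimal sketch of how these options might be wired into `lm_eval.simple_evaluate`. Only `--num_fewshot`, the `--limit` default of `None`, and the switch to `simple_evaluate` come from this PR; the remaining argument names and the `model="hf"` placeholder are illustrative assumptions, not the actual eval_llama code:

```python
# Hypothetical sketch, not the actual eval_llama diff: only --num_fewshot,
# --limit defaulting to None, and the use of simple_evaluate are described
# in this PR; everything else here is illustrative.
import argparse

import lm_eval

parser = argparse.ArgumentParser()
parser.add_argument("--tasks", nargs="+", default=["wikitext"])
parser.add_argument(
    "--num_fewshot",
    type=int,
    default=None,  # pass 5 for the MMLU setup in this PR
    help="Number of few-shot examples prepended to each prompt.",
)
parser.add_argument(
    "--limit",
    type=int,
    default=None,  # None = run every example instead of a truncated subset
    help="Cap on the number of examples evaluated per task.",
)
args = parser.parse_args()

# simple_evaluate wraps evaluate(): it builds the task dict, applies
# num_fewshot/limit, and returns a dict of per-task metrics.
results = lm_eval.simple_evaluate(
    model="hf",  # placeholder; eval_llama would pass its own model wrapper here
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
    tasks=args.tasks,
    num_fewshot=args.num_fewshot,
    limit=args.limit,
)
```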

Test Plan:

- Make sure WikiText perplexity for Llama 3.2 1B stays the same before and after the change.

Before the change, running eval_llama for Llama 3.2 1B with limit set to None:

```
wikitext: {'word_perplexity,none': 12.78246428138387, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.610432252171856, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6874479705552373, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
```

After the change, running eval_llama for Llama 3.2 1B:

```
wikitext: {'word_perplexity,none': 12.78246428138387, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.610432252171856, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6874479705552373, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
```

- Make sure that lm_eval (v0.4.2, which is used by eval_llama) and eval_llama report similar numbers for Llama 3.2 1B and 3B BF16 on the MMLU task with 5 shots.

Example command for lm_eval (`-f 5` requests 5-shot evaluation):

```
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct \
    --tasks mmlu \
    --device cuda \
    -f 5 \
    --batch_size auto
```

Example command for eval_llama:

```
python -m examples.models.llama2.eval_llama \
    -c /home/lunwenh/models/1B_Instruct/consolidated.00.pth \
    -p /home/lunwenh/models/1B_Instruct/params.json \
    -t /home/lunwenh/models/1B_Instruct/tokenizer.model \
    -kv \
    -d bf16 \
    --tasks mmlu \
    -f 5 \
    --max_seq_length 2048
```
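
The `wikitext: {...}` lines in the test plan above are per-task entries of the dict that `simple_evaluate` returns. Continuing from the sketch earlier (so `results` is its return value), they can be reproduced with:

```python
# results["results"] maps each task name to its metrics dict
# (e.g. 'word_perplexity,none', 'bits_per_byte,none', 'alias').
for task, metrics in results["results"].items():
    print(f"{task}: {metrics}")
```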

Differential Revision: [D64215268](https://our.internmc.facebook.com/intern/diff/D64215268)

pytorch-bot (bot) commented Oct 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6146

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 03b9346 with merge base df5b2ab:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

helunwencser added a commit that referenced this pull request Oct 10, 2024
ghstack-source-id: 9312bc1
Pull Request resolved: #6146
@facebook-github-bot added the CLA Signed label Oct 10, 2024
@helunwencser changed the base branch from gh/helunwencser/43/base to main October 10, 2024 23:35

@helunwencser has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

helunwencser added a commit that referenced this pull request Oct 11, 2024
ghstack-source-id: d36ec02
Pull Request resolved: #6146

@helunwencser has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mergennachin self-requested a review October 11, 2024 15:28
helunwencser added a commit that referenced this pull request Oct 11, 2024
ghstack-source-id: 72a2e9c
Pull Request resolved: #6146

@helunwencser has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@helunwencser has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@helunwencser merged this pull request in e95aa9d.

Labels: CLA Signed, Merged