Labels: module: benchmark (Issues related to the benchmark infrastructure), module: ci (Issues related to continuous integration), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🐛 Describe the bug
The job has been failing since last Thursday: https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=export-models%20(meta-llama&mergeLF=true
It looks like the root cause is this PR: #7927
Stacktrace:
2025-01-31T00:34:23.3001996Z + DOWNLOADED_PATH=/var/lib/ci-user/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct-QLORA_INT4_EO8/snapshots/3fdf98b6bc1069f632a468b0676299a0a1b65071
2025-01-31T00:34:23.3007472Z + python -m examples.models.llama.export_llama --model llama3_2 --checkpoint /var/lib/ci-user/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct-QLORA_INT4_EO8/snapshots/3fdf98b6bc1069f632a468b0676299a0a1b65071/consolidated.00.pth --params /var/lib/ci-user/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct-QLORA_INT4_EO8/snapshots/3fdf98b6bc1069f632a468b0676299a0a1b65071/params.json -qat -lora 16 --preq_mode 8da4w_output_8da8w --preq_group_size 32 --preq_embedding_quantize 8,0 --use_sdpa_with_kv_cache -kv -X --xnnpack-extended-ops -d fp32 --max_seq_length 2048 --output_name llama-3.2-1b-instruct-qlora-int4-eo8_llama3_qlora.pte --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
2025-01-31T00:34:23.3012177Z Traceback (most recent call last):
2025-01-31T00:34:23.3012854Z File "/opt/conda/envs/py_3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2025-01-31T00:34:23.3013599Z return _run_code(code, main_globals, None,
2025-01-31T00:34:23.3014243Z File "/opt/conda/envs/py_3.10/lib/python3.10/runpy.py", line 86, in _run_code
2025-01-31T00:34:23.3014870Z exec(code, run_globals)
2025-01-31T00:34:23.3015517Z File "/pytorch/executorch/examples/models/llama/export_llama.py", line 32, in <module>
2025-01-31T00:34:23.3016259Z main() # pragma: no cover
2025-01-31T00:34:23.3016899Z File "/pytorch/executorch/examples/models/llama/export_llama.py", line 28, in main
2025-01-31T00:34:23.3017583Z export_llama(args)
2025-01-31T00:34:23.3018265Z File "/pytorch/executorch/examples/models/llama/export_llama_lib.py", line 540, in export_llama
2025-01-31T00:34:23.3019069Z builder = _export_llama(args)
2025-01-31T00:34:23.3019817Z File "/pytorch/executorch/examples/models/llama/export_llama_lib.py", line 677, in _export_llama
2025-01-31T00:34:23.3020602Z _validate_args(args)
2025-01-31T00:34:23.3021305Z File "/pytorch/executorch/examples/models/llama/export_llama_lib.py", line 650, in _validate_args
2025-01-31T00:34:23.3022094Z raise ValueError(
2025-01-31T00:34:23.3023561Z ValueError: max_context_length 128 must be >= max_seq_len 2048. max_context_length impacts kv cache size that is used to remember history, while max_seq_length refers to user prompt length. Please use --max_context_length to specify context length.
2025-01-31T00:34:23.3046001Z ##[error]Process completed with exit code 1.
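Based on the error message, _validate_args requires max_context_length >= max_seq_length, so the most direct fix is to pass the flag the error itself suggests. A minimal sketch of the adjusted command, assuming nothing else needs to change ($SNAPSHOT is just shorthand for the downloaded Hugging Face snapshot directory from the log above):

```bash
# Sketch: the failing export command with --max_context_length raised to
# match --max_seq_length, per the ValueError above. All other flags are
# copied verbatim from the failing job; $SNAPSHOT abbreviates the long
# huggingface cache path.
SNAPSHOT=/var/lib/ci-user/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct-QLORA_INT4_EO8/snapshots/3fdf98b6bc1069f632a468b0676299a0a1b65071
python -m examples.models.llama.export_llama \
  --model llama3_2 \
  --checkpoint "$SNAPSHOT/consolidated.00.pth" \
  --params "$SNAPSHOT/params.json" \
  -qat -lora 16 \
  --preq_mode 8da4w_output_8da8w \
  --preq_group_size 32 \
  --preq_embedding_quantize 8,0 \
  --use_sdpa_with_kv_cache -kv -X --xnnpack-extended-ops \
  -d fp32 \
  --max_seq_length 2048 \
  --max_context_length 2048 \
  --output_name llama-3.2-1b-instruct-qlora-int4-eo8_llama3_qlora.pte \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
```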
Versions
trunk
guangy10 commented on Feb 3, 2025
@kimishpatel looks like it's from #7927. Can you take a look?
kimishpatel commented on Feb 4, 2025
I can probably find where the perf jobs are, but it would be much easier if you pointed me to them. Otherwise I'll dig around tomorrow.
guangy10 commented on Feb 4, 2025
@kimishpatel The link was already attached when I filed the task. Follow the link at the top of this issue; it will take you to the HUD, where you can see all the jobs failing because of this issue. Click into any ❌ job and you will see the full stack trace.
mergennachin commented on Feb 4, 2025
@kimishpatel - I think you also need to update the READMEs; see the search sketch after these links:
https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md
https://github.com/pytorch/executorch/blob/main/examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md
https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md
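To catch any other docs or CI scripts carrying the same stale command, a hypothetical sweep from the repo root (the searched paths and file globs are assumptions, not an exhaustive list):

```bash
# Hypothetical search for export_llama invocations in docs, shell scripts,
# and workflow files that may still need --max_context_length added.
grep -rn "export_llama" \
  --include="*.md" --include="*.sh" --include="*.yml" --include="*.yaml" \
  examples .ci .github
```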
kimishpatel commented on Feb 5, 2025
lol. I was asking for the source code of the job that lists the different steps. While I could dig it up, which it seems I will have to do, I was hoping you would just point me to a shell script or something else I could update.
Good point on the README files; will update.
guangy10 commented on Feb 5, 2025
Aha, here is the source of the workflow that lists the steps:
guangy10 commented on Feb 10, 2025
@kimishpatel These jobs have been failing for a week. Can you prioritize fixing them?
kimishpatel commented on Feb 11, 2025
#8374