
QLora and SpinQuant recipes fail to export on CI #8154

Description

guangy10 (Contributor)

🐛 Describe the bug

The job started failing last Thursday: https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=export-models%20(meta-llama&mergeLF=true

It looks like the root cause is this PR: #7927

Stacktrace:

2025-01-31T00:34:23.3001996Z + DOWNLOADED_PATH=/var/lib/ci-user/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct-QLORA_INT4_EO8/snapshots/3fdf98b6bc1069f632a468b0676299a0a1b65071
2025-01-31T00:34:23.3007472Z + python -m examples.models.llama.export_llama --model llama3_2 --checkpoint /var/lib/ci-user/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct-QLORA_INT4_EO8/snapshots/3fdf98b6bc1069f632a468b0676299a0a1b65071/consolidated.00.pth --params /var/lib/ci-user/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct-QLORA_INT4_EO8/snapshots/3fdf98b6bc1069f632a468b0676299a0a1b65071/params.json -qat -lora 16 --preq_mode 8da4w_output_8da8w --preq_group_size 32 --preq_embedding_quantize 8,0 --use_sdpa_with_kv_cache -kv -X --xnnpack-extended-ops -d fp32 --max_seq_length 2048 --output_name llama-3.2-1b-instruct-qlora-int4-eo8_llama3_qlora.pte --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
2025-01-31T00:34:23.3012177Z Traceback (most recent call last):
2025-01-31T00:34:23.3012854Z   File "/opt/conda/envs/py_3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2025-01-31T00:34:23.3013599Z     return _run_code(code, main_globals, None,
2025-01-31T00:34:23.3014243Z   File "/opt/conda/envs/py_3.10/lib/python3.10/runpy.py", line 86, in _run_code
2025-01-31T00:34:23.3014870Z     exec(code, run_globals)
2025-01-31T00:34:23.3015517Z   File "/pytorch/executorch/examples/models/llama/export_llama.py", line 32, in <module>
2025-01-31T00:34:23.3016259Z     main()  # pragma: no cover
2025-01-31T00:34:23.3016899Z   File "/pytorch/executorch/examples/models/llama/export_llama.py", line 28, in main
2025-01-31T00:34:23.3017583Z     export_llama(args)
2025-01-31T00:34:23.3018265Z   File "/pytorch/executorch/examples/models/llama/export_llama_lib.py", line 540, in export_llama
2025-01-31T00:34:23.3019069Z     builder = _export_llama(args)
2025-01-31T00:34:23.3019817Z   File "/pytorch/executorch/examples/models/llama/export_llama_lib.py", line 677, in _export_llama
2025-01-31T00:34:23.3020602Z     _validate_args(args)
2025-01-31T00:34:23.3021305Z   File "/pytorch/executorch/examples/models/llama/export_llama_lib.py", line 650, in _validate_args
2025-01-31T00:34:23.3022094Z     raise ValueError(
2025-01-31T00:34:23.3023561Z ValueError: max_context_length 128 must be >= max_seq_len 2048. max_context_length impacts kv cache size that is used to remember history, while max_seq_length refers to user prompt length. Please use --max_context_length to specify context length.
2025-01-31T00:34:23.3046001Z ##[error]Process completed with exit code 1.
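Based on the error message above, one plausible workaround (my assumption, not necessarily the fix the recipe owners intend) is to have the CI export command pass --max_context_length explicitly so that it is at least as large as --max_seq_length, for example:

python -m examples.models.llama.export_llama --model llama3_2 --checkpoint "${DOWNLOADED_PATH}/consolidated.00.pth" --params "${DOWNLOADED_PATH}/params.json" -qat -lora 16 --preq_mode 8da4w_output_8da8w --preq_group_size 32 --preq_embedding_quantize 8,0 --use_sdpa_with_kv_cache -kv -X --xnnpack-extended-ops -d fp32 --max_seq_length 2048 --max_context_length 2048 --output_name llama-3.2-1b-instruct-qlora-int4-eo8_llama3_qlora.pte --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'

Everything other than the added --max_context_length 2048 is copied from the failing command above (with the checkpoint paths shortened via the DOWNLOADED_PATH variable already set by the job); whether 2048 is the right context length for these recipes is for the author of #7927 to confirm.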

Versions

trunk

Activity

Labels added on Feb 3, 2025: "module: benchmark" (Issues related to the benchmark infrastructure), "module: ci" (Issues related to continuous integration)
guangy10 (Contributor, Author) commented on Feb 3, 2025

@kimishpatel looks like it's from #7927. Can you take a look?

Label added on Feb 4, 2025: "triaged" (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
kimishpatel (Contributor) commented on Feb 4, 2025

> @kimishpatel looks like it's from #7927. Can you take a look?

I can probably find where the perf jobs are, but it will be much easier if you point me to them. Otherwise I will dig around tomorrow.

guangy10 (Contributor, Author) commented on Feb 4, 2025

> @kimishpatel looks like it's from #7927. Can you take a look?

> I can probably find where the perf jobs are, but it will be much easier if you point me to them. Otherwise I will dig around tomorrow.

@kimishpatel The link was already attached when this task was filed. Follow the link at the top of this issue; it will direct you to the HUD, where you can see all the jobs that failed due to this issue. Click into any of the ❌ jobs and you will see the full stack trace.

kimishpatel (Contributor) commented on Feb 5, 2025

> @kimishpatel looks like it's from #7927. Can you take a look?

> I can probably find where the perf jobs are, but it will be much easier if you point me to them. Otherwise I will dig around tomorrow.

> @kimishpatel The link was already attached when this task was filed. Follow the link at the top of this issue; it will direct you to the HUD, where you can see all the jobs that failed due to this issue. Click into any of the ❌ jobs and you will see the full stack trace.

lol. I was asking for the source code of the job that lists the different steps. While I could dig it up, which it seems I will have to do, I was hoping you would just point me to a shell script or something else that I could update.

Good point on the readme file. Will update

guangy10 (Contributor, Author) commented on Feb 5, 2025

> @kimishpatel looks like it's from #7927. Can you take a look?

> I can probably find where the perf jobs are, but it will be much easier if you point me to them. Otherwise I will dig around tomorrow.

> @kimishpatel The link was already attached when this task was filed. Follow the link at the top of this issue; it will direct you to the HUD, where you can see all the jobs that failed due to this issue. Click into any of the ❌ jobs and you will see the full stack trace.

> lol. I was asking for the source code of the job that lists the different steps. While I could dig it up, which it seems I will have to do, I was hoping you would just point me to a shell script or something else that I could update.

> Good point on the readme file. Will update

Aha, here is the source of the workflow that lists the steps:

guangy10 (Contributor, Author) commented on Feb 10, 2025

@kimishpatel These jobs have been failing for a week, can you prioritize fixing it?

kimishpatel (Contributor) commented on Feb 11, 2025

> @kimishpatel These jobs have been failing for a week, can you prioritize fixing it?

#8374


Metadata

Labels: "module: benchmark" (Issues related to the benchmark infrastructure), "module: ci" (Issues related to continuous integration), "triaged" (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Participants: @digantdesai, @mergennachin, @kimishpatel, @jackzhxng, @guangy10
