Add max-prefill-length argument in distillation dataset generation script #1748


Merged
merged 1 commit into main from distillation on May 19, 2025

Conversation


@SurbhiJainUSC (Collaborator) commented May 15, 2025

Description

This PR introduces a max-prefill-length argument in the script used to generate the dataset for distillation. The argument is used to filter out prompt sequences longer than max-prefill-length before running inference.
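For illustration, here is a minimal sketch of the kind of filtering step this argument enables. Everything in it is an assumption for the sketch, not the actual code in generate_distillation_dataset.py: the function name filter_by_prefill_length, the dataset layout (an iterable of dicts with a "prompt" field), and the tokenizer interface (an encode method returning a list of token ids).

```python
import argparse


def filter_by_prefill_length(dataset, tokenizer, max_prefill_length):
    """Keep only examples whose tokenized prompt fits within max_prefill_length.

    Hypothetical helper: assumes each example is a dict with a "prompt" field
    and that tokenizer.encode returns a list of token ids.
    """
    return [
        example
        for example in dataset
        if len(tokenizer.encode(example["prompt"])) <= max_prefill_length
    ]


parser = argparse.ArgumentParser()
parser.add_argument(
    "--max-prefill-length",
    type=int,
    required=True,
    help="Drop prompt sequences longer than this before running inference.",
)
```

Filtering before inference avoids spending accelerator time on prompts that would exceed the prefill budget at generation time.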

Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.

Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

  • Tested with generate_distillation_dataset.py

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@hengtaoguo (Collaborator) left a comment


Thank you Surbhi!

Question out of curiosity: which model are you experimenting with for this SFT, and roughly what is the maximum prefill/max length supported on the chip you are using?

Asking because of the multimodal work: even the shortest prompt, Describe image <start_of_image>, will require 272 tokens in the prefill length. Mentioning this so that the group can be aware of it, thanks!

@SurbhiJainUSC (Collaborator, Author) replied:

I am using the base configurations for both (the model and the prefill/max length): https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/configs/base.yml#L470


@shralex (Collaborator) left a comment


Thanks Surbhi. Added a suggestion.

@copybara-service bot merged commit 12014ea into main on May 19, 2025
26 of 28 checks passed
@copybara-service bot deleted the distillation branch on May 19, 2025 at 21:51