[DeepSeek R1] Qwen2.5 Distillations #2236
This PR adds a distinct family of Qwen2.5 models, distilled from DeepSeek-R1:
While these models are technically distillations, DeepSeek's configurations change the tokenizer config and the preprocessing flow. To avoid the slippery slope of adding flag-based overrides to the existing Qwen models, and to complement #2171, we separate the tokenizer and preprocessor, adding the changes introduced with DeepSeek-R1's distillation as distinct classes and files.
Example Usage
Google Colab
2-line setup/prompt on Google Colab:
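A minimal sketch of the two-line flow; the preset name is an assumption for illustration, not necessarily the name this PR registers. In Colab the install step would be `!pip install -q keras-hub`:

```python
# Hypothetical 2-line Colab usage; the preset name below is an
# assumption, not the final registered name.
PROMPT = "What is 1 + 1?"

if __name__ == "__main__":
    import keras_hub  # requires `pip install keras-hub`

    lm = keras_hub.models.CausalLM.from_preset("deepseek_r1_distill_qwen_1.5b")
    print(lm.generate(PROMPT))
```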
Keras-Hub
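A sketch of loading the pieces through KerasHub's generic base classes, which dispatch to the model-specific subclasses (including the distinct tokenizer/preprocessor this PR adds). The preset name is an assumption; the actual class and preset names may differ:

```python
# Assumed preset name for illustration only.
PRESET = "deepseek_r1_distill_qwen_1.5b"

if __name__ == "__main__":
    import keras_hub  # requires `pip install keras-hub`

    # from_preset on the base classes resolves the DeepSeek-specific
    # tokenizer/preprocessor introduced in this PR.
    tokenizer = keras_hub.tokenizers.Tokenizer.from_preset(PRESET)
    causal_lm = keras_hub.models.CausalLM.from_preset(PRESET)
    print(causal_lm.generate("Solve x^2 - 4 = 0.", max_length=128))
```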
HuggingFace Equivalent
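The equivalent flow with `transformers`, using DeepSeek's published distillation checkpoint (the 1.5B variant here; swap the model id for other sizes):

```python
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer("What is 1 + 1?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```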
Numerical Equivalency
Currently, there seems to be noise in the numerics/weights when naively converting. Still looking into why this is happening.
Though, they're generally comparable. For example, taking the mean across the first axis of the `lm_head` (called `token_embedding` in KerasHub), we see a fairly similar profile, but not numerical equivalency.

This doesn't seem to affect the outputs that much, as seen above in the responses. I'll investigate further into why these discrepancies arise, since the weights should be loaded directly, as they are, into the structure of the model's components.