
[DeepSeek R1] Qwen2.5 Distillations #2236


Open · DavidLandup0 wants to merge 6 commits into master

Conversation

@DavidLandup0 (Collaborator) commented Apr 29, 2025

This PR adds a distinct family of Qwen2.5 models, distilled from DeepSeek-R1:

While these models are technically distillations, DeepSeek's released configurations change the tokenizer config and the preprocessing flow. To avoid the slippery slope of adding override flags to the existing Qwen classes, and to complement #2171, this PR keeps the tokenizer and preprocessor separate, adding the changes introduced with DeepSeek-R1's distillation as distinct classes and files.
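As a rough sketch of the intended class separation (only DeepSeekR1QwenCausalLM appears in the examples below; the tokenizer and preprocessor names here are placeholders for illustration):

import keras_hub

# The backbone architecture is plain Qwen2.5; only the tokenizer and
# preprocessing flow differ, so those live in their own classes instead of
# being toggled via flags on the existing Qwen classes.
# NOTE: DeepSeekR1QwenTokenizer / DeepSeekR1QwenCausalLMPreprocessor are
# assumed names, used here for illustration only.
tokenizer = keras_hub.models.DeepSeekR1QwenTokenizer.from_preset(
    "hf://deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
)
preprocessor = keras_hub.models.DeepSeekR1QwenCausalLMPreprocessor(
    tokenizer=tokenizer,
    sequence_length=256,
)
model = keras_hub.models.DeepSeekR1QwenCausalLM.from_preset(
    "hf://deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    preprocessor=preprocessor,
)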

Example Usage

Google Colab

2-line setup/prompt on Google Colab:

[Screenshot: Colab cell loading the model and generating a response]

Keras-Hub

Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:06:05) 
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
>>> import keras_hub
>>> hf_preset = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
>>> keras_hub_model = keras_hub.models.DeepSeekR1QwenCausalLM.from_preset(f"hf://{hf_preset}")
>>> keras_hub_model.generate("What is Keras?", max_length=24)
'What is Keras? Explain its applications?\nWhat is TensorFlow? Explain its applications?\nAlso Explain TensorFlow.js Applications.\n'

HuggingFace Equivalent

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> hf_preset = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
>>> deepseek_qwen = AutoModelForCausalLM.from_pretrained(hf_preset)
>>> deepseek_qwen_tokenizer = AutoTokenizer.from_pretrained(hf_preset)

>>> inputs = deepseek_qwen_tokenizer(["What is Keras?"], return_tensors="pt")
>>> outputs = deepseek_qwen.generate(**inputs, max_new_tokens=24)
>>> deepseek_qwen_tokenizer.decode(outputs[0])

'<|begin▁of▁sentence|>What is Keras? What is its purpose? What is Keras used for? What is Keras used for in practice? What is K'

Numerical Equivalency

Currently, there appears to be some noise in the weights when naively converting; I'm still looking into why this happens.
The weights are generally comparable, though. For example, taking the mean across the first axis of the lm_head (called token_embedding in KerasHub), we see a fairly similar profile, but not numerical equivalence:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.plot(keras_hub_model.backbone.token_embedding.get_weights()[0].mean(axis=0), label='KH', alpha=0.5)
>>> ax.plot(deepseek_qwen.lm_head.weight.mean(axis=0).detach().numpy(), label='HF', alpha=0.5)
[Plot: mean of the lm_head / token_embedding weights along the first axis, KerasHub vs. HuggingFace]

This doesn't seem to affect the outputs much, as seen in the responses above. I'll keep investigating why these discrepancies arise, since the conversion should load the weights as-is into the corresponding model components.
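A minimal sketch for quantifying the discrepancy, assuming both models are already loaded as in the snippets above:

import numpy as np

kh_w = keras_hub_model.backbone.token_embedding.get_weights()[0]
hf_w = deepseek_qwen.lm_head.weight.detach().numpy()
# Transpose if the two frameworks store the projection with swapped axes.
if kh_w.shape != hf_w.shape:
    hf_w = hf_w.T

print("max abs diff:", np.abs(kh_w - hf_w).max())
print("mean abs diff:", np.abs(kh_w - hf_w).mean())
print("allclose (atol=1e-5):", np.allclose(kh_w, hf_w, atol=1e-5))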

@pass-lin (Contributor) commented May 3, 2025

Here's a small suggestion: you could try evaluating the model on a math dataset. If the final results are comparable to those obtained with vLLM, we can ignore this error.
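For example, a rough spot-check along those lines (assuming the keras_hub_model from the example above; the prompts are placeholders, not a benchmark harness):

math_prompts = [
    "What is 17 * 24? Think step by step.",
    "A train travels 60 km in 45 minutes. What is its average speed in km/h?",
]
for prompt in math_prompts:
    # Compare these generations against the same prompts run through vLLM
    # with the original HuggingFace checkpoint.
    print(keras_hub_model.generate(prompt, max_length=256))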

@mattdangerw (Member) commented:

@DavidLandup0 it sounds like what we really need here is the ability to combine a QwenBackbone with a DeepSeek tokenizer? If so, I think we might be able to relax our requirements so that a high-level task (e.g. QwenCausalLM) could use a tokenizer from DeepSeek. This is something I have thought we probably need anyway.

I'll try to make a PR showing the basic loading changes, but lmk what you think conceptually!
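Conceptually, something like the following sketch; whether QwenCausalLM accepts a DeepSeek tokenizer this way is exactly what the proposed loading changes would have to enable, and QwenCausalLMPreprocessor is an assumed name here:

import keras_hub

# Load the DeepSeek-R1 distill tokenizer, but keep the stock Qwen task class.
tokenizer = keras_hub.tokenizers.Tokenizer.from_preset(
    "hf://deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
)
preprocessor = keras_hub.models.QwenCausalLMPreprocessor(tokenizer=tokenizer)
model = keras_hub.models.QwenCausalLM.from_preset(
    "hf://deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    preprocessor=preprocessor,
)
print(model.generate("What is Keras?", max_length=24))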
