Feature Request: Nemotron-4-340B-Instruct Support #7966
Comments
Well, even if it's not something that most people can run at home, it would still be really useful for those who can deploy it. Big GPUs can be rented in the cloud. This model feels like it's going to be a game changer! llama.cpp is simply the least headache-inducing way of running any LLM; renting hardware for this model is going to be expensive, and not having to fiddle with jank is nice. I also wonder how well the AMD MI300X would perform.
Given that a Q4-quantized 34B model requires about 20 GB of RAM, a Q4 quant of this 340B model should need on the order of 200 GB.
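A rough back-of-the-envelope check of that scaling (a sketch only; it assumes Q4 memory use scales linearly with parameter count and ignores KV cache and runtime overhead):

```python
# Rough estimate: Q4 footprint scales roughly with parameter count.
# Ignores KV cache, context length, and runtime overhead.
q4_34b_gb = 20            # reported footprint of a Q4 34B model
params_ratio = 340 / 34   # Nemotron-4-340B vs. a 34B model
print(f"~{q4_34b_gb * params_ratio:.0f} GB for a Q4 quant of the 340B model")
# -> ~200 GB
```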
I started working on this a few days ago and so far it's going well. I'll post the code to a GitHub branch after cleaning it up a bit.
My code for brave souls:
There is a new tokenizer in the code (SentencePiece BPE, modified to handle user-defined tokens). It's done this way because when I ran the original model in the NeMo framework, it passed the whole prompt to the SentencePiece tokenizer as a single string without any special-token preprocessing, so I did the same (`parse_special` is currently hardcoded to false). I wonder if it's possible to do this in a simpler way without adding a new tokenizer; I need to research this a bit more.
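For illustration, here is a minimal sketch of what "passing the whole prompt as a single string" means, using the `sentencepiece` Python package directly. The tokenizer model path and the `<extra_id_*>` role tags are assumptions based on the Nemotron-4 model card, not taken from the branch:

```python
import sentencepiece as spm

# Hypothetical path to the Nemotron SentencePiece model file.
sp = spm.SentencePieceProcessor(model_file="nemotron_tokenizer.model")

# The entire prompt, role tags included, is fed to SentencePiece as one
# string, with no special-token splitting beforehand. User-defined symbols
# declared in the .model file are matched by SentencePiece itself, which is
# what hardcoding parse_special=false in llama.cpp mimics.
prompt = "<extra_id_0>System\n\n<extra_id_1>User\nHello!\n<extra_id_1>Assistant\n"
print(sp.encode(prompt, out_type=int))
```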
Getting an error when trying to run the NeMo > safetensors conversion script:
@leafspark I have no idea what's wrong; maybe try installing the exact versions of the packages that I used:
Ended up fixing it by bypassing the …
@leafspark But this 847249408 number looks worrying (it's the length of the tensor data buffer); make sure that your model is fully downloaded. This tensor should have a buffer size of 9437184000 bytes (there are 8 files in the model.embedding.word_embeddings.weight directory, each 1179648000 bytes).
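As a quick sanity check, a sketch assuming the sharded layout described above, with the shards sitting as plain files inside that directory:

```python
from pathlib import Path

# Directory name taken from the comment above; adjust to your checkpoint layout.
shard_dir = Path("model.embedding.word_embeddings.weight")
sizes = [p.stat().st_size for p in sorted(shard_dir.iterdir()) if p.is_file()]

print(sizes)       # expected: 8 shards of 1179648000 bytes each
print(sum(sizes))  # expected total: 9437184000 bytes
```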
I verified the sha256 of all the files and they matched, but unfortunately I was unable to find the issue (I assume the Windows build of safetensors has a problem). For anyone else hitting this: I used failspy's original safetensors and wrote a script to rename the tensors, sketched below.
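A minimal sketch of such a rename script using the `safetensors` Python API; the name mapping and file names here are hypothetical, since the actual mapping depends on the checkpoint and on what the conversion script expects:

```python
from safetensors.torch import load_file, save_file

# Hypothetical mapping -- replace with the prefixes your conversion expects.
def rename(name: str) -> str:
    return name.replace("model.decoder.", "decoder.")  # illustrative only

src = "input.safetensors"    # hypothetical shard name
dst = "renamed.safetensors"

tensors = load_file(src)
save_file({rename(k): v for k, v in tensors.items()}, dst)
```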
How much overlap is there between Nemotron and Mistral NeMo? https://mistral.ai/news/mistral-nemo/ The Mistral blog post says that the model was developed in conjunction with NVIDIA, so it looks like it might be related? I'm rather unfamiliar with both, so I'm having a hard time telling how much overlap there is between the two models. #8577 is set up to track NeMo support in llama.cpp.
I think there's basically no overlap between the two. The only thing they have in common is that both can be run in the NVIDIA NeMo framework, but that doesn't imply anything specific.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Any plans to add support at some point in the future? Or should this be considered a WONTFIX?
Prerequisites
Feature Description
A super-huge new model from Nvidia
https://huggingface.co/nvidia/Nemotron-4-340B-Instruct
Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.
Motivation
Because the mountain was there.
But it may have little practical value because of its performance-to-price ratio.
Possible Implementation
No response