-
I love experimental stuff!
There aren't that many layers, so you could also just manually specify it as a certain value instead of calculating it as a function of the depth or whatever. I know there's some stuff for automatically learning/discovering other types of hyperparameters. Maybe the per-layer change could work similarly.

You could also do stuff like go the other way: start off fat, get thin. Or be thin at the ends and fat at the middle.

Here's a crazy operator idea that might kind of fit with the theme of squishing stuff. Normal broadcasting of a 1D tensor with a 2D tensor just reuses the same 1D row for every row of the 2D tensor, but what if you had a sparse 2D tensor where one of its rows is broadcasted for a while, then the next, and so on?
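A rough NumPy sketch of that kind of segmented broadcast; the helper `segmented_broadcast_add` and the chunk layout are purely illustrative assumptions, not an existing ggml operator:

```python
import numpy as np

# Normal broadcasting: the single 1D row is reused for every row of the 2D tensor.
x = np.arange(12, dtype=np.float32).reshape(6, 2)  # [6, 2] activations
row = np.array([10.0, 20.0], dtype=np.float32)     # [2]
print(x + row)                                     # same row added to all 6 rows

# Hypothetical "segmented" broadcast: a small 2D tensor whose rows are each
# broadcast over a contiguous chunk of rows - row 0 for the first chunk,
# row 1 for the next chunk, and so on.
def segmented_broadcast_add(x, rows, chunk):
    out = np.empty_like(x)
    for i, start in enumerate(range(0, x.shape[0], chunk)):
        out[start:start + chunk] = x[start:start + chunk] + rows[i]
    return out

rows = np.array([[10.0, 20.0],
                 [30.0, 40.0]], dtype=np.float32)  # [2, 2] "sparse" set of rows
print(segmented_broadcast_add(x, rows, chunk=3))   # rows[0] for rows 0-2, rows[1] for rows 3-5
```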
-
Hi @ggerganov, do you have links to the previous discussion? Is this something for projects such as Tinyllama or MiniMA?
-
Description
This is an idea that we discussed recently with @ikawrakow and we are sharing it here in case it is of any interest to the community.
The original LLaMA model has a constant hidden dimension throughout all the layers of the model. Based on some observations and intuition about the variable importance of the layers in the LLaMA architecture during quantization, we propose a slight change to the architecture that makes the size of the hidden state variable - i.e. it changes across the layers of the model.
We've brainstormed on a few variations of this idea and it can be expressed in different ways. Below is a schematic that illustrates one possible way to do it:
Left: original LLaMA 7B, Right: LLaMA* with increasing hidden dimension.
`L` is the layer index, starting from 1. Note that the LLaMA* model can have roughly 2x fewer parameters for the same number of layers, depending on the specific implementation.
Tensor shape changes (`L` is the layer index, starting from 1):

- `token_embd.weight`, from `[4096, 32000]` to `[128, 32000]`
- `attn_q.weight`, `attn_k.weight`, from `[4096, 4096]` to `[L*128, L*128]` or `[L*128, 4096]`
- `attn_v.weight`, from `[4096, 4096]` to `[L*128, L*128]` or `[4096, L*128]`
- `ffn_up`, `ffn_gate`, from `[4096, 11008]` to `[L*128, 11008]`
- `ffn_down`, from `[11008, 4096]` to `[11008, L*128]`

Other shapes are obvious or remain the same.
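As a rough sanity check of the ~2x estimate above, here is a small Python sketch that tallies the parameters of both variants, assuming 32 layers, the linear `L*128` schedule, that `attn_output` also follows the square `[L*128, L*128]` pattern (this is not specified above), and that each layer carries one `[L*128, (L+1)*128]` upscale matrix; norms and biases are ignored:

```python
# Rough parameter tally: vanilla LLaMA 7B vs the LLaMA* shapes listed above.
N_LAYERS, N_VOCAB, N_FF, D = 32, 32000, 11008, 4096

def vanilla_params():
    per_layer = 4 * D * D + 3 * D * N_FF            # attn q,k,v,o + ffn up,gate,down
    return N_LAYERS * per_layer + 2 * D * N_VOCAB   # + token_embd and output head

def llama_star_params():
    total = 128 * N_VOCAB                           # token_embd: [128, 32000]
    for L in range(1, N_LAYERS + 1):
        h = L * 128                                 # hidden size of layer L
        total += 4 * h * h                          # attn q,k,v,o (assumed square)
        total += 3 * h * N_FF                       # ffn up, gate, down
        if L < N_LAYERS:
            total += h * (h + 128)                  # assumed per-layer upscale matrix
    return total + N_LAYERS * 128 * N_VOCAB         # output head at the final width

v, s = vanilla_params(), llama_star_params()
print(f"vanilla ~{v / 1e9:.2f}B params, LLaMA* ~{s / 1e9:.2f}B params, ratio ~{v / s:.1f}x")
```

With these assumptions the LLaMA* variant comes out at roughly half the vanilla parameter count, consistent with the estimate above.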
`upscale` operator

Likely, instead of a linear increase of the hidden dimension (`L -> L+1`) as shown in the figure, an increase by a factor of 2 every `U` layers would be better, for example. This is not illustrated in the figure for simplicity, but the alternatives should be obvious. The exact implementation of the `upscale` operator is not specified, but ideas from other NN architectures could be borrowed. The most straightforward upscaling could be done via matrix multiplication with an extra rectangular 2D tensor `[in_embd, out_embd]` learned during training for each layer.
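For illustration only, the matrix-multiplication variant of the upscale step could look roughly like this, with `W_up` standing in for the extra learned `[in_embd, out_embd]` tensor:

```python
import numpy as np

def upscale(h, W_up):
    """Project the hidden state to the next layer's larger width.

    h    : [n_tokens, in_embd]   activations coming out of layer L
    W_up : [in_embd, out_embd]   extra learned per-layer projection
    """
    return h @ W_up                                # -> [n_tokens, out_embd]

# Example: going from layer L (hidden size L*128) to layer L+1 ((L+1)*128).
L = 3
h = np.random.randn(8, L * 128).astype(np.float32)                     # 8 tokens
W_up = (0.02 * np.random.randn(L * 128, (L + 1) * 128)).astype(np.float32)
print(upscale(h, W_up).shape)                                          # (8, 512)
```

Parameter-free alternatives (e.g. zero-padding the new channels or duplicating existing ones) would also fit the idea of borrowing from other architectures.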
Variations

- The hidden dimension does not have to grow monotonically - other profiles could be tried as well (for example a `U` shape). The main point is that it is no longer constant
- The number of attention heads can remain the same (`32`) or change with the size of the hidden dimension
- The hidden dimension is `128` in the first layer, but other numbers might be tried. The main requirement is to select the hidden size and the increase function such that we end up with roughly half the number of parameters compared to vanilla LLaMA
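A minimal sketch of the doubling schedule mentioned above, together with one way the head count could follow the hidden size; the choice of `U = 6` and the fixed head size of 128 are example values, not something specified in the proposal:

```python
def hidden_size(layer, base=128, U=6):
    """Hidden dimension that doubles every U layers (instead of growing linearly)."""
    return base * 2 ** ((layer - 1) // U)

def n_heads(layer, head_dim=128):
    """Head count that tracks the hidden size (the alternative is to keep it at 32)."""
    return max(1, hidden_size(layer) // head_dim)

for layer in (1, 6, 7, 12, 13, 24, 25, 31, 32):
    print(f"layer {layer:2d}: hidden {hidden_size(layer):4d}, heads {n_heads(layer):2d}")
```

With `U = 6` the last layers reach the vanilla width of 4096 (and 32 heads), similar to the linear `L*128` schedule.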
Why would this work? Why is this better?

Probably it is not. It's just something we intuitively think could lead to an improvement (for example, same or similar performance with fewer parameters) and that we haven't seen proposed yet. If it has already been proposed, then just ignore this post. Other than that, it's just a hypothesis that might or might not be worth looking into.
We think it would be interesting if a small LLaMA* model were trained and compared to a vanilla LLaMA model. Probably the llama2.c project could be utilized for training, or maybe even the train tools in `llama.cpp` could be enough.