-
I love experimental stuff!
There aren't that many layers, so you could also just manually specify it as a certain value instead of calculating it as a function of the depth or whatever. I know there's some stuff for automatically learning/discovering other types of hyperparameters. Maybe the per-layer change could work similarly.

You could also do stuff like go the other way: start off fat, get thin. Or be thin at the ends and fat at the middle.

Here's a crazy operator idea that might kind of fit with the theme of squishing stuff. Normal broadcasting of a 1D tensor with a 2D tensor just reuses the same 1D row for every row of the 2D tensor, but what if you had a sparse 2D tensor where one of its rows is broadcasted for a while, then the next, and so on?
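A rough NumPy sketch of that kind of segmented broadcast; the helper `segmented_broadcast_add` and the chunk layout are purely illustrative assumptions, not an existing ggml operator:

```python
import numpy as np

# Normal broadcasting: the single 1D row is reused for every row of the 2D tensor.
x = np.arange(12, dtype=np.float32).reshape(6, 2)  # [6, 2] activations
row = np.array([10.0, 20.0], dtype=np.float32)     # [2]
print(x + row)                                     # same row added to all 6 rows

# Hypothetical "segmented" broadcast: a small 2D tensor whose rows are each
# broadcast over a contiguous chunk of rows - row 0 for the first chunk,
# row 1 for the next chunk, and so on.
def segmented_broadcast_add(x, rows, chunk):
    out = np.empty_like(x)
    for i, start in enumerate(range(0, x.shape[0], chunk)):
        out[start:start + chunk] = x[start:start + chunk] + rows[i]
    return out

rows = np.array([[10.0, 20.0],
                 [30.0, 40.0]], dtype=np.float32)  # [2, 2] "sparse" set of rows
print(segmented_broadcast_add(x, rows, chunk=3))   # rows[0] for rows 0-2, rows[1] for rows 3-5
```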
-
Hi @ggerganov, do you have links to the previous discussion? Is this something for projects such as Tinyllama or MiniMA?
-
Description
This is an idea that we discussed recently with @ikawrakow and we are sharing it here in case it is of any interest to the community.
The original LLaMA model has a constant hidden dimension throughout all the layers of the model. Based on some observations and intuition about the variable importance of the layers in the LLaMA architecture during quantization, we propose a slight change to the architecture that makes the size of the hidden state variable - i.e. it changes across the layers of the model.
We've brainstormed on a few variations of this idea and it can be expressed in different ways. Below is a schematic that illustrates one possible way to do it:
Left: original LLaMA 7B, Right: LLaMA* with increasing hidden dimension.
`L` is the layer index, starting from 1. Note that the LLaMA* model can have roughly 2x fewer parameters for the same number of layers, depending on the specific implementation.
Tensor shape changes (`L` is the layer index, starting from 1):

- `token_embd.weight`, from `[4096, 32000]` to `[128, 32000]`
- `attn_q.weight`, `attn_k.weight`, from `[4096, 4096]` to `[L*128, L*128]` or `[L*128, 4096]`
- `attn_v.weight`, from `[4096, 4096]` to `[L*128, L*128]` or `[4096, L*128]`
- `ffn_up`, `ffn_gate`, from `[4096, 11008]` to `[L*128, 11008]`
- `ffn_down`, from `[11008, 4096]` to `[11008, L*128]`

Other shapes are obvious or remain the same.
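As a rough sanity check of the ~2x estimate above, here is a small Python sketch that tallies the parameters of both variants, assuming 32 layers, the linear `L*128` schedule, that `attn_output` also follows the square `[L*128, L*128]` pattern (this is not specified above), and that each layer carries one `[L*128, (L+1)*128]` upscale matrix; norms and biases are ignored:

```python
# Rough parameter tally: vanilla LLaMA 7B vs the LLaMA* shapes listed above.
N_LAYERS, N_VOCAB, N_FF, D = 32, 32000, 11008, 4096

def vanilla_params():
    per_layer = 4 * D * D + 3 * D * N_FF            # attn q,k,v,o + ffn up,gate,down
    return N_LAYERS * per_layer + 2 * D * N_VOCAB   # + token_embd and output head

def llama_star_params():
    total = 128 * N_VOCAB                           # token_embd: [128, 32000]
    for L in range(1, N_LAYERS + 1):
        h = L * 128                                 # hidden size of layer L
        total += 4 * h * h                          # attn q,k,v,o (assumed square)
        total += 3 * h * N_FF                       # ffn up, gate, down
        if L < N_LAYERS:
            total += h * (h + 128)                  # assumed per-layer upscale matrix
    return total + N_LAYERS * 128 * N_VOCAB         # output head at the final width

v, s = vanilla_params(), llama_star_params()
print(f"vanilla ~{v / 1e9:.2f}B params, LLaMA* ~{s / 1e9:.2f}B params, ratio ~{v / s:.1f}x")
```

With these assumptions the LLaMA* variant comes out at roughly half the vanilla parameter count, consistent with the estimate above.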
`upscale` operator

Likely, instead of a linear increase of the hidden dimension (`L -> L+1`) as shown in the figure, an increase by a factor of 2 every `U` layers would be better, for example. This is not illustrated in the figure for simplicity, but the alternatives should be obvious. The exact implementation of the `upscale` operator is not specified, but ideas from other NN architectures could be borrowed. The most straightforward upscaling could be done via matrix multiplication with an extra rectangular 2D tensor `[in_embd, out_embd]` learned during training for each layer.
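For illustration only, the matrix-multiplication variant of the upscale step could look roughly like this, with `W_up` standing in for the extra learned `[in_embd, out_embd]` tensor:

```python
import numpy as np

def upscale(h, W_up):
    """Project the hidden state to the next layer's larger width.

    h    : [n_tokens, in_embd]   activations coming out of layer L
    W_up : [in_embd, out_embd]   extra learned per-layer projection
    """
    return h @ W_up                                # -> [n_tokens, out_embd]

# Example: going from layer L (hidden size L*128) to layer L+1 ((L+1)*128).
L = 3
h = np.random.randn(8, L * 128).astype(np.float32)                     # 8 tokens
W_up = (0.02 * np.random.randn(L * 128, (L + 1) * 128)).astype(np.float32)
print(upscale(h, W_up).shape)                                          # (8, 512)
```

Parameter-free alternatives (e.g. zero-padding the new channels or duplicating existing ones) would also fit the idea of borrowing from other architectures.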
Variations

- The hidden dimension does not have to grow monotonically - other profiles could be tried as well (for example a `U` shape). The main point is that it is no longer constant
- The number of attention heads can remain the same (`32`) or change with the size of the hidden dimension
- The hidden dimension is `128` in the first layer, but other numbers might be tried. The main requirement is to select the hidden size and the increase function such that we end up with roughly half the number of parameters compared to vanilla LLaMA
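A minimal sketch of the doubling schedule mentioned above, together with one way the head count could follow the hidden size; the choice of `U = 6` and the fixed head size of 128 are example values, not something specified in the proposal:

```python
def hidden_size(layer, base=128, U=6):
    """Hidden dimension that doubles every U layers (instead of growing linearly)."""
    return base * 2 ** ((layer - 1) // U)

def n_heads(layer, head_dim=128):
    """Head count that tracks the hidden size (the alternative is to keep it at 32)."""
    return max(1, hidden_size(layer) // head_dim)

for layer in (1, 6, 7, 12, 13, 24, 25, 31, 32):
    print(f"layer {layer:2d}: hidden {hidden_size(layer):4d}, heads {n_heads(layer):2d}")
```

With `U = 6` the last layers reach the vanilla width of 4096 (and 32 heads), similar to the linear `L*128` schedule.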
Why would this work? Why is this better?

Probably it is not. It's just something we intuitively think could lead to an improvement (for example, same or similar performance with fewer parameters) and that we haven't seen proposed yet. If it has already been proposed, then just ignore this post. Other than that, it's just a hypothesis that might or might not be worth looking into.
We think it would be interesting if a small LLaMA* model were trained and compared to a vanilla LLaMA model. Probably the llama2.c project could be utilized for training, or maybe even the train tools in `llama.cpp` could be enough.