[HN Gopher] Efficient streaming language models with attention sinks
___________________________________________________________________
Efficient streaming language models with attention sinks
Author : guywithabowtie
Score : 276 points
Date : 2023-10-02 16:56 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| heavyarms wrote:
| Having only read the abstract, I'm probably way off the mark
| here, but my first thought was: LLM + LSTM.
| [deleted]
| kridsdale3 wrote:
| What if my "Favorite LLM" is GPT4? I don't want to use Llama or
| anything like that. Does this GitHub code let me use the OpenAI
| API and run the new memory technique on top of that?
| doctoboggan wrote:
| No, it does not
| [deleted]
| iandanforth wrote:
| My somewhat facetious take is that LLMs are trying really hard
| to reinvent RNNs, and they'd get there if we just gave them the
| tools to do so.
| anon291 wrote:
| I think many people agree with you. The main advantage of
| transformers over RNNs is training parallelization. RNNs are
| hard because training suffers from vanishing gradients and
| because it's difficult to keep the hardware fully utilized (you
| need large batches to get good throughput).
|
| The existence of models like RWKV indicates that there is
| potentially a future in training like a transformer but
| inferring like an RNN.
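|
| A toy illustration of that trade-off (a plain NumPy sketch of my
| own, not RWKV's actual formulation): causal attention scores
| every (query, key) pair in one batched matrix product, while a
| recurrent cell has to walk the sequence step by step.
|
| ```
| import numpy as np
|
| T, d = 6, 4                       # sequence length, hidden size
| x = np.random.randn(T, d)
|
| # Transformer-style: one batched matmul scores every (query, key)
| # pair at once, so training parallelizes across the sequence.
| scores = x @ x.T                  # (T, T)
| causal = np.tril(np.ones((T, T), dtype=bool))
| scores = np.where(causal, scores, -np.inf)
| weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
| weights /= weights.sum(axis=-1, keepdims=True)
| out_parallel = weights @ x        # all T outputs in one shot
|
| # RNN-style: the hidden state forces a sequential loop, so each
| # step has to wait for the previous one during training.
| W = np.random.randn(d, d) * 0.1
| h = np.zeros(d)
| out_sequential = []
| for t in range(T):
|     h = np.tanh(x[t] + W @ h)     # step t depends on step t-1
|     out_sequential.append(h)
| ```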
| [deleted]
| tkellogg wrote:
| One such project is RWKV[1]. It sat in the middle of the open
| source leaderboard for a while, so it really is a legit
| approach; it's just not hot.
|
| [1]: https://huggingface.co/blog/rwkv
| swyx wrote:
| side note - do you think the open source leaderboard is a
| fair representation of the diversity of OSS models?
| Nevermark wrote:
| Yes, indeedy.
|
| Many things learned over the last three decades with smaller
| (the current terminology is "extremely tiny"! :) neural
| networks are being revisited for these large models.
| obblekk wrote:
| RNNs are the correct solution, but infeasibly expensive to run.
|
| A different way to think about it is Transformer models are
| trying to predict which part of the RNN network is "worth"
| keeping given a resource constraint.
|
| Transformers use a simple heuristic today (and this result
| makes the heuristic better). Just as many NP-complete problems
| have approximations that are not perfectly correct but still
| useful, transformers show the same is true for neural networks.
| idiotsecant wrote:
| This is a big claim, curious to see what the caveats are.
| 13years wrote:
| So can it now understand and write complete applications?
| Jeff_Brown wrote:
| It seems hard to imagine that the model has a way of
| understanding a large codebase if its training has been on
| small chunks of text.
|
| But this stuff keeps on surprising me.
| Filligree wrote:
| Okay, what's the downside this time?
| a_wild_dandan wrote:
| Allegedly not "efficiency or performance", though I'm
| skeptical. Will dig into this later and update my comment (if I
| remember).
| ilovefood wrote:
| This is working relatively well, and the code is really worth a
| read. If you run it locally, consider the open PR and install
| sentencepiece as well. It's been generating text for the past 10
| minutes now :D
|
| Some of the instructions are ignored though, so I'd be careful
| there: one instruction is to rewrite the previous response
| "starting every sentence with the letter A", which is hit or
| miss right now.
| guywithabowtie wrote:
| How is the content quality?
| ilovefood wrote:
| It's okay, I have to say. I just ran out of memory on my 4090,
| so I had to retry on an A100. Here's an extract:
| https://pastebin.com/pzLfCFWt
|
| I think something might be off with the example. Can't wait for
| this stuff to work on llama.cpp. Going to try it with Mistral &
| StableLM now; thankfully tomorrow is a holiday in Germany :)
| cs702 wrote:
| On a first quick pass, this looks so good that I'm wondering if
| it's _too good to be true_!
|
| But the work looks to be of decent quality and the technique is
| remarkably straightforward:
|
| The idea is that each layer attends over the first token plus a
| sliding context window of recent tokens, ignoring everything in
| between.
|
| By implication, each layer must be gradually shifting relevant
| information forward in the sequence, enabling the top layer's
| ending sliding attention window to see it.
|
| The only caveat I can think of is that the sliding windows won't
| be able to shift all important information forward when their
| combined reach isn't sufficient to cover the entire sequence --
| for example, when model depth x window length < sequence length,
| if all windows have the same length.
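|
| Concretely, the eviction rule seems to be: keep a handful of
| leading "sink" entries in the KV cache plus a rolling window of
| the most recent entries, and drop everything in between. A rough
| sketch (a hypothetical helper of my own, not the repo's actual
| code; the 4 + 1020 split is just an illustrative setting):
|
| ```
| def streaming_keep_indices(seq_len, n_sink=4, window=1020):
|     """Which KV-cache entries to keep: the first n_sink "attention
|     sink" tokens plus the most recent `window` tokens; everything
|     in between is evicted. Positions are then assigned relative
|     to the rolled cache rather than to the original text."""
|     if seq_len <= n_sink + window:
|         return list(range(seq_len))
|     return list(range(n_sink)) + list(range(seq_len - window, seq_len))
|
| # After 100,000 generated tokens the cache still holds only
| # n_sink + window = 1024 entries, so memory and per-step cost
| # stay flat.
| print(len(streaming_keep_indices(100_000)))  # -> 1024
| ```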
| Nevermark wrote:
| The end of the sequence could be padded with constant "neutral"
| values?
| cs702 wrote:
| Wouldn't work. Imagine a sequence with 100 tokens, fed to a
| model with 10 layers, each with a sliding attention window
| spanning 5 tokens. The top layer's final sliding window can
| only see 5 trailing tokens, each of which can only see 5
| trailing tokens in the previous layer, and so on, for a total
| of 50 trailing tokens (plus the initial token) of maximum
| trailing context in the top layer.
|
| It's an inherent limitation of this approach.
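|
| Counting it out with the same toy numbers (treating the window
| as 5 strictly-previous tokens per layer):
|
| ```
| def max_trailing_reach(n_layers, window):
|     # Each layer lets the final position pull from `window` earlier
|     # positions of the layer below, so its reach grows by `window`
|     # tokens per layer.
|     return n_layers * window
|
| print(max_trailing_reach(n_layers=10, window=5))  # -> 50
| # Tokens earlier than that (other than the initial sink token)
| # can never propagate up to the top layer's last position.
| ```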
| Nevermark wrote:
| How about neutral value padding at the other end?
|
| I am having trouble visualizing this.
| [deleted]
| Van_Chopiszt wrote:
| The authors just uploaded a FAQ section, which may clarify some
| of the confusions: https://github.com/mit-han-lab/streaming-
| llm/blob/main/READM...
| bluecoconut wrote:
| Nice update. I think the key question they added that clarifies
| a lot is #3 (quoted below):
|
| > Can I input an extensive text, like a book, into StreamingLLM
| for summarization?
|
| > While you can input a lengthy text, the model will only
| recognize the latest tokens. Thus, if a book is an input,
| StreamingLLM might only summarize the concluding paragraphs,
| which might not be very insightful. As emphasized earlier, we
| neither expand the LLMs' context window nor enhance their
| long-term memory. StreamingLLM's strength lies in generating
| fluent text from recent tokens without needing a cache refresh.
| huevosabio wrote:
| This seems to be largely enabled by the observation that Softmax
| has to add up to one. From a quick glance [1], the model tends to
| use the first token as a placeholder for cases when you don't
| need to attend to any of the prior tokens.
|
| The first time I read about this issue, that Softmax is somewhat
| flawed, was in an HN post by Evan Miller [2] where he observes
| that forcing attention heads to allocate all of their attention
| to prior tokens is wrong, and that we should allow them to "not
| attend" by adding one to the softmax denominator.
|
| I love that they found a way to capitalize on this observation
| without having to retrain models. However, I wonder what the
| models would look like if they had followed Evan's suggestion!
|
| [1] Their description of attention sinks:
|
| ```
|
| To understand the failure of window attention, we find an
| interesting phenomenon of autoregressive LLMs: a surprisingly
| large amount of attention score is allocated to the initial
| tokens, irrespective of their relevance to the language modeling
| task, as visualized in Figure 2. We term these tokens "attention
| sinks". Despite their lack of semantic significance, they collect
| significant attention scores. We attribute the reason to the
| Softmax operation, which requires attention scores to sum up to
| one for all contextual tokens. Thus, even when the current query
| does not have a strong match in many previous tokens, the model
| still needs to allocate these unneeded attention values somewhere
| so it sums up to one. The reason behind initial tokens as sink
| tokens is intuitive: initial tokens are visible to almost all
| subsequent tokens because of the autoregressive language modeling
| nature, making them more readily trained to serve as attention
| sinks.
|
| ```
|
| [2] https://news.ycombinator.com/item?id=36851494
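|
| For reference, the tweak in [2] is tiny; a minimal illustrative
| sketch (my own NumPy, not anyone's production code):
|
| ```
| import numpy as np
|
| def softmax(scores):
|     e = np.exp(scores - scores.max())
|     return e / e.sum()              # forced to sum to exactly 1
|
| def softmax_plus_one(scores):
|     # The extra +1 in the denominator acts like a virtual token
|     # with score 0, so a head can assign (almost) no weight to
|     # real tokens instead of dumping the leftover mass somewhere.
|     e = np.exp(scores - scores.max())
|     return e / (np.exp(-scores.max()) + e.sum())
|
| scores = np.array([-4.0, -5.0, -3.5])   # "nothing here is relevant"
| print(softmax(scores).sum())            # 1.0 -- weight must go somewhere
| print(softmax_plus_one(scores).sum())   # ~0.05 -- head can stay quiet
| ```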
| fpgaminer wrote:
| That was the first time I'd read about it on HN, but as pointed
| out on that HN post it wasn't the first time Softmax + 1 was
| proposed. And, AFAIK, it has never resulted in better
| performance in practice. Maybe Softmax + 1 works better for
| fiddling with the attention window after training, but I don't
| know if anyone has tested that at scale.
| huevosabio wrote:
| Actually, it seems like they did try the suggestion out, by
| pre-training a small model with an all-zero "sink" token (which
| amounts to the softmax-plus-one idea).
|
| The verdict seems to be that with the zero sink you still end up
| with other initial tokens being used as sinks, so a dedicated
| learnable sink token works better.
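|
| If I'm reading the ablation right, the learnable variant is
| roughly: prepend one extra trainable embedding to every training
| sequence so heads always have a dedicated place to park spare
| attention. A PyTorch-flavored sketch (illustrative only, not the
| paper's code):
|
| ```
| import torch
| import torch.nn as nn
|
| class SinkTokenPrepend(nn.Module):
|     """Prepend a single trainable "sink" embedding to each
|     sequence before the transformer blocks."""
|     def __init__(self, d_model):
|         super().__init__()
|         self.sink = nn.Parameter(torch.zeros(1, 1, d_model))
|
|     def forward(self, token_embeddings):  # (batch, seq, d_model)
|         sink = self.sink.expand(token_embeddings.size(0), -1, -1)
|         return torch.cat([sink, token_embeddings], dim=1)
|
| # During streaming inference this token would always stay in the
| # KV cache, so heads keep their "dump the leftover attention
| # here" slot.
| ```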
| guywithabowtie wrote:
| We introduce StreamingLLM, an efficient framework that enables
| LLMs trained with a finite length attention window to generalize
| to infinite sequence length without any fine-tuning. We show that
| StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to
| perform stable and efficient language modeling with up to 4
| million tokens and more.
| stavros wrote:
| Sorry, what does "up to 4 million tokens and more" mean? It
| seems like a contradiction.
| jamesblonde wrote:
| Here's a reference describing what a context window for LLMs
| is:
|
| https://www.hopsworks.ai/dictionary/context-window-for-llms
| catskul2 wrote:
| Not really a contradiction so much as redundant/poorly
| worded. Should have said, "at least 4 million tokens".
| [deleted]
| bluecoconut wrote:
| I think people are misreading this work and assuming it is
| equivalent to full dense attention. It's just saying there's an
| efficiency gain over sliding-window re-computation: instead of
| paying the L^2 cost over and over (T times), you can re-use a
| cache and maintain perplexity. I don't think they are claiming
| that this allows for attending to content that was far away.
|
| They tested by concatenating and measuring -> `Q A Q A Q A Q
| A...`, not by doing `Q Q Q Q A A A A...`
|
| They also measure perplexity, showing that it produces "readable
| text" (coherent, locally viable); not that it is "extracting
| anything" from the big triangle gap of no-attention.
|
| I think this would fail at: given a book, write the first word
| of every paragraph. Or: given a book, write a one-sentence
| summary of each chapter. I might be wrong, because they didn't
| test tasks like this, but I'd be very, very surprised.
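|
| Back-of-the-envelope numbers for that efficiency claim (my own
| arithmetic, counting only attention-score evaluations per
| layer):
|
| ```
| L = 1024        # window / cache size
| T = 100_000     # tokens generated
|
| # Sliding window WITH re-computation: each step rebuilds the
| # whole window from scratch, roughly L^2 score evaluations.
| recompute = T * L * L
|
| # StreamingLLM: the rolling cache is re-used, so each step only
| # scores the new query against ~L cached keys.
| streaming = T * L
|
| print(f"{recompute // streaming}x fewer score evaluations")  # -> 1024x
| ```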
| fpgaminer wrote:
| Correct, but to be fair to readers (like me) the use of the
| term "infinite-length inputs" is misleading.
|
| Still, really interesting work. The most salient bit is the
| discovery shown in Figure 2, summarized as:
|
| > (1) The attention maps in the first two layers (layers 0 and
| 1) exhibit the "local" pattern, with recent tokens receiving
| more attention. (2) Beyond the bottom two layers, the model
| heavily attends to the initial token across all layers and
| heads.
|
| > surprisingly large amount of attention score is allocated to
| the initial tokens, irrespective of their relevance to the
| language modeling task, as visualized in Figure 2. We term
| these tokens "attention sinks". Despite their lack of semantic
| significance, they collect significant attention scores. We
| attribute the reason to the Softmax operation, which requires
| attention scores to sum up to one for all contextual tokens.
| Thus, even when the current query does not have a strong match
| in many previous tokens, the model still needs to allocate
| these unneeded attention values somewhere so it sums up to one.
| The reason behind initial tokens as sink tokens is intuitive:
| initial tokens are visible to almost all subsequent tokens
| because of the autoregressive language modeling nature, making
| them more readily trained to serve as attention sinks.
|
| StreamingLLM is basically a "hack" that fixes this odd behavior
| when we go around butchering the LLM's attention window.
|
| This actually isn't the first time cracks have been shown in the
| use of softmax, and it makes me wonder whether a different
| function might be better if we want context-length-flexible
| LLMs.
| bluecoconut wrote:
| EDIT: the authors have updated the readme to add a clarified
| FAQ section that directly addresses this:
| https://github.com/mit-han-lab/streaming-llm#faq
|
| Just tested it - this definitely doesn't seem to be giving
| enhanced context length. It does run quickly though; I can
| confirm it was using about 35 GB of an A100's RAM, and it kept
| that usage pinned for the entire duration.
|
| I ran through it by getting a book from Project Gutenberg,
| splitting it into paragraphs, and feeding them in paragraph by
| paragraph (asking it to say "okay" after each paragraph), then
| at the end asked some questions. It entirely hallucinated its
| answers. (Also note: in the ~10 min of playing with this, I
| couldn't get the base model (lmsys/vicuna-13b-v1.3) to respond
| in English...)
|
| https://gist.github.com/bluecoconut/9cae9e91fe3b1616ed650a96...
| dheera wrote:
| I feel like information theory prevents full information
| retention for unlimited context lengths and finite compute, but
| I don't know if we're close enough to those limits for that
| argument to apply. Or rather, I don't know how to make a good
| analysis of (bits of context information) per (bits of model
| parameters).
| foota wrote:
| I could be wrong, but I'm not sure this is about what people seem
| to think it is, e.g., letting LLMs reference content past the
| trained length.
|
| I think it may just be about the performance of the model on
| longer texts (for the things still within the context window?).
| It sounds like they're arguing that the model essentially learns
| to stash some baggage in its attention to the initial tokens of
| the text, and breaks when those tokens are no longer within the
| window, for reasons I'm not sure I understand (after all, isn't
| text in the middle just as good as text at the start for
| non-instruction inputs?)
| arxiv_papers wrote:
| https://youtu.be/hfJIOd2WCQ0
| WhatsName wrote:
| So can I let Llama 2 summarize books now, or are there any
| non-obvious caveats to this approach?
| Sharlin wrote:
| No. This does nothing to the context length itself, which is
| still a sliding window.
| doctoboggan wrote:
| How do any of these sliding-window techniques handle instructions
| that aren't expected and only show up at the end? For example,
| imagine feeding a book to the model with the last sentence being
| the instruction "return the count of the letter m in the previous
| input". A human would handle this by first letting out an
| exasperated sigh, but then re-reading from the start while
| counting. An LLM has no ability to loop back and re-read the
| input. (Ignore LLM issues with character counting for this
| example.) It seems like to solve this problem for real, the LLM
| needs to be able to loop and jump arbitrarily, but I'm sure that
| would introduce a whole new host of issues and possibly require
| a new architecture altogether.
| tornato7 wrote:
| Is it so hard to ask the user to put instructions at the
| beginning? Claude 100K asks users to put instructions at the
| end.
|
| Or you could just use a quick model to check whether there are
| instructions at the end and move them to the beginning.
| IanCal wrote:
| One option would be similar to function calling: give the LLM an
| output it can emit that changes how the context is parsed.
| That's a layer on top, rather than changing how the LLM itself
| works.
| omneity wrote:
| Does an LLM need to loop back to re-read its input, even in a
| regular (read: non-sliding) context window?
|
| Maybe I'm misunderstanding, but doesn't the hidden state solve
| the "lookup" problem in this case? In the sense that the LLM
| needs to ingest your entire input anyway before answering, so
| whether your instruction is at the front or at the end has
| little impact beyond attention.
| doctoboggan wrote:
| It's my understanding that in regular non-sliding window
| context models the llm is able to pay attention to any part
| of the input when generating the output. The attention head
| is essentially able to jump back and forward to any point in
| its context window. This is what differentiates the attention
| mechanism from other models that use token proximity as a
| proxy for relevance.
| alex_duf wrote:
| The example seems like a weird edge case. I don't even know if
| current models are capable of this in a short input.
| refulgentis wrote:
| I agree - even just tokenization screws you here, I'm 95% sure.
| I.e. the raw input isn't letters but one of 100K integers that
| each represent some set of letters.
|
| That being said, this is probably a naive take, since we're
| seeing them do so much. And I bet we could get it to count
| correctly with at least some short input, and given infinite
| runs it's probably trivial (i.e. for N characters, split into N
| inputs, and for each one ask "say true if it is an M, false
| otherwise").
| doctoboggan wrote:
| I understand that, which is why I said "Ignore LLM issues
| with character counting for this example". It was a quick
| example, please see my other comment with a better example.
| refulgentis wrote:
| I see; active listening + relating it to my knowledge on my end,
| lmk if I compressed too much:
|
| you're curious whether there's noticeably worse performance when
| the Q is at the end of the content rather than before it
|
| No, there's a good paper on this somewhere with Claude 100K;
| tldr, it's sort of bow-shaped: the beginning and end had equally
| high rates but the middle would suffer
| doctoboggan wrote:
| No, what I am specifically asking about is these sliding
| window attention techniques. As far as I understand it
| Claude 100K actually uses a 100k context window, and not
| a sliding window.
| doctoboggan wrote:
| Ignore the specific example of counting characters, I was
| just quickly coming up with a situation where the instruction
| is at the end of the input. Here is a better example:
|
| Input the full text of a novel, then ask for a minor detail
| (e.g. the color of a car that is briefly mentioned in the middle
| of the book). Again, a human can do this by flipping back to the
| relevant section, but LLMs have no mechanism for this when using
| a sliding-window attention scheme.
|
| If the full input can fit in the context window then any LLM
| today would be able to extract the color of the car.
| [deleted]
| refulgentis wrote:
| This looks fantastic. It also answers the question of how
| relevant the "off-by-one" softmax is.*
|
| My naive question is...does it work? But that sounds dismissive.
| At length:
|
| It shows that the model can't respond after a certain length
| versus a proposed model that does continue to respond.
|
| But can a model that continues to respond retrieve information
| far "in the past"?
|
| The demo video is too low-level, at least to my brain. It shows
| one model stops responding but the proposed one continues.
|
| I spent about 5 minutes going frame by frame to see if the
| proposed model ever attempts to "recall" information from
| further back, but it looks like no.
|
| Perfection here isn't necessary or even possible AFAIK, i.e. I
| don't expect it to recall page 1 100% accurately at page 1000.
| But can it recall _anything_ from it, even if it ignores it?
|
| The great thing about this era and work is we can check. But I
| hope someone has it up in a HuggingFace space before I figure out
| how to run it myself. :P
|
| I'm leaning no, based on the sliding window thing. It sounds
| like there are 4 fixed tokens, then the last (context size - 4)
| tokens, and that's it.
|
| * at the time, there were two camps: one, it's some random
| person saying it, and there's prior art on implementations that
| do the off-by-one. Two, you'd be surprised how many little
| things go unnoticed by large groups, and do matter.
| orangecoconut1 wrote:
| [flagged]
| smeeth wrote:
| Adding attention cache memory is an extremely interesting
| solution to this problem.
|
| If anyone is curious, there was another paper [0] that came out a
| few days ago that made a related observation in Vision
| Transformers. Transformer models appear to pick tokens to store
| global information in - they need tokens to "think". You can eke
| out some performance improvements (and get cool explanation
| images) by providing the model with dedicated tokens for this
| purpose.
|
| [0] https://arxiv.org/pdf/2309.16588.pdf
| Nevermark wrote:
| It would be an interesting place to add additional units to an
| already trained model, to continue training and get better
| performance, or to fine tuning.
|
| For tuning, keep the original model parameters fixed, and only
| let the model adjust parameters to and from new "tuning" cache
| units.
|
| This would allow different tuning unit sets to be swapped in,
| or even used together. Foul language avoidance units + specific
| terminology units + be concise units, etc.
|
| Mix and match tuned unit sets, like super prompts.
|
| --
|
| If the number of new parameters is low enough, higher order
| optimization (requiring higher memory) might be a possibility
| for very fast and effective tuning.
|
| --
|
| And maybe grow the sequence length, and number of units, during
| training. A few units for short sequences. Then increase
| training sequence length, add more units, continue training,
| and so on.
|
| Perhaps some kind of performance or gradient analysis could
| govern cache expansion, so an arbitrary schedule is not
| required.
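|
| The frozen-base part of this is close in spirit to prefix/prompt
| tuning; a rough sketch of what it could look like (hypothetical
| PyTorch, toy sizes):
|
| ```
| import torch
| import torch.nn as nn
|
| d_model, n_units = 64, 8
| base = nn.TransformerEncoder(
|     nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
|                                batch_first=True),
|     num_layers=2)
| for p in base.parameters():
|     p.requires_grad = False           # original model stays fixed
|
| # New swappable "tuning units": the only trainable parameters.
| tuning_units = nn.Parameter(torch.randn(1, n_units, d_model) * 0.02)
|
| def forward(x):                       # x: (batch, seq, d_model)
|     units = tuning_units.expand(x.size(0), -1, -1)
|     return base(torch.cat([units, x], dim=1))[:, n_units:]
|
| # Train a different `tuning_units` tensor per behavior and swap
| # them in, like the comment suggests.
| optimizer = torch.optim.Adam([tuning_units], lr=1e-3)
| ```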
___________________________________________________________________
(page generated 2023-10-02 23:00 UTC)