[HN Gopher] Efficient streaming language models with attention s...
       ___________________________________________________________________
        
       Efficient streaming language models with attention sinks
        
       Author : guywithabowtie
       Score  : 276 points
       Date   : 2023-10-02 16:56 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | heavyarms wrote:
       | Having only read the abstract, I'm probably way off the mark
       | here, but my first thought was: LLM + LSTM.
        
         | [deleted]
        
       | kridsdale3 wrote:
       | What if my "Favorite LLM" is GPT4? I don't want to use Llama or
       | anything like that. Does this GitHub code let me use the OpenAI
       | API and run the new memory technique on top of that?
        
         | doctoboggan wrote:
         | No, it does not
        
       | [deleted]
        
       | iandanforth wrote:
       | My somewhat facetious take is that LLMs are trying really hard to
       | reinvent RNNs, and they would if we just gave them the tools to
       | do so.
        
         | anon291 wrote:
         | I think many people would agree with you. The main advantage of
         | transformers over RNNs is training parallelization. RNNs are
         | hard to train because they suffer from vanishing gradients, and
         | because it's difficult to get full hardware utilization (you
         | need large batches to keep the hardware busy).
         | 
         | The existence of models like RWKV indicates that there is
         | potentially a future in training like a transformer but
         | inferring like an RNN.
        
         | [deleted]
        
         | tkellogg wrote:
         | One such project is RWKV[1]. On the open-source leaderboard it
         | lived in the middle of the pack for a while, so it really is a
         | legit approach; it's just not hot right now.
         | 
         | [1]: https://huggingface.co/blog/rwkv
        
           | swyx wrote:
           | side note - do you think the open source leaderboard is a
           | fair representation of the diversity of OSS models?
        
         | Nevermark wrote:
         | Yes, indeedy.
         | 
         | Many things learned over the last three decades with smaller
         | (the current terminology is "extremely tiny"! :) neural
         | networks are being revisited for these large models.
        
         | obblekk wrote:
         | RNNs are the correct solution, but infeasibly expensive to run.
         | 
         | A different way to think about it is that Transformer models are
         | trying to predict which part of the RNN network is "worth"
         | keeping given a resource constraint.
         | 
         | Transformers use a simple heuristic today (and this result makes
         | the heuristic better). Just as with many NP-complete problems,
         | there can be approximations that are not perfectly correct but
         | are still useful. Transformers prove that is the case for neural
         | networks.
        
       | idiotsecant wrote:
       | This is a big claim, curious to see what the caveats are.
        
       | 13years wrote:
       | So can it now understand and write complete applications?
        
         | Jeff_Brown wrote:
         | It seems hard to imagine, if its training has been on small
         | chunks of text, that the model has a way of understanding a
         | large codebase.
         | 
         | But this stuff keeps on surprising me.
        
       | Filligree wrote:
       | Okay, what's the downside this time?
        
         | a_wild_dandan wrote:
         | Allegedly not "efficiency or performance", though I'm
         | skeptical. Will dig into this later and update my comment (if I
         | remember).
        
       | ilovefood wrote:
       | This is working relatively well, the code is really worth a read.
       | If you run it locally, consider the open PR and install
       | sentencepiece as well. It's been generating text for the past 10
       | minutes now :D
       | 
       | Some of the instructions are ignored though so I'd be careful
       | there, one instruction is to rewrite the previous response by
       | "starting every sentence with the letter A" which is a bit of a
       | hit or miss right now.
        
         | guywithabowtie wrote:
         | How is the content quality?
        
           | ilovefood wrote:
           | It's okay, I have to say. I just ran out of memory on my 4090,
           | so I had to retry on an A100. Here's an extract:
           | https://pastebin.com/pzLfCFWt
           | 
           | I think something might be off with the example. Can't wait
           | for this stuff to work on llama.cpp. Going to try it with
           | mistral & stable lm now; thankfully tomorrow is a holiday in
           | Germany :)
        
       | cs702 wrote:
       | On a first quick pass, this looks so good that I'm wondering if
       | it's _too good to be true_!
       | 
       | But the work looks to be of decent quality and the technique is
       | remarkably straightforward:
       | 
       | The idea is, in each layer, to apply attention over the first
       | token and a sliding context window of recent tokens, ignoring
       | everything in between.
       | 
       | By implication, each layer must be gradually shifting relevant
       | information forward in the sequence, enabling the top layer's
       | ending sliding attention window to see it.
       | 
       | The only caveat I can think of is that the sliding windows won't
       | be able to shift all important information forward when their
       | combined span doesn't cover the entire sequence -- for example,
       | when model depth x window length < sequence length, if all
       | windows have the same length.
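       | 
       | A minimal sketch of that cache policy in plain Python, assuming
       | the KV cache is a per-token list (the function name and the
       | default of 4 sink tokens are illustrative, not the authors' code):
       | 
       |     def evict(kv_cache, num_sinks=4, window=1020):
       |         # keep the first few "sink" tokens plus the most
       |         # recent `window` tokens; drop everything between
       |         if len(kv_cache) <= num_sinks + window:
       |             return kv_cache
       |         return kv_cache[:num_sinks] + kv_cache[-window:]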
        
         | Nevermark wrote:
         | The end of the sequence could be padded with constant "neutral"
         | values?
        
           | cs702 wrote:
           | Wouldn't work. Imagine a sequence with 100 tokens, fed to a
           | model with 10 layers, each with a sliding attention window
           | spanning 5 tokens. The top layer's final sliding window can
           | only see 5 trailing tokens, each of which can only see 5
           | trailing tokens in the previous layer, and so on, for a total
           | of 50 trailing tokens (plus the initial token) of maximum
           | trailing context in the top layer.
           | 
           | It's an inherent limitation of this approach.
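           | 
           | As a quick sanity check on that arithmetic (illustrative
           | numbers only):
           | 
           |     depth, window = 10, 5
           |     max_trailing = depth * window   # 50 tokens
           |     # plus the kept initial token; anything earlier
           |     # is out of reach for the top layer's final window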
        
             | Nevermark wrote:
             | How about neutral value padding at the other end?
             | 
             | I am having trouble visualizing this.
        
         | [deleted]
        
       | Van_Chopiszt wrote:
       | The authors just uploaded a FAQ section, which may clear up some
       | of the confusion: https://github.com/mit-han-lab/streaming-
       | llm/blob/main/READM...
        
         | bluecoconut wrote:
         | Nice update. I think the key question they added that clarifies
         | a lot is #3 (quoted below):
         | 
         |     Can I input an extensive text, like a book, into
         |     StreamingLLM for summarization?
         | 
         |     While you can input a lengthy text, the model will only
         |     recognize the latest tokens. Thus, if a book is an input,
         |     StreamingLLM might only summarize the concluding paragraphs,
         |     which might not be very insightful. As emphasized earlier,
         |     we neither expand the LLMs' context window nor enhance their
         |     long-term memory. StreamingLLM's strength lies in generating
         |     fluent text from recent tokens without needing a cache
         |     refresh.
        
       | huevosabio wrote:
       | This seems to be largely enabled by the observation that Softmax
       | has to add up to one. From a quick glance [1], the model tends to
       | use the first token as a placeholder for cases when you don't
       | need to attend to any of the prior tokens.
       | 
       | The first time I read about this issue, that Softmax is somewhat
       | flawed, was in an HN post by Evan Miller [2] where he observes
       | that forcing attention heads to allocate all attention to prior
       | tokens is wrong, and that we should allow them to "not attend" by
       | adding one to the softmax denominator.
       | 
       | I love that they found a way to capitalize on this observation
       | without having to retrain models. However, I wonder what the
       | models would look like if they followed Evan's suggestion!
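       | 
       | For the curious, Evan's suggestion amounts to adding 1 to the
       | softmax denominator so a head can attend to (almost) nothing. A
       | toy PyTorch version via the zero-logit trick (illustrative, not
       | from this paper):
       | 
       |     import torch
       | 
       |     def softmax1(scores):
       |         # exp(x_i) / (1 + sum_j exp(x_j)): append a zero
       |         # logit, softmax, then drop the extra column
       |         zero = torch.zeros_like(scores[..., :1])
       |         padded = torch.cat([scores, zero], dim=-1)
       |         return torch.softmax(padded, dim=-1)[..., :-1]
       | 
       |     s = torch.tensor([[-4.0, -5.0, -3.0]])  # weak matches
       |     print(torch.softmax(s, dim=-1))  # forced to sum to 1
       |     print(softmax1(s))               # can sum to < 1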
       | 
       | [1] Their description of attention sinks:
       | 
       | ```
       | 
       | To understand the failure of window attention, we find an
       | interesting phenomenon of autoregressive LLMs: a surprisingly
       | large amount of attention score is allocated to the initial
       | tokens, irrespective of their relevance to the language modeling
       | task, as visualized in Figure 2. We term these tokens "attention
       | sinks". Despite their lack of semantic significance, they collect
       | significant attention scores. We attribute the reason to the
       | Softmax operation, which requires attention scores to sum up to
       | one for all contextual tokens. Thus, even when the current query
       | does not have a strong match in many previous tokens, the model
       | still needs to allocate these unneeded attention values somewhere
       | so it sums up to one. The reason behind initial tokens as sink
       | tokens is intuitive: initial tokens are visible to almost all
       | subsequent tokens because of the autoregressive language modeling
       | nature, making them more readily trained to serve as attention
       | sinks.
       | 
       | ```
       | 
       | [2] https://news.ycombinator.com/item?id=36851494
        
         | fpgaminer wrote:
         | That was the first time I'd read about it on HN, but as pointed
         | out on that HN post it wasn't the first time Softmax + 1 was
         | proposed. And, AFAIK, it has never resulted in better
         | performance in practice. Maybe Softmax + 1 works better for
         | fiddling with the attention window after training, but I don't
         | know if anyone has tested that at scale.
        
         | huevosabio wrote:
         | Actually, it seems like they did try the suggestion out,
         | basically by training a model with a dedicated all-zeros sink
         | token.
         | 
         | The verdict seems to be that you still end up with other initial
         | tokens being used as sinks, so it is better to have a dedicated,
         | learnable sink token.
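         | 
         | Roughly, "pre-train with a dedicated sink token" could look
         | like the sketch below (hypothetical names, PyTorch; the paper's
         | actual setup may differ):
         | 
         |     import torch
         | 
         |     vocab, d_model = 32000, 4096
         |     tok_emb = torch.nn.Embedding(vocab, d_model)
         |     # one learnable token prepended to every sequence,
         |     # a dedicated place to park "unneeded" attention
         |     sink = torch.nn.Parameter(torch.zeros(1, 1, d_model))
         | 
         |     def embed_with_sink(ids):           # (batch, seq)
         |         x = tok_emb(ids)                # (batch, seq, d)
         |         s = sink.expand(x.size(0), -1, -1)
         |         return torch.cat([s, x], dim=1) # (batch, 1+seq, d)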
        
       | guywithabowtie wrote:
       | We introduce StreamingLLM, an efficient framework that enables
       | LLMs trained with a finite length attention window to generalize
       | to infinite sequence length without any fine-tuning. We show that
       | StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to
       | perform stable and efficient language modeling with up to 4
       | million tokens and more.
        
         | stavros wrote:
         | Sorry, what does "up to 4 million tokens and more" mean? It
         | seems like a contradiction.
        
           | jamesblonde wrote:
           | Here's a reference describing what a context window for LLMs
           | is:
           | 
           | https://www.hopsworks.ai/dictionary/context-window-for-llms
        
           | catskul2 wrote:
           | Not really a contradiction so much as redundant/poorly
           | worded. Should have said, "at least 4 million tokens".
        
         | [deleted]
        
       | bluecoconut wrote:
       | I think people are misreading this work, and assuming this is
       | equivalent to full dense-attention. This is just saying its an
       | efficiency gain over sliding window re-computation, where instead
       | of computing the L^2 cost over and over (T times), you can re-use
       | a cache and maintain perplexity. I don't think they are claiming
       | that this allows for attending to content that was far away.
       | 
       | They tested by running concatenating and measuring -> `Q A Q A Q
       | A Q A...` not by doing `Q Q Q Q A A A A...`
       | 
       | They also measure perplexity, showing that it produces "readable
       | text" (coherent, locally viable); not that it is "extracting
       | anything" from the big-triangle-gap of no-attention.
       | 
       | I think this would fail to be given a book, then write the first
       | word of every paragraph. Or, given a book, write a 1 sentence
       | summary of each chapter. I might be wrong, because they didn't
       | test tasks like this, but I'd be very very surprised.
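       | 
       | Back-of-the-envelope version of the per-token saving being
       | claimed (illustrative numbers only):
       | 
       |     L = 4096            # window / cache size
       |     recompute = L * L   # re-encode the window per new token
       |     with_cache = L      # one new query vs. L cached keys
       |     print(recompute // with_cache)  # ~4096x fewer scores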
        
         | fpgaminer wrote:
         | Correct, but to be fair to readers (like me) the use of the
         | term "infinite-length inputs" is misleading.
         | 
         | Still, really interesting work. The most salient bit is the
         | discovery shown in Figure 2, summarized as:
         | 
         | > (1) The attention maps in the first two layers (layers 0 and
         | 1) exhibit the "local" pattern, with recent tokens receiving
         | more attention. (2) Beyond the bottom two layers, the model
         | heavily attends to the initial token across all layers and
         | heads.
         | 
         | > surprisingly large amount of attention score is allocated to
         | the initial tokens, irrespective of their relevance to the
         | language modeling task, as visualized in Figure 2. We term
         | these tokens "attention sinks". Despite their lack of semantic
         | significance, they collect significant attention scores. We
         | attribute the reason to the Softmax operation, which requires
         | attention scores to sum up to one for all contextual tokens.
         | Thus, even when the current query does not have a strong match
         | in many previous tokens, the model still needs to allocate
         | these unneeded attention values somewhere so it sums up to one.
         | The reason behind initial tokens as sink tokens is intuitive:
         | initial tokens are visible to almost all subsequent tokens
         | because of the autoregressive language modeling nature, making
         | them more readily trained to serve as attention sinks.
         | 
         | StreamingLLM is basically a "hack" that fixes this odd behavior
         | when we go around butchering the LLM's attention window.
         | 
         | This actually isn't the first time cracks have shown up in the
         | usage of softmax, and it makes me wonder whether a different
         | function might be better if we want context-length-flexible
         | LLMs.
        
         | bluecoconut wrote:
         | EDIT: the authors have updated the readme to add a clarified
         | FAQ section that directly addresses this:
         | https://github.com/mit-han-lab/streaming-llm#faq
         | 
         | Just tested it - this definitely doesn't seem to be giving
         | enhanced context length. It does run quickly, though; I can
         | confirm it was using about 35 GB of an A100's RAM and pinned
         | the usage for the entire duration.
         | 
         | I ran through it by getting a book from Project Gutenberg,
         | splitting it into paragraphs, and feeding them in paragraph by
         | paragraph (asking it to say "okay" after each paragraph), then
         | at the end asking some questions. It entirely hallucinated its
         | answers. (Also note: in the ~10 min of playing with this, I
         | couldn't get the base model (lmsys/vicuna-13b-v1.3) to respond
         | in English...)
         | 
         | https://gist.github.com/bluecoconut/9cae9e91fe3b1616ed650a96...
        
       | dheera wrote:
       | I feel like information theory prevents full information
       | retention for unlimited context lengths and finite compute, but I
       | don't know if we are at information theory limits to invoke this
       | argument. Or rather, I don't know how to make a good analysis of
       | (bits of context information) per (bits of model parameters).
        
       | foota wrote:
       | I could be wrong, but I'm not sure this is about what people seem
       | to think it is, e.g., letting LLMs reference content past the
       | trained length
       | 
       | I think it may just be about the performance of the model with
       | longer texts (on the things still within the context window?). It
       | sounds like they're arguing that the model is essentially
       | learning to stick some baggage in the attention to the initial
       | tokens of the text, and break when that isn't within the window
       | anymore for reasons I'm not sure I understand (after all, isn't
       | text in the middle just as good as text at the start for non
       | instruction inputs?)
        
       | arxiv_papers wrote:
       | https://youtu.be/hfJIOd2WCQ0
        
       | WhatsName wrote:
       | So can I let llama2 summarize books now, or are there any non-
       | obvious caveats to this approach?
        
         | Sharlin wrote:
         | No. This does nothing to the context length itself, which is
         | still a sliding window.
        
       | doctoboggan wrote:
       | How do any of these sliding-window techniques handle instructions
       | that aren't expected and only show up at the end? For example,
       | imagine feeding a book to the model and the last sentence being
       | the instruction "return the count of the letter m in the previous
       | input". A human would handle this by first letting out an
       | exasperated sigh, but then restarting the reading while counting.
       | An LLM has no ability to loop back and re-read the input. (Ignore
       | LLM issues with character counting for this example.) It seems
       | like to solve this problem for real the LLM needs to be able to
       | loop and jump arbitrarily, but I'm sure that would introduce a
       | whole new host of issues and possibly require a new architecture
       | altogether.
        
         | tornato7 wrote:
         | Is it so hard to ask the user to put instructions at the
         | beginning? Claude 100K asks users to put instructions at the
         | end.
         | 
         | Or you could just use a quick model to check if there are
         | instructions at the end and move them to the beginning.
        
         | IanCal wrote:
         | One option would be similar to function calling: give the LLM
         | an output it can emit that changes how the context is parsed.
         | That's a layer on top, rather than changing how the LLM itself
         | works.
        
         | omneity wrote:
         | Does an LLM need to loop back to re-read its input, even in a
         | regular (read non-sliding) context window?
         | 
         | Maybe I'm misunderstanding, but doesn't the hidden state solve
         | the "lookup" problem in this case? In the sense that the LLM
         | needs to ingest your entire input anyway before answering, so
         | whether your instruction is at the front or at the end has
         | little impact beyond attention.
        
           | doctoboggan wrote:
           | It's my understanding that in regular non-sliding-window
           | context models the LLM is able to pay attention to any part
           | of the input when generating the output. The attention head
           | is essentially able to jump back and forth to any point in
           | its context window. This is what differentiates the attention
           | mechanism from other models that use token proximity as a
           | proxy for relevance.
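           | 
           | A toy way to see the difference is to compare the causal
           | masks (illustrative, not from the paper; window width picked
           | arbitrarily):
           | 
           |     import torch
           | 
           |     T, window = 6, 3
           |     i = torch.arange(T)[:, None]
           |     j = torch.arange(T)[None, :]
           |     dense = j <= i                # see any earlier token
           |     sliding = dense & (i - j < window)  # recent only
           |     print(dense.int())
           |     print(sliding.int())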
        
         | alex_duf wrote:
         | The example seems like a weird edge case. I don't even know if
         | current models are capable of this in a short input.
        
           | refulgentis wrote:
           | I agree; even just tokenization screws you here, I'm 95%
           | sure. I.e., the raw input isn't letters but one of 100K
           | integers that each represent some set of letters.
           | 
           | That being said, probably a naive take, since we're seeing
           | them do so much. And I bet we could get it to count correctly
           | with at least some short input, and given infinite runs it's
           | probably trivial (i.e., for N characters, split into N
           | inputs, and for each one ask "say true if it is an M, false
           | otherwise").
        
             | doctoboggan wrote:
             | I understand that, which is why I said "Ignore LLM issues
             | with character counting for this example". It was a quick
             | example; please see my other comment for a better one.
        
               | refulgentis wrote:
               | I see; active listening + relating it to my knowledge on
               | my end, lmk if I compressed too much:
               | 
               | you're curious whether there's noticeably worse performance
               | if the Q is at the end of the content rather than before it
               | 
               | No, there's a good paper on this somewhere with Claude
               | 100K; tl;dr it's sort of bow-shaped: beginning and end had
               | equally high rates, but the middle would suffer
        
               | doctoboggan wrote:
               | No, what I am specifically asking about is these sliding
               | window attention techniques. As far as I understand it
               | Claude 100K actually uses a 100k context window, and not
               | a sliding window.
        
           | doctoboggan wrote:
           | Ignore the specific example of counting characters; I was
           | just quickly coming up with a situation where the instruction
           | is at the end of the input. Here is a better example:
           | 
           | Input the full text of a novel, then ask for a minor detail
           | (e.g. the color of a car that is briefly mentioned in the
           | middle of the book). Again, a human can do this by flipping
           | back to the relevant section, but LLMs have no mechanism for
           | this when using a sliding-window attention scheme.
           | 
           | If the full input can fit in the context window, then any LLM
           | today would be able to extract the color of the car.
        
         | [deleted]
        
       | refulgentis wrote:
       | This looks fantastic. It also answers the relevance of the "off-
       | by-one" softmax*
       | 
       | My naive question is...does it work? But that sounds dismissive.
       | At length:
       | 
       | It shows that the baseline model can't respond after a certain
       | length, versus the proposed model, which does continue to
       | respond.
       | 
       | But can a model that continues to respond retrieve information
       | far "in the past"?
       | 
       | The demo video is too low-level, at least to my brain. It shows
       | one model stops responding but the proposed one continues.
       | 
       | I spent about 5 minutes going frame by frame to see if the
       | proposed model attempts to "recall" information from further
       | back, but it looks like no.
       | 
       | Perfection here isn't necessary or even possible AFAIK, i.e. I
       | don't expect it to recall page 1 100% accurately at page 1000.
       | But can it recall _anything_ from it, even if it ignores it?
       | 
       | The great thing about this era and work is we can check. But I
       | hope someone has it up in a HuggingFace space before I figure out
       | how to run it myself. :P
       | 
       | I'm leaning no, based on the sliding-window thing. It sounds like
       | there are 4 fixed tokens, then the last (context size - 4)
       | tokens, and that's it.
       | 
       | * At the time, there were two camps: one, it's some random person
       | saying it and there's prior art on implementations that do the
       | off-by-one. Two, you'd be surprised how often little things go
       | unnoticed by large groups, and do matter.
        
       | orangecoconut1 wrote:
       | [flagged]
        
       | smeeth wrote:
       | Adding attention cache memory is an extremely interesting
       | solution to this problem.
       | 
       | If anyone is curious, there was another paper [0] that came out a
       | few days ago that made a related observation in Vision
       | Transformers. Transformer models appear to pick tokens to store
       | global information in - they need tokens to "think". You can eke
       | out some performance improvements (and cool explanation images)
       | by providing the model with specific tokens for this purpose.
       | 
       | [0] https://arxiv.org/pdf/2309.16588.pdf
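       | 
       | The "register" idea boils down to a few extra learnable tokens
       | appended to the patch sequence and discarded before the output
       | head. A hedged sketch (names illustrative):
       | 
       |     import torch
       | 
       |     n_reg, d = 4, 768
       |     registers = torch.nn.Parameter(torch.zeros(1, n_reg, d))
       | 
       |     def add_registers(patches):     # (batch, n_patches, d)
       |         r = registers.expand(patches.size(0), -1, -1)
       |         return torch.cat([patches, r], dim=1)
       |     # slice the registers back off after the final block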
        
         | Nevermark wrote:
         | It would be an interesting place to add additional units to an
         | already-trained model, to continue training and get better
         | performance, or for fine-tuning.
         | 
         | For tuning, keep the original model parameters fixed, and only
         | let the model adjust parameters to and from new "tuning" cache
         | units.
         | 
         | This would allow different tuning unit sets to be swapped in,
         | or even used together. Foul language avoidance units + specific
         | terminology units + be concise units, etc.
         | 
         | Mix and match tuned unit sets, like super prompts.
         | 
         | --
         | 
         | If the number of new parameters is low enough, higher-order
         | optimization (which requires more memory) might be a
         | possibility for very fast and effective tuning.
         | 
         | --
         | 
         | And maybe grow the sequence length, and number of units, during
         | training. A few units for short sequences. Then increase
         | training sequence length, add more units, continue training,
         | and so on.
         | 
         | Perhaps some kind of performance or gradient analysis could
         | govern cache expansion, so an arbitrary schedule is not
         | required.
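         | 
         | Something in the spirit of this already exists as prefix /
         | prompt tuning; a hypothetical sketch of the frozen-base idea:
         | 
         |     import torch
         | 
         |     def make_tuning_unit(base, n_units=16, d=4096):
         |         # freeze every original parameter...
         |         for p in base.parameters():
         |             p.requires_grad_(False)
         |         # ...and train only a small set of new tokens
         |         return torch.nn.Parameter(
         |             torch.randn(1, n_units, d) * 0.02)
         | 
         | Only the returned tensor gets gradients, so different units
         | could be trained and stored independently, then swapped in.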
        
       ___________________________________________________________________
       (page generated 2023-10-02 23:00 UTC)