[HN Gopher] StreamingLLM: tiny tweak to KV LRU improves long con...
       ___________________________________________________________________
        
       StreamingLLM: tiny tweak to KV LRU improves long conversations
        
       Author : lucasluitjes
       Score  : 84 points
       Date   : 2024-02-13 08:47 UTC (14 hours ago)
        
 (HTM) web link (news.mit.edu)
 (TXT) w3m dump (news.mit.edu)
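
        The linked article covers StreamingLLM (Xiao et al.), whose core
        trick is to pin the first few "attention sink" tokens in the KV
        cache and otherwise keep only a sliding window of recent tokens,
        rather than evicting purely by recency. A toy sketch of that
        eviction policy in Python; the sink count and window size are
        illustrative defaults, not the paper's exact configuration:

            from collections import deque

            class SinkKVCache:
                """Pin the first `n_sink` entries, keep a rolling window
                of the most recent `window` entries, evict the middle."""

                def __init__(self, n_sink=4, window=1024):
                    self.n_sink = n_sink
                    self.sinks = []                     # never evicted
                    self.recent = deque(maxlen=window)  # oldest non-sink entry falls out

                def append(self, kv_entry):
                    if len(self.sinks) < self.n_sink:
                        self.sinks.append(kv_entry)
                    else:
                        self.recent.append(kv_entry)

                def entries(self):
                    return self.sinks + list(self.recent)

        (Per the paper, positions are also re-assigned relative to the
        rolling cache rather than kept from the original sequence.)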
        
       | popinman322 wrote:
       | Previous discussion, on a link to the implementation:
       | https://news.ycombinator.com/item?id=37740932
        
       | Translationaut wrote:
        | This seems to work only because large GPTs have redundant,
        | under-complex attention heads. See this issue in the BertViz
        | repo about attention in Llama:
        | https://github.com/jessevig/bertviz/issues/128
        
       | gremlinsinc wrote:
        | I wonder if it could make sense to have break-away bots: at 10k
        | tokens a new one launches with the first 2k tokens, the last 1k,
        | and a table of contents, so that when you go back to something
        | you're handed off to a model where that data is more strongly
        | reinforced, or something like that. Sort of like mixture of
        | experts, except each one is only an expert on individual
        | snippets of a long conversational thread.
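
        A rough sketch of the hand-off described above, with the 10k/2k/1k
        thresholds taken from the comment; the token counter and the
        table-of-contents summarizer are stand-ins, not a real tokenizer
        or model call:

            def n_tokens(text):
                return len(text.split())      # crude stand-in for a real tokenizer

            def take_tokens(messages, budget):
                """Take messages from the front until ~`budget` tokens are used."""
                out, used = [], 0
                for m in messages:
                    if used + n_tokens(m) > budget:
                        break
                    out.append(m)
                    used += n_tokens(m)
                return out

            def hand_off(messages, summarize, threshold=10_000):
                """Once the thread exceeds `threshold` tokens, build a fresh
                context: the first ~2k tokens, a table of contents of the
                middle, and the last ~1k tokens."""
                if sum(n_tokens(m) for m in messages) < threshold:
                    return messages           # keep talking to the current bot
                head = take_tokens(messages, 2_000)
                tail = list(reversed(take_tokens(list(reversed(messages)), 1_000)))
                middle = messages[len(head):len(messages) - len(tail)]
                toc = summarize(middle)       # assumed helper: LLM-written table of contents
                return head + [toc] + tail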
        
         | joshspankit wrote:
         | You're right: A lot of the conversation can be condensed,
         | especially if there are enough cues for the AI to arrive in the
         | same "neuronal neighborhood" as the previous conversation.
        
         | kgeist wrote:
         | Here they simply used different models for different turns and
         | apparently it gave more "engaging" results:
         | 
         | https://arxiv.org/abs/2401.02994
        
       | TrueDuality wrote:
       | There was a really interesting post a while ago about adjusting
       | the softmax function to allow attention heads to not make a
       | choice (https://www.evanmiller.org/attention-is-off-by-one.html).
       | It seems like that might remove the need for these attention
       | sinks entirely. I keep meaning to go in and perform tests on this
       | but boy time gets away from you...
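
        For reference, the tweak in that post replaces the softmax
        denominator's sum with 1 + the sum, i.e.
        softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), so a head can
        put nearly all of its weight on "nothing". A small PyTorch sketch
        (the function name and the stability shift are mine):

            import torch

            def softmax_one(x, dim=-1):
                """exp(x_i) / (1 + sum_j exp(x_j)): the extra 1 acts like an
                always-present zero logit, letting a head attend to nothing."""
                # Shift by max(x, 0) so the implicit zero logit is covered
                # by the usual numerical-stability trick.
                m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
                e = torch.exp(x - m)
                return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))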
        
         | magicalhippo wrote:
         | Interesting! HN discussion of it here:
         | https://news.ycombinator.com/item?id=36851494
        
         | zorgmonkey wrote:
          | Feel free to mess with it; his tweak to softmax was actually
          | supported by PyTorch before the article was written, just off
          | by default. Maybe it needs to be more widely used, though;
          | after all, good ideas are often independently discovered
          | multiple times. Details are in this tweet:
          | https://twitter.com/SamuelMullr/status/1683582347793530884
          | Or, if you don't like Twitter, the option is add_zero_attn on
          | PyTorch's MultiheadAttention.
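
          For the PyTorch route, that flag is a constructor argument on
          torch.nn.MultiheadAttention; it appends an all-zero key/value
          slot, which contributes a constant zero logit (much like the +1
          in the denominator above) and nothing to the output. A minimal
          usage sketch (dimensions are arbitrary):

              import torch
              from torch import nn

              mha = nn.MultiheadAttention(embed_dim=64, num_heads=4,
                                          add_zero_attn=True, batch_first=True)

              x = torch.randn(2, 10, 64)     # (batch, sequence, embedding)
              out, _ = mha(x, x, x)          # heads may attend to the zero slot
              print(out.shape)               # torch.Size([2, 10, 64])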
        
       ___________________________________________________________________
       (page generated 2024-02-13 23:01 UTC)