[HN Gopher] StreamingLLM: tiny tweak to KV LRU improves long conversations
___________________________________________________________________
StreamingLLM: tiny tweak to KV LRU improves long conversations
Author : lucasluitjes
Score : 84 points
Date : 2024-02-13 08:47 UTC (14 hours ago)
(HTM) web link (news.mit.edu)
(TXT) w3m dump (news.mit.edu)
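 The article's change to the usual evict-oldest ("LRU") KV-cache policy is
 to pin the first few tokens as "attention sinks" and evict from the middle
 of the cache instead. A minimal sketch of that eviction rule, in Python
 (the sink count, window size, and names are made up for illustration, not
 the reference implementation):

     def evict(kv_cache, n_sinks=4, window=1020):
         # kv_cache: list of per-token (key, value) entries, oldest first.
         # Plain evict-oldest would drop kv_cache[0]; instead, always keep
         # the first n_sinks "attention sink" tokens plus a sliding window
         # of the most recent tokens.
         if len(kv_cache) <= n_sinks + window:
             return kv_cache
         return kv_cache[:n_sinks] + kv_cache[-window:]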
| popinman322 wrote:
| Previous discussion, on a link to the implementation:
| https://news.ycombinator.com/item?id=37740932
| Translationaut wrote:
| This seems to work only because large GPTs have redundant,
| under-complex attention heads. See this issue in BertViz about
| attention in Llama:
| https://github.com/jessevig/bertviz/issues/128
| gremlinsinc wrote:
| I wonder if it could make sense to have break-away bots: at 10k
| tokens, a new one launches with the first 2k tokens, the last 1k,
| and a table of contents, so that when you go back to something
| you're handed off to a model where that data is more strongly
| reinforced. Sort of like mixture of experts, except each one is
| only an expert on individual snippets of a long conversational
| thread.
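| A tiny sketch of that handoff (the token budgets and the
| summarize() helper are hypothetical, just to make the idea
| concrete):
|
|     def spawn_breakaway(tokens, summarize, limit=10_000):
|         # Once the thread passes `limit` tokens, hand off to a fresh
|         # model seeded with the first 2k tokens, a table of contents
|         # of the middle, and the last 1k tokens.
|         if len(tokens) <= limit:
|             return tokens
|         head, middle, tail = tokens[:2000], tokens[2000:-1000], tokens[-1000:]
|         return head + summarize(middle) + tail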
| joshspankit wrote:
| You're right: A lot of the conversation can be condensed,
| especially if there are enough cues for the AI to arrive in the
| same "neuronal neighborhood" as the previous conversation.
| kgeist wrote:
| Here they simply used different models for different turns and
| apparently it gave more "engaging" results:
|
| https://arxiv.org/abs/2401.02994
| TrueDuality wrote:
| There was a really interesting post a while ago about adjusting
| the softmax function to allow attention heads to not make a
| choice (https://www.evanmiller.org/attention-is-off-by-one.html).
| It seems like that might remove the need for these attention
| sinks entirely. I keep meaning to go in and perform tests on this
| but boy time gets away from you...
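| A minimal sketch of the adjusted softmax from that post (often
| written softmax_1), assuming its formulation
| exp(x_i) / (1 + sum_j exp(x_j)):
|
|     import torch
|
|     def softmax_one(x, dim=-1):
|         # Off-by-one softmax: the extra 1 in the denominator lets a
|         # head assign near-zero weight everywhere instead of being
|         # forced to spread probability over the keys.
|         # Shift logits for numerical stability; the implicit zero
|         # logit behind the "+1" shifts too, becoming exp(-m).
|         m = torch.clamp(x.max(dim=dim, keepdim=True).values, min=0.0)
|         ex = torch.exp(x - m)
|         return ex / (torch.exp(-m) + ex.sum(dim=dim, keepdim=True))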
| magicalhippo wrote:
| Interesting! HN discussion of it here:
| https://news.ycombinator.com/item?id=36851494
| zorgmonkey wrote:
| Feel free to mess with it; his tweak to softmax was actually
| supported by PyTorch before the article was written, but it's off
| by default. Maybe it just needs to be more widely used; after all,
| good ideas are often independently discovered multiple times.
| Details are in this tweet:
| https://twitter.com/SamuelMullr/status/1683582347793530884
| or, if you don't like Twitter, the option is add_zero_attn for
| PyTorch's MultiheadAttention.
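| A quick usage sketch (sizes made up): add_zero_attn=True appends an
| all-zero key/value slot, so each query can put its weight on
| "nothing", much like the off-by-one denominator above.
|
|     import torch
|     import torch.nn as nn
|
|     attn = nn.MultiheadAttention(embed_dim=64, num_heads=4,
|                                  add_zero_attn=True, batch_first=True)
|     x = torch.randn(2, 10, 64)     # (batch, seq, embed)
|     out, weights = attn(x, x, x)   # self-attention
|     print(weights.shape)           # torch.Size([2, 10, 11]): extra zero slot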
___________________________________________________________________
(page generated 2024-02-13 23:01 UTC)