[HN Gopher] Multi-Token Attention
___________________________________________________________________
Multi-Token Attention
Author : fzliu
Score : 12 points
Date : 2025-04-02 22:20 UTC (39 minutes ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| bigdict wrote:
| Sure, you can get better model performance by throwing more
| compute at the problem in different places. Does it improve
| perf on an iso-FLOP basis?
| jwilber wrote:
| Achieved by "applying convolution operations over queries, keys
| and heads, allowing nearby queries and keys to affect each
| other's attention weights for more precise attention"
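|
| The key-query convolution is easy to picture as a small learned
| 2D filter sliding over the pre-softmax attention-logit matrix. A
| minimal PyTorch sketch of that idea (kernel size, names, and
| shapes here are assumptions, and the paper's causal masking of
| the convolution is omitted):
|
|   import torch
|   import torch.nn.functional as F
|
|   def mta_scores(q, k, conv_kernel):
|       # q, k: (heads, seq, dim); conv_kernel: (heads, 1, 3, 3)
|       heads = q.shape[0]
|       logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
|       # Depthwise conv over the (query, key) plane, one kernel
|       # per head, so nearby positions mix their attention logits.
|       logits = F.conv2d(logits.unsqueeze(0), conv_kernel,
|                         padding="same", groups=heads)
|       return F.softmax(logits.squeeze(0), dim=-1)
|
|   q = torch.randn(4, 16, 32)
|   k = torch.randn(4, 16, 32)
|   attn = mta_scores(q, k, torch.randn(4, 1, 3, 3))  # (4, 16, 16)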
|
| Cool to see convolutions making such a comeback lately in the LLM
| world. See also the recent StripedHyena 2 architecture, which
| uses the conv-based Hyena operator to great success (a sketch of
| its core primitive follows the link):
|
| https://arxiv.org/abs/2503.01868
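|
| At its core, the Hyena operator swaps attention for very long,
| implicitly parameterized convolutions evaluated with FFTs. A
| minimal sketch of just that primitive (the gating and implicit
| filter parameterization are omitted; names and shapes are
| assumptions):
|
|   import torch
|
|   def long_conv(u, h):
|       # u: (seq, dim) inputs; h: (seq, dim) causal filter taps.
|       # Zero-pad to 2*seq so circular FFT conv equals linear conv.
|       L = u.shape[0]
|       Uf = torch.fft.rfft(u, n=2 * L, dim=0)
|       Hf = torch.fft.rfft(h, n=2 * L, dim=0)
|       y = torch.fft.irfft(Uf * Hf, n=2 * L, dim=0)
|       return y[:L]  # keep the causal part
|
|   u, h = torch.randn(1024, 64), torch.randn(1024, 64)
|   y = long_conv(u, h)  # (1024, 64), O(L log L) per channel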
___________________________________________________________________
(page generated 2025-04-02 23:00 UTC)