[HN Gopher] Multi-Token Attention
       ___________________________________________________________________
        
       Multi-Token Attention
        
       Author : fzliu
       Score  : 12 points
       Date   : 2025-04-02 22:20 UTC (39 minutes ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | bigdict wrote:
       | Sure, you can get better model performance by throwing more
       | compute at the problem in different places. Does it improve
       | perf on an isoflop basis?
        
       | jwilber wrote:
       | Achieved by "applying convolution operations over queries, keys
       | and heads, allowing nearby queries and keys to affect each
       | other's attention weights for more precise attention"
       | 
       | Cool to see convolutions making such a comeback lately in the
       | LLM world. See also the recent StripedHyena 2 architecture,
       | which uses the conv-based Hyena operator to great success:
       | 
       | https://arxiv.org/abs/2503.01868
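        
       A minimal sketch of the key-query convolution the comment above
       quotes, in PyTorch. The module name, parameters, and masking
       scheme are assumptions for illustration only; the paper's actual
       method also convolves across heads and its implementation may
       differ.
        
           import math
           import torch
           import torch.nn as nn
           import torch.nn.functional as F
        
           class KeyQueryConvAttention(nn.Module):
               """Self-attention whose pre-softmax logits pass through a
               small depthwise 2D convolution over the (query, key)
               plane, letting nearby queries and keys influence each
               other's attention weights."""
               def __init__(self, dim, n_heads, kernel=3):
                   super().__init__()
                   assert dim % n_heads == 0
                   self.n_heads, self.head_dim = n_heads, dim // n_heads
                   self.qkv = nn.Linear(dim, 3 * dim)
                   self.proj = nn.Linear(dim, dim)
                   # groups=n_heads -> one independent kernel per head
                   self.conv = nn.Conv2d(n_heads, n_heads, kernel,
                                         padding=kernel // 2,
                                         groups=n_heads)
        
               def forward(self, x):
                   b, t, _ = x.shape
                   q, k, v = self.qkv(x).chunk(3, dim=-1)
                   q, k, v = (z.view(b, t, self.n_heads, self.head_dim)
                               .transpose(1, 2) for z in (q, k, v))
                   scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
                   causal = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                                  device=x.device), 1)
                   # Zero (not -inf) future logits before the conv so the
                   # kernel neither propagates infinities nor leaks
                   # future-key logits into past positions.
                   scores = scores.masked_fill(causal, 0.0)
                   scores = self.conv(scores)
                   # Re-mask after the conv to keep the softmax causal.
                   scores = scores.masked_fill(causal, float("-inf"))
                   out = F.softmax(scores, dim=-1) @ v
                   return self.proj(out.transpose(1, 2).reshape(b, t, -1))
        
           # Usage: KeyQueryConvAttention(512, 8)(torch.randn(2, 16, 512))
        
       Zeroing future logits before the convolution keeps -inf out of the
       kernel's receptive field; re-masking afterwards restores causality
       before the softmax.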
        
       ___________________________________________________________________
       (page generated 2025-04-02 23:00 UTC)