[HN Gopher] New technology could blow away GPT-4 and everything ...
       ___________________________________________________________________
        
       New technology could blow away GPT-4 and everything like it
        
       Author : andy_threos_io
       Score  : 60 points
       Date   : 2023-04-21 16:54 UTC (6 hours ago)
        
 (HTM) web link (www.zdnet.com)
 (TXT) w3m dump (www.zdnet.com)
        
       | barbariangrunge wrote:
       | > At 64,000 tokens, the authors relate, "Hyena speed-ups reach
       | 100x" -- a one-hundred-fold performance improvement.
       | 
       | That's quite the difference
        
          | sottol wrote:
          | Classic attention is quadratic in context length, and the
          | faster alternatives so far don't seem to perform as well. I
          | wonder how Hyena compares to linear attention algorithms.
        
       | PaulHoule wrote:
       | I had so much fun with CNN models just before BERT hit it big. It
       | would be nice to see them make a comeback.
        
       | saurabh20n wrote:
        | Notes from a quick read of the paper at
        | https://arxiv.org/abs/2302.10866. The popsci title is
        | overreaching; this is a drop-in subquadratic replacement for
        | attention. Could be promising, but it remains to be seen whether
        | it is adopted in practice. skybrian
        | (https://news.ycombinator.com/item?id=35657983) points out a new
        | blog post by the authors, and the previous discussion of the
        | older (March 28th) blog post. Takeaways:
       | 
        | * In standard attention in transformers, cost scales
        | quadratically with the length of the sequence, which restricts
        | model context. This work presents a subquadratic exact operator,
        | allowing models to scale to larger contexts (100k+).
       | 
        | * They introduce an operator called the "Hyena hierarchy", a
        | recurrence over 2 subquadratic operations: long convolution, and
        | element-wise multiplicative gating. Sec 3.1-3.3 define the
        | recurrences, matrices, and filters. This is, importantly, a
        | drop-in replacement for attention.
       | 
        | * Longer context: 100x speedup over FlashAttention at 64k
        | context (if we view FlashAttention as a non-approximate
        | engineering optimization, then this work improves
        | algorithmically, and gains an order of magnitude over that).
        | Associative recall, i.e., just pulling data back out, shows
        | improvements: experiments on 137k context, and vocab sizes of
        | 10-40 (unsure why they have bad recall on short sequences with a
        | larger vocab, but they still outperform others).
       | 
        | * Comparisons (on relatively small models, but hoping to show a
        | pattern) with RWKV (an attention-free model, trained on 332B
        | tokens) and GPTNeo (trained on 300B tokens), with Hyena trained
        | on 137B tokens. Models are 125M-355M parameters. (Section 4.3)
       | 
        | * On SuperGLUE, zero-shot and 3-shot accuracy is ballpark
        | similar to GPTNeo (although technically they underperform a bit
        | zero-shot and overperform a bit 3-shot). (Tables 4.5 and 4.6)
       | 
        | * Because they can support large (e.g., 100k+) contexts, they
        | can do image classification. They report ballpark-comparable
        | results against others. (Table 4.7)
       | 
       | Might have misread some takeaways; happy to be corrected.
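The "long convolution + element-wise gating" recurrence in the takeaways above can be sketched with FFT-based convolution, which is what makes each step subquadratic (O(n log n) rather than O(n^2)). This is a toy illustration under my own assumptions, not the paper's implementation: real Hyena parameterizes its filters implicitly and derives the gates from learned projections of the input.

```python
import numpy as np

def fft_long_conv(x, h):
    # Long convolution via FFT: O(n log n) instead of the O(n^2) cost
    # of applying a length-n filter directly.
    n = x.shape[0]
    fft_len = 2 * n  # zero-pad so circular convolution acts like linear
    X = np.fft.rfft(x, n=fft_len, axis=0)
    H = np.fft.rfft(h, n=fft_len, axis=0)
    return np.fft.irfft(X * H, n=fft_len, axis=0)[:n]

def hyena_like_block(u, filters, gates):
    # One recurrence: alternate long convolution with element-wise
    # multiplicative gating.
    x = u
    for h, g in zip(filters, gates):
        x = g * fft_long_conv(x, h)
    return x

rng = np.random.default_rng(1)
n, d, order = 256, 8, 2
u = rng.standard_normal((n, d))
filters = [0.1 * rng.standard_normal((n, d)) for _ in range(order)]
gates = [rng.standard_normal((n, d)) for _ in range(order)]
y = hyena_like_block(u, filters, gates)
print(y.shape)  # (256, 8)
```

Both primitives scale in O(n log n) or better, which is where the claimed speedups at 64k+ context come from.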
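The associative recall benchmark mentioned above is easy to picture with a toy generator (the exact token layout is my guess at the setup, not the paper's format): interleave key/value pairs, append a query key, and ask the model for the matching value.

```python
import random

def make_recall_example(vocab_size=20, n_pairs=8, seed=0):
    # Toy associative recall: emit (key, value) pairs, then a query key;
    # the target is that key's value.
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), n_pairs)  # unique keys
    pairs = [(k, rng.randrange(vocab_size)) for k in keys]
    query, target = rng.choice(pairs)
    seq = [tok for pair in pairs for tok in pair] + [query]
    return seq, target

def exact_lookup(seq):
    # Oracle baseline: rebuild the key -> value map and answer the query.
    *pair_toks, query = seq
    mapping = dict(zip(pair_toks[::2], pair_toks[1::2]))
    return mapping[query]

seq, target = make_recall_example()
assert exact_lookup(seq) == target
```

A dictionary solves it trivially; the benchmark measures whether a sequence model can do the same lookup purely from its context, which gets hard at 137k tokens.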
        
        | sharemywin wrote:
        | I didn't see anything in the article about what the scaling
        | factor is. Less than P^2, but what is it?
        
         | bckr wrote:
          | The paper has a "preliminary scaling law" diagram. The shape
          | of the graph is the same, but with 20% fewer FLOPs.
         | 
         | The real breakthrough is that Hyena apparently has an unlimited
         | context window.
        
           | flangola7 wrote:
           | >The real breakthrough is that Hyena apparently has an
           | unlimited context window.
           | 
            | It's extrapolated volition time
        
             | te0006 wrote:
             | Is it? Removing the context window limit is big, no doubt,
             | but inference still takes time (and compute).
        
               | bckr wrote:
               | I think GP is talking about the ability of an AI to make
               | decisions with reference to context from the past and
               | therefore have a "will extended over time"
        
          | choeger wrote:
          | Presumably still O(n^2) in theory, but not for practical
          | cases.
          | 
          | I think that anything replacing attention will suffer
          | quadratic growth for some pathological examples.
          | 
          | Maybe if we had a better understanding of the data we could
          | give a better definition (much like graph complexity is
          | usually given in the actual number of edges, which is
          | theoretically O(n^2)).
        
       | galaxytachyon wrote:
        | How well does it scale? And will it still retain the emergent
        | capabilities of the huge transformer LLMs?
        | 
        | Isn't this basically the bitter lesson again? Small, clever
        | improvements work now, but in the long term they won't give the
        | same impressive results as scaling?
        
         | coldtea wrote:
         | So? Would you rather we didn't make small improvements?
         | 
         | If we could just make big improvements we would.
        
       | skybrian wrote:
       | Blog post:
       | https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
       | 
       | Previous discussion:
       | https://news.ycombinator.com/item?id=35502187
        
       ___________________________________________________________________
       (page generated 2023-04-21 23:03 UTC)