[HN Gopher] New technology could blow away GPT-4 and everything ...
___________________________________________________________________
New technology could blow away GPT-4 and everything like it
Author : andy_threos_io
Score : 60 points
Date : 2023-04-21 16:54 UTC (6 hours ago)
(HTM) web link (www.zdnet.com)
(TXT) w3m dump (www.zdnet.com)
| barbariangrunge wrote:
| > At 64,000 tokens, the authors relate, "Hyena speed-ups reach
| 100x" -- a one-hundred-fold performance improvement.
|
| That's quite the difference
| sottol wrote:
| Classic attention is quadratic in context length, and the faster
| alternatives so far don't seem to perform as well. I wonder how
| Hyena compares to linear attention algorithms.
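|
| For concreteness, here's a toy NumPy sketch (mine, not from any
| of these papers) of where the quadratic cost comes from, and how
| linear attention reorders the computation so the L x L matrix is
| never materialized:
|
|     import numpy as np
|
|     L, d = 4096, 64                  # sequence length, head dim
|     Q = np.random.randn(L, d)
|     K = np.random.randn(L, d)
|     V = np.random.randn(L, d)
|
|     # Standard attention: the L x L score matrix is the
|     # quadratic part: O(L^2 * d) time, O(L^2) memory.
|     scores = Q @ K.T / np.sqrt(d)
|     weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
|     weights /= weights.sum(axis=-1, keepdims=True)
|     out_std = weights @ V
|
|     # Linear attention (kernelized, non-causal): reassociate the
|     # product so the cost is O(L * d^2) instead of O(L^2 * d).
|     phi = lambda x: np.maximum(x, 0.0) + 1e-6  # toy feature map
|     KV = phi(K).T @ V                # d x d
|     Z = phi(K).sum(axis=0)           # d
|     out_lin = (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]
|
| The two outputs are not equal (it's a different operator, which
| is also why these variants can lose quality); the point is just
| where the L^2 goes.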
| PaulHoule wrote:
| I had so much fun with CNN models just before BERT hit it big. It
| would be nice to see them make a comeback.
| saurabh20n wrote:
| Notes from a quick read of the paper at
| https://arxiv.org/abs/2302.10866. The pop-sci title is
| overreaching; this is a drop-in subquadratic replacement for
| attention. Could be promising, but it remains to be seen whether
| it gets adopted in practice. skybrian
| (https://news.ycombinator.com/item?id=35657983) points out a new
| blog post by the authors, and the previous discussion of the
| older (March 28th) blog post. Takeaways:
|
| * In standard transformer attention, cost scales quadratically
| with sequence length, which restricts model context. This work
| presents a subquadratic, exact operator that allows scaling to
| larger contexts (100k+).
|
| * They introduce an operator called the "Hyena hierarchy": a
| recurrence over two subquadratic operations, long convolution
| and element-wise multiplicative gating. Sec 3.1-3.3 define the
| recurrences, matrices, and filters. This is, importantly, a
| drop-in replacement for attention (see the toy sketch at the end
| of these notes).
|
| * Longer context: 100x speedup over FlashAttention at 64k
| context (if we view FlashAttention as a non-approximate
| engineering optimization, then this work improves things
| algorithmically and gains orders of magnitude on top of that).
| Associative recall, i.e., just pulling stored data back out,
| shows improvements: experiments at 137k context with vocab sizes
| of 10-40 (unsure why they have worse recall on short sequences
| with the larger vocabs, but they still outperform the others).
|
| * Comparisons (on relatively small models, but hoping to show a
| pattern) with RWKV (an attention-free model, trained on 332B
| tokens) and GPT-Neo (trained on 300B tokens), with Hyena trained
| on 137B tokens. Models are 125M-355M parameters. (Section 4.3)
|
| * On SuperGLUE, zero-shot and 3-shot accuracy is in the same
| ballpark as GPT-Neo (although technically they underperform a
| bit zero-shot and overperform a bit 3-shot). (Tables 4.5 and
| 4.6)
|
| * Because they can support large (e.g., 100k+) contexts, they
| can do image classification by treating the image as a long
| sequence. They report ballpark-comparable accuracy to the
| others. (Table 4.7)
|
| Might have misread some takeaways; happy to be corrected.
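|
| For intuition, here is a heavily simplified single-channel
| sketch of an order-2 Hyena-style block (my own toy code, not the
| authors' implementation; real Hyena uses multi-channel,
| implicitly parameterized filters): element-wise gating plus a
| long convolution evaluated with FFTs, which is where the
| subquadratic cost comes from.
|
|     import numpy as np
|
|     def fft_long_conv(h, x):
|         # Causal 1-D convolution via FFT: O(L log L) even when
|         # the filter h is as long as the sequence.
|         L = x.shape[0]
|         n = 2 * L               # zero-pad to avoid circular wrap
|         y = np.fft.irfft(np.fft.rfft(h, n) * np.fft.rfft(x, n), n)
|         return y[:L]
|
|     def hyena_order2(u, Wv, W1, W2, h1, h2):
|         # y = x2 * (h2 conv (x1 * (h1 conv v))), where v, x1, x2
|         # are per-token linear projections of the input u.
|         v, x1, x2 = u @ Wv, u @ W1, u @ W2
|         z = x1 * fft_long_conv(h1, v)       # long conv + gate
|         return x2 * fft_long_conv(h2, z)    # second conv + gate
|
|     L, d = 8192, 16
|     u = np.random.randn(L, d)
|     Wv, W1, W2 = [np.random.randn(d) / np.sqrt(d) for _ in range(3)]
|     h1 = np.random.randn(L) / L             # stand-in long filters
|     h2 = np.random.randn(L) / L
|     y = hyena_order2(u, Wv, W1, W2, h1, h2) # shape (L,)
|
| No L x L matrix is ever formed; the heavy steps are FFTs and
| element-wise products.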
| sharemywin wrote:
| I didn't see anything in the article about what the scaling
| factor was. Less than P^2, but what was it?
| bckr wrote:
| The paper has a "preliminary scaling law" diagram. The shape of
| the curve is the same as for Transformers, but with 20% fewer
| FLOPS.
|
| The real breakthrough is that Hyena apparently has an unlimited
| context window.
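|
| For a rough sense of the scaling-factor question above: the
| attention score matrix costs on the order of L^2 * d, while an
| FFT long convolution is on the order of L log L per channel. A
| hand-wavy growth-rate comparison (my constants, not the
| paper's):
|
|     import math
|
|     d = 64                             # assumed head/model width
|     for L in (2_048, 8_192, 65_536):
|         quadratic = L * L * d          # attention-style L^2 term
|         subquad = L * math.log2(L) * d # L log L term
|         print(f"L={L:6d}  ratio ~ {quadratic / subquad:6.0f}x")
|
| Real kernels have very different constants (FlashAttention is
| extremely well optimized), which is why the measured gap at 64k
| is ~100x rather than what this toy arithmetic suggests.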
| flangola7 wrote:
| >The real breakthrough is that Hyena apparently has an
| unlimited context window.
|
| It's extrapolated volition time.
| te0006 wrote:
| Is it? Removing the context window limit is big, no doubt,
| but inference still takes time (and compute).
| bckr wrote:
| I think GP is talking about the ability of an AI to make
| decisions with reference to context from the past and
| therefore have a "will extended over time"
| choeger wrote:
| presumably still O(n2) in theory, but not for practical cases.
|
| I think that anything reolacing attention will suffer quadratic
| growth for some pathological examples.
|
| maybe if we have a better understanding of the data we could
| give a better definition (much like graph complexity is usually
| given in the actual number of edges, which are theoretically
| O(n2).)
| galaxytachyon wrote:
| How good is it at scaling? And will it still retain the emergent
| capabilities of the huge transformer LLMs?
|
| Isn't this basically the bitter lesson again? Clever small
| improvements work for now, but in the long term they won't give
| the same impressive results as simply scaling up?
| coldtea wrote:
| So? Would you rather we didn't make small improvements?
|
| If we could just make big improvements we would.
| skybrian wrote:
| Blog post:
| https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
|
| Previous discussion:
| https://news.ycombinator.com/item?id=35502187
___________________________________________________________________
(page generated 2023-04-21 23:03 UTC)