[HN Gopher] Were RNNs all we needed?
___________________________________________________________________
Were RNNs all we needed?
Author : beefman
Score : 212 points
Date : 2024-10-03 17:31 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| tehsauce wrote:
| I haven't gone through the paper in detail yet but maybe someone
| can answer. If you remove the hidden state from an RNN as they
| say they've done, what's left? An MLP predicting from a single
| token?
| statusfailed wrote:
| I only had a quick look, but it looks like they tweaked the
| state update so the model can be run with parallel scan instead
| of having to do it sequentially.
| jfcoa wrote:
| It doesn't completely remove it, it removes certain
| dependencies on it so that it can be computed by parallel scan,
| there is still a hidden state. It bears some similarity to what
| was done with Mamba.
| bunderbunder wrote:
| They didn't remove the hidden state entirely, they just removed
| it from the input, forget and update gates. I haven't digested
| the paper either, but I think that in the case of a GRU this
| means that the hidden state update masking (z_t and r_t in the
| paper's formulas) only depends on the new input, not the input
| plus the prior hidden state.
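| Schematically, the difference is something like this (my
| paraphrase, weight names made up, biases omitted):
|
|       import torch
|
|       d_in, d_h = 8, 16
|       x_t = torch.randn(d_in)
|       h_prev = torch.randn(d_h)
|       W_z = torch.randn(d_in, d_h)
|       U_z = torch.randn(d_h, d_h)
|
|       z_gru    = torch.sigmoid(x_t @ W_z + h_prev @ U_z)  # standard GRU gate
|       z_mingru = torch.sigmoid(x_t @ W_z)                 # minGRU gate: no h_prev term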
| _0ffh wrote:
| The trick is to make sure the recursive dependency stays
| linear, that's how you enable parallel training.
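| Concretely (a rough sketch, not the paper's code): the update has
| the form h_t = a_t * h_{t-1} + b_t, where a_t and b_t are
| computed from the current token alone, so the recurrence is
| linear in h:
|
|       import torch
|
|       def rnn_sequential(a, b):
|           # a, b: (seq_len, hidden); a_t = 1 - z_t, b_t = z_t * h_tilde_t,
|           # both functions of x_t only. This loop is what the
|           # parallel scan replaces at training time.
|           h = torch.zeros_like(b[0])
|           out = []
|           for a_t, b_t in zip(a, b):
|               h = a_t * h + b_t    # linear in h: no nonlinearity wraps h
|               out.append(h)
|           return torch.stack(out)
|
| Because a and b don't depend on h, all of them can be computed up
| front in one batched matmul, and the loop itself reduces to a
| prefix scan (see the combine-function sketch further down the
| thread).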
| hydrolox wrote:
| Betteridge's law of headlines?
| woah wrote:
| For paper titles, the law is that the answer is always "yes"
| bunderbunder wrote:
| Not always, I think?
|
| Opinions probably differ, for example, on John Backus's paper
| "Can programming be liberated from the Von Neumann style?"
| Many fans of functional programming would say the answer is
| yes, but Backus himself expressed less enthusiasm in
| interviews later in his life.
|
| I think the important point, though, is that academic papers
| and newspaper articles are _not the same_, and titles in the
| form of questions function differently in the two domains.
| Journalists tend to use titles like these to dissemble and
| sensationalize. When academics use these kinds of titles for
| peer-reviewed articles, it's because they really are asking
| an honest question. Backus was doing it in his paper. The
| authors of this paper are doing the same. They end the paper
| by reiterating the question before launching into a discussion
| of the limitations that prevent them from reaching any firm
| conclusion about the answer.
| nephanth wrote:
| More like "we aren't sure, but we have good reasons not to
| exclude the possibility"
| hiddencost wrote:
| Note Yoshua Bengio in the author list. This shouldn't be taken
| lightly.
| auggierose wrote:
| And this is where science breaks down.
| hotspot_one wrote:
| Not really, because:
|
| 1) Yoshua's reputation would take a hit if this paper were
| bullshit, so he has extrinsic motivation to make it good.
|
| 2) Yoshua has enough experience to know what is going on in the
| field; you don't have to ask whether he forgot about a certain
| architecture or the work of a certain research group that would
| contradict his findings -- if such work exists and is credible,
| it is very likely to be discussed in the paper.
|
| 3) This test answers something a leader in the field thinks is
| important enough to work on, else he wouldn't be involved.
|
| Also note, the poster said the paper shouldn't be taken
| lightly. That doesn't mean we need to take it blindly. It
| only means we cannot dismiss it out of hand, if we have a
| different view we would need substantive arguments to defend
| our view.
|
| I've overturned the field leader several times in science,
| but that's only because I acknowledged what they got right
| and that they were indeed the person who got it right.
| DAGdug wrote:
| " I've overturned the field leader several times in
| science" Either that makes you a field leader yourself, or
| you did it for trivial things, or you're BSing. Which one
| is it?
| exe34 wrote:
| there's a big space between leader and trivial. it's
| entirely possible to point out the top leader in your
| field is wrong on ten things over a career, without
| becoming the top leader yourself.
| auggierose wrote:
| > It only means we cannot dismiss it out of hand, if we
| have a different view we would need substantive arguments
| to defend our view.
|
| You will need to do that anyway, no matter if Yoshua is on
| the paper, or not. I understand that people have limited
| bandwidth, and so they need shortcuts, and they need to
| justify these shortcuts to themselves somehow (of course
| the justifications are nonsense). Maybe AI will help here.
| imjonse wrote:
| To their credit, the authors (Y. Bengio among them) end the paper
| with the question, not suggesting they know the answer. These
| models are very small even by academic standards so any finding
| would not necessarily extend to current LLM scales. The main
| conclusion is that RNN class networks can be trained as
| efficiently as modern alternatives but the resulting performance
| is only competitive at small scale.
| phkahler wrote:
| >> These models are very small even by academic standards so
| any finding would _not necessarily_ extend to current LLM
| scales.
|
| Emphasis on not necessarily.
|
| >> The main conclusion is that RNN class networks can be
| trained as efficiently as modern alternatives but the resulting
| performance is only competitive at small scale.
|
| Shouldn't the conclusion be "the resulting competitive
| performance has only been confirmed at small scale"?
| xnx wrote:
| It's a curse and a blessing that discussion of topics happens in so
| many different places. I found this comment on Twitter/X
| interesting: https://x.com/fchollet/status/1841902521717293273
|
| "Interesting work on reviving RNNs.
| https://arxiv.org/abs/2410.01201 -- in general the fact that
| there are many recent architectures coming from different
| directions that roughly match Transformers is proof that
| architectures aren't fundamentally important in the curve-fitting
| paradigm (aka deep learning)
|
| Curve-fitting is about embedding a dataset on a curve. The
| critical factor is the dataset, not the specific hard-coded bells
| and whistles that constrain the curve's shape. As long as your
| curve is sufficiently expressive all architectures will converge
| to the same performance in the large-data regime."
| islewis wrote:
| > "As long as your curve is sufficiently expressive all
| architectures will converge to the same performance in the
| large-data regime."
|
| I haven't fully ingested the paper yet, but it looks like it's
| focused more on compute optimization than the size of the
| dataset:
|
| > ... and (2) are fully parallelizable during training (175x
| faster for a sequence of length 512)
|
| Even if many types of architectures converge to the same loss
| over time, finding the one that converges the fastest is quite
| valuable given the cost of running GPU's at scale.
| teruakohatu wrote:
| > Even if many types of architectures converge to the same
| loss over time, finding the one that converges the fastest is
| quite valuable given the cost of running GPU's at scale.
|
| This! Not just fastest but with the lowest resources in
| total.
|
| Fully connected neural networks are universal function
| approximators. Technically we don't need anything but an FNN,
| but the memory requirements and speed would be abysmal, far
| beyond the realm of practicality.
| actionfromafar wrote:
| Unless we could build chips in 3D?
| byearthithatius wrote:
| > finding the one that converges the fastest is quite
| valuable given the cost of running GPU's at scale
|
| Not to him, he runs the ARC challenge. He wants a new
| approach entirely. Something capable of few-shot learning on
| out-of-distribution patterns... somehow.
| acchow wrote:
| What it will come down to is computational efficiencies. We
| don't want to retrain once a month - we want to retrain
| continuously. We don't want one agent talking to 5 LLMs. We
| want thousands of LLMs all working in concert.
| ActorNightly wrote:
| This, and also the way models are trained has to be rethought.
| Backprop is good for figuring out complex function mappings, but
| not for storing information.
| fsndz wrote:
| after reading this paper, I am now convinced we will need more
| than curve fitting to build AGI:
| https://medium.com/@fsndzomga/there-will-be-no-agi-d9be9af44...
| swolchok wrote:
| paper is paywalled; just logging into Medium won't do it
| fsndz wrote:
| sorry for the paywall, you can read the free version here:
| https://www.lycee.ai/blog/why-no-agi-openai
| xpl wrote:
| I would like to read it, but it's under a paywall.
| alwa wrote:
| https://archive.is/nGaiU
| ahzhou wrote:
| Author: @fandzomga Username: fsndz
|
| Why try to funnel us to your paywalled article?
| josh-sematic wrote:
| One reason why I'm excited about o1 is that it seems like
| OpenAI have cracked the nut of effective RL during training
| time, which takes us out of the domain of just fitting to the
| curve of "what a human would have said next." I just finished
| writing a couple blog posts about this; the first [1] covers
| some problems with that approach and the second [2] talks
| about what alternatives might look like.
|
| [1] https://www.airtrain.ai/blog/how-openai-o1-changes-the-
| llm-t... [2] https://www.airtrain.ai/blog/how-
| openai-o1-changes-the-llm-t...
| vineyardmike wrote:
| TLDR: "statistically fitting token output is not the same as
| human intelligence, and human intelligence and AGI are
| contradictory anyways (because humans make mistakes)"
|
| Saved you the paywall click to the poorly structured medium
| article :)
| acchow wrote:
| > After reading this paper, I am now
|
| Is this your paper?
| Lerc wrote:
| I remember one of the initial transformer people saying in an
| interview that they didn't think this was the "one true
| architecture" but a lot of the performance came from people
| rallying around it and pushing in the one direction.
|
| On the other hand, while _" As long as your curve is
| sufficiently expressive all architectures will converge to the
| same performance in the large-data regime."_ is true, a
| sufficiently expressive mechanism may not be computationally or
| memory efficient. As both are constraints on what you can
| actually build, it's not whether the architecture can produce
| the result, but whether a feasible/practical instantiation of
| that architecture can produce the result.
| ants_everywhere wrote:
| > is proof that architectures aren't fundamentally important in
| the curve-fitting paradigm (aka deep learning)
|
| (Somewhat) fun and (somewhat) related fact: there's a whole
| cottage industry of "is all you need" papers
| https://arxiv.org/search/?query=%22is+all+you+need%22&search...
| TaurenHunter wrote:
| Reminds me of the "Considered Harmful" articles:
|
| https://meyerweb.com/eric/comment/chech.html
| jprete wrote:
| I wonder if there's something about tech culture - or tech
| people - that encourages them to really, really like
| snowclones.
| observationist wrote:
| Yes. Do stuff that other people have been successful
| doing. Monkey see, monkey do - it's not a tech people
| thing, it's a human thing.
|
| Tech just happens to be most on display at the moment -
| because tech people are building the tools and the
| parameters and the infrastructure handling all our
| interactions.
| wongarsu wrote:
| One big thing that bells and whistles do is limit the training
| space.
|
| For example when CNNs took over computer vision that wasn't
| because they were doing something that dense networks couldn't
| do. It was because they removed a lot of edges that didn't
| really matter, allowing us to spend our training budget on
| deeper networks. Similarly transformers are great because they
| allow us to train gigantic networks somewhat efficiently. And
| this paper finds that if we make RNNs a lot faster to train
| they are actually pretty good. Training speed and efficiency
| remain the big bottleneck, not the actual expressiveness of the
| architecture.
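| To make the "removed edges" point concrete, compare parameter
| counts for a single layer over a small image (illustrative
| numbers, not from the paper):
|
|       import torch.nn as nn
|
|       # dense layer mapping a 3x32x32 image to a same-sized output
|       dense = nn.Linear(3 * 32 * 32, 3 * 32 * 32)
|       # conv layer doing a similar mapping with local 3x3
|       # connections and weight sharing
|       conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)
|
|       count = lambda m: sum(p.numel() for p in m.parameters())
|       print(count(dense))   # 9,440,256
|       print(count(conv))    # 84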
| dheera wrote:
| I mean, transformer-based LLMs are RNNs, just really really
| really big ones with very wide inputs that maintain large
| amounts of context.
| immibis wrote:
| No. An RNN has an arbitrarily-long path from old inputs to
| new outputs, even if in practice it can't exploit that path.
| Transformers have fixed-size input windows.
| og_kalu wrote:
| You can't have a fixed state and an arbitrarily long path from
| the input. Well, you can, but then it's meaningless, because you
| fundamentally cannot keep stuffing information of arbitrary
| length into a fixed-size state. RNNs effectively have fixed-size
| input windows.
| immibis wrote:
| The _path_ is arbitrarily _long_, not wide. It is _possible_
| for an RNN to be made that remembers the first word of the
| input, no matter how long the input is. This is not possible
| with a transformer, so we know they are fundamentally different.
| quotemstr wrote:
| But an RNN isn't _going_ to remember the first token of
| input. It won't know until it sees the last token
| whether that first token was relevant after all, so it
| has to learn token-specific update rules that let it
| guess how long to hold what kinds of information. (In
| multi-layer systems, the network uses ineffable
| abstractions rather than tokens, but the same idea
| applies.)
|
| What the RNN must be doing reminds me of "sliding window
| attention" --- the model learns how to partition its
| state between short- and long-range memories to minimize
| overall loss. The two approaches seem related, perhaps
| even equivalent up to implementation details.
| OkayPhysicist wrote:
| The most popular RNNs (the ones that were successful enough for
| Google Translate and the like) actually had this behavior baked
| into the architecture: LSTMs, "Long Short-Term Memory" networks.
| dheera wrote:
| A chunk of the output still goes into the transformer
| input, so the arbitrarily-long path still exists, it just
| goes through a decoding/encoding step.
| quantadev wrote:
| Most LLMs aren't even using a "curve" yet at all, right? All
| they're using is a series of linear equations because the model
| weights are a simple multiply and add (i.e. basic NN
| Perceptron). Sure there's a squashing function on the output to
| keep it in a range from 0 to 1 but that's done BECAUSE we're
| just adding up stuff.
|
| I think probably future NNs will be maybe more adaptive than
| this perhaps where some Perceptrons use sine wave functions, or
| other kinds of math functions, beyond just linear "y=mx+b"
|
| It's astounding that we DID get the emergent intelligence from
| just doing this "curve fitting" onto "lines" rather than actual
| "curves".
| OkayPhysicist wrote:
| The "squashing function" necessarily is nonlinear in
| multilayer nueral networks. A single layer of a neural
| network can be quite simply written a weight matrix, times an
| input vector, equalling an output vector, like so
|
| Ax = y
|
| Adding another layer is just multiplying a different set of
| weights times the output of the first, so
|
| B(Ax)= y
|
| If you remember your linear algebra course, you might see the
| problem: that can be simplified
|
| (BA)x = y
|
| Cx = y
|
| Completely indistinguishable from a single layer, thus only
| capable of modeling linear relationships.
|
| To prevent this collapse, a non linear function must be
| introduced between each layer.
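| A three-line check of that collapse (illustrative):
|
|       import torch
|
|       A = torch.randn(5, 4)
|       B = torch.randn(3, 5)
|       x = torch.randn(4)
|
|       two_layers = B @ (A @ x)
|       one_layer  = (B @ A) @ x          # a single collapsed layer C = BA
|       print(torch.allclose(two_layers, one_layer, atol=1e-5))  # True
|
|       # with a nonlinearity in between, no single matrix C
|       # reproduces it: B @ torch.relu(A @ x) != (B @ A) @ x in general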
| quantadev wrote:
| Right. All the squashing is doing is keeping the output of
| any neuron in a range of below 1.
|
| But the entire NN itself (Perceptron ones, which most LLMs
| are) is still completely using nothing but linearity to
| store all the knowledge from the training process. All the
| weights are just an 'm' in the basic line equation
| 'y=m*x+b'. The entire training process does nothing but
| adjust a bunch of slopes of a bunch of lines. It's totally
| linear. No non-linearity at all.
| nazgul17 wrote:
| The non linearities are fundamental. Without them, any
| arbitrarily deep NN is equivalent to a shallow NN (easily
| computable, as GP was saying), and we know those can't
| even solve the XOR problem.
|
| > nothing but linearity
|
| No, if you have non linearities, the NN itself is _not_
| linear. The non linearities are not there primarily to
| keep the outputs in a given range, though that's important, too.
| sakras wrote:
| I figured this was pretty obvious given that MLPs are universal
| function approximators. A giant MLP could achieve the same
| results as a transformer. The problem is the scale - we can't
| train a big enough MLP. Transformers are a performance
| optimization, and that's why they're useful.
| m11a wrote:
| It'd be nice to see more of how this compares to Mamba. Looks
| like, in performance, they're not leagues apart and it's just a
| _different_ architecture, not necessarily better or worse?
| dsamarin wrote:
| The name of the paper plays on the title of the paper that
| spawned the Transformer architecture, which is itself a
| reference to the song "All You Need Is Love" by the Beatles.
| https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
| vundercind wrote:
| I eagerly await the backlash to suggesting any one thing is all
| you need, the first shot of which shall surely be titled: "'All
| you need' Considered Harmful"
| ants_everywhere wrote:
| Surely the universe is all you need though
| marcosdumay wrote:
| R == Recurrent
|
| In theory the answer to the question should be "yes": they are
| Turing complete.
|
| The real question is about how to train them, and the paper is
| about that.
| baanist wrote:
| Why aren't AI researchers automating the search for efficient
| architectures?
| ks2048 wrote:
| https://en.wikipedia.org/wiki/Neural_architecture_search
| kelseyfrog wrote:
| The search space is far too wide and difficult to parameterize,
| and there is a wide gap between effective and ineffective
| architectures - i.e. a very small change can make a network
| effectively DOA.
| hedgehog wrote:
| Notably architecture search was popular for small vision
| nets where the cost of many training runs was low enough. I
| suspect some of the train-then-prune approaches will come
| back, but even there only by the best funded teams.
| ActorNightly wrote:
| There has been some work, but the problem is that it's such a
| massive search space. Philosophically speaking, if you look at
| how humans came into existence, you could make an argument that
| the process of evolution from basic lifeforms can be represented
| as one giant computation per minute across all of Earth, where
| genetic selection happens and computation proceeds to the next
| minute. That's a fuckload of compute.
|
| In more practical terms, you would imagine that an advanced
| model contains some semblance of a CPU to be able to truly
| reason. Given that CPUs can be built entirely from NAND gates
| (which take 2 neurons to represent), and are structured in a
| recurrent way,
| you fundamentally have to rethink how to train such a
| network, because backprop obviously won't work to capture
| things like binary decision points.
| baanist wrote:
| I thought the whole point of neural networks was that they
| were good at searching through these spaces. I'm pretty
| sure OpenAI is pruning their models behind the scenes to
| reduce their costs because that's the only way they can
| keep reducing the cost per token. So their secret sauce at
| this point is whatever pruning AI they're using to whittle
| the large computation graphs into more cost efficient
| consumer products.
| slashdave wrote:
| Because no one knows how to iterate over all possible
| architectures? Heck, there are certainly entire classes of
| architectures that no one has even considered.
| jjtheblunt wrote:
| What are you saying is Turing-complete?
| baanist wrote:
| Neural networks are Turing complete, i.e. there is a universal
| neural network that can compute any effectively computable
| function [1]. Incidentally, when this is combined with Rice's
| theorem [2], it means that safety research is essentially an
| unsolvable problem, because any non-trivial property of a
| sufficiently complex neural network, e.g. one that can simulate
| a Turing machine, cannot be predicted with finite computation.
|
| 1: https://www.sciencedirect.com/science/article/pii/08939659
| 91...
|
| 2: https://en.wikipedia.org/wiki/Rice%27s_theorem?useskin=vec
| to...
| logicchains wrote:
| The model in the paper isn't a "real" RNN due making it
| parallelizable, for same the reasons described in
| https://arxiv.org/abs/2404.08819 , and hence is theoretically
| less powerful than a "real" RNN (struggles at some classes of
| problems that RNNs traditionally excel at). On the other hand,
| https://arxiv.org/abs/2405.04517 contains a "real" RNN component,
| which demonstrates a significant improvement on the kind of
| state-tracking problems that transformers struggle with.
| robertsdionne wrote:
| These are real RNNs, they still depend upon the prior hidden
| state, it's just that the gating does not. The basic RNN
| equation can be parallelized with parallel prefix scan
| algorithms.
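| A rough sketch of that (illustrative, not the paper's
| implementation): each step is the affine map h -> a_t * h + b_t,
| and composing two such maps gives another affine map, so the
| whole sequence can be combined with an associative scan:
|
|       import torch
|
|       def combine(f, g):
|           # compose (h -> a1*h + b1) then (h -> a2*h + b2)
|           a1, b1 = f
|           a2, b2 = g
|           return a2 * a1, a2 * b1 + b2
|
|       def prefix_scan(a, b):
|           # naive O(T) loop over the associative combine; a real
|           # implementation runs it in O(log T) parallel steps
|           # (Blelloch-style) on the GPU. Assumes h_0 = 0.
|           acc = (a[0], b[0])
|           out = [acc[1]]
|           for t in range(1, a.shape[0]):
|               acc = combine(acc, (a[t], b[t]))
|               out.append(acc[1])
|           return torch.stack(out)
|
| The outputs match the sequential recurrence exactly; the win is
| that the combine is associative, so the order of grouping (and
| hence parallelization) doesn't matter.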
| bob1029 wrote:
| > Transformers required ~2.5x more training steps to achieve
| comparable performance, overfitting eventually.
|
| > RNNs are particularly suitable for sequence modelling settings
| such as those involving time series, natural language processing,
| and other sequential tasks where context from previous steps
| informs the current prediction.
|
| I would like to draw an analogy to digital signal processing. If
| you think of the recurrent-style architectures as IIR filters and
| feedforward-only architectures as FIR filters, you will likely
| find many parallels.
|
| The most obvious to me being that IIR filters typically require
| far fewer elements to produce the same response as an equivalent
| FIR filter. Granted, the FIR filter is often easier to
| implement/control/measure in practical terms (fixed-point
| arithmetic hardware == ML architectures that can run on GPUs).
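| A toy version of that point (numpy, illustrative): a one-pole
| IIR low-pass needs a single coefficient, while an FIR filter
| needs many taps to approximate the same response:
|
|       import numpy as np
|
|       alpha = 0.1
|       x = np.random.randn(1000)
|
|       # IIR: y_t = (1 - alpha) * y_{t-1} + alpha * x_t
|       y_iir = np.zeros_like(x)
|       prev = 0.0
|       for t in range(len(x)):
|           prev = (1 - alpha) * prev + alpha * x[t]
|           y_iir[t] = prev
|
|       # FIR: needs a long (here truncated to 100 taps) impulse
|       # response to match the same behaviour
|       taps = alpha * (1 - alpha) ** np.arange(100)
|       y_fir = np.convolve(x, taps)[: len(x)]
|       # y_fir matches y_iir up to the truncation error of the taps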
|
| I don't think we get to the exponential scary part of AI without
| some fundamentally recurrent architecture. I think things like
| LSTM are kind of an in-between hack in this DSP analogy - You
| could look at it as FIR with dynamic coefficients. Neuromorphic
| approaches seem like the best long term bet to me in terms of
| efficiency.
| PunchTornado wrote:
| To me this is further evidence that these LLMs learn only to
| speak English, but there is no reasoning at all in them. If you
| can simplify the architecture this much and obtain the same
| results, given how complex we know the brain is, that says
| something.
| quantadev wrote:
| Every LLM expert on the planet agrees LLMs are doing
| "reasoning". No one says they have feelings or qualia, but we
| all know there's definitely genuinely artificial reasoning
| happening.
|
| What LLMs have shown both Neuroscience and Computer Science is
| that reasoning is a mechanical process (or can be simulated by
| mechanical processes) and is not exclusively associated with
| consciousness.
| adamnemecek wrote:
| Yes, all machine learning can be interpreted in terms of
| approximating the partition function.
|
| This is obvious when one considers the connections between
| Transformers, RNNs, Hopfield networks and the Ising model, a
| model from statistical mechanics which is solved by calculating
| the partition function.
|
| This interpretation provides us with some very powerful tools
| that are commonplace in math and physics but which are not talked
| about in CS & ML.
|
| I'm working on a startup http://traceoid.ai which takes this
| exact view. Our approach enables faster training and inference,
| interpretability and also scalable energy-based models, the Holy
| Grail of machine learning.
|
| Join the discord https://discord.com/invite/mr9TAhpyBW or follow
| me on twitter https://twitter.com/adamnemecek1
| mkaic wrote:
| I strongly enjoy the simplicity of their "minGRU" architecture.
| It's basically just:
|
|       import torch
|       import torch.nn as nn
|
|       class MinGRU(nn.Module):
|           def __init__(self, token_size, hidden_state_size):
|               super().__init__()
|               self.token_to_proposal = nn.Linear(token_size, hidden_state_size)
|               self.token_to_mix_factors = nn.Linear(token_size, hidden_state_size)
|
|           def forward(self, previous_hidden_state, current_token):
|               proposed_hidden_state = self.token_to_proposal(current_token)
|               mix_factors = torch.sigmoid(self.token_to_mix_factors(current_token))
|               # lerp(a, b, w) = a + w * (b - a): keep mix_factors worth
|               # of the previous state, take the rest from the proposal
|               return torch.lerp(proposed_hidden_state, previous_hidden_state, mix_factors)
|
| And since the proposed hidden states and mix factors for each
| layer are both only dependent on the current token, you can
| compute all of them in parallel if you know the whole sequence
| ahead of time (like during training), and then combine them in
| linear time using parallel scan.
|
| The fact that this is competitive with transformers and state-
| space models in their small-scale experiments is gratifying to
| the "best PRs are the ones that delete code" side of me. That
| said, we won't know for sure if this is a capital-B Breakthrough
| until someone tries scaling it up to parameter and data counts
| comparable to SOTA models.
|
| One detail I found really interesting is that they seem to do all
| their calculations in log-space, according to the Appendix. They
| say it's for numerical stability, which is curious to me--I'm not
| sure I have a good intuition for why running everything in log-
| space makes the model more stable. Is it because they removed the
| tanh from the output, making it possible for values to explode if
| calculations are done in linear space?
|
| EDIT: Another thought--it's kind of fascinating that this sort of
| sequence modeling works at all. It's like if I gave you all the
| pages of a book individually torn out and in a random order, and
| asked you to try to make a vector representation for each page as
| well as instructions for how to mix that vector with the vector
| representing all previous pages -- except you have zero knowledge
| of those previous pages. Then, I take all your page vectors,
| sequentially mix them together in-order, and grade you based on
| how good of a whole-book summary the final vector represents.
| Wild stuff.
|
| FURTHER EDIT: Yet _another_ thought--right now, they're just
| using two dense linear layers to transform the token into the
| proposed hidden state and the lerp mix factors. I'm curious what
| would happen if you made those transforms MLPs instead of
| singular linear layers.
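| A sketch of that variant (illustrative, untested): since the
| transforms would still only see the current token, the
| parallel-scan property is preserved:
|
|       import torch.nn as nn
|
|       def mlp(in_dim, out_dim, hidden_dim=256):
|           return nn.Sequential(
|               nn.Linear(in_dim, hidden_dim),
|               nn.GELU(),
|               nn.Linear(hidden_dim, out_dim),
|           )
|
|       # in MinGRU.__init__, swap in:
|       # self.token_to_proposal = mlp(token_size, hidden_state_size)
|       # self.token_to_mix_factors = mlp(token_size, hidden_state_size)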
| immibis wrote:
| This architecture, on the surface, seems to preclude the basic
| function of recognizing sequences of tokens. At the very least,
| it seems like it should suffer from something like the pumping
| lemma: if [the ][cat ][is ][black ] results in the output
| getting close to a certain vector, [the ][cat ][is ][black
| ][the ][cat ][is ][black ][the ][cat ][is ][black ] should get
| even closer to that vector and nowhere close to a "why did you
| just repeat the same sentence three times" vector? Without non-
| linear mixing between input token and hidden state, there will
| be a lot of linear similarities between similar token
| sequences...
| mkaic wrote:
| Counterpoint: the hidden state at the beginning of
| ([the][cat][is][black]) x 3 is (probably) initialized to all
| zeros, but after seeing those first 4 tokens, it will _not_
| be all zeros. Thus, going into the second repetition of the
| sentence, the model has a different initial hidden state, and
| should exhibit different behavior. I think this makes it
| possible for the model to learn to recognize repeated
| sequences and avoid your proposed pitfall.
| slashdave wrote:
| Log space is important if the token probabilities span a large
| range of values (powers). There is a reason that maximum
| likelihood fitting is always performed with log likelihoods.
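| In this paper's case specifically, the scan multiplies many gate
| values in (0, 1), so the running products underflow quickly in
| linear space. A rough sketch of the kind of log-space scan the
| appendix describes (my reconstruction, not their code), for
| h_t = a_t * h_{t-1} + b_t with positive a_t, b_t and h_0 = 0:
|
|       import torch
|
|       def scan_in_log_space(log_a, log_b):
|           # h_t = sum_{i<=t} b_i * prod_{i<j<=t} a_j, computed as
|           # log h_t to avoid underflow/overflow
|           a_star = torch.cumsum(log_a, dim=0)
|           log_h = a_star + torch.logcumsumexp(log_b - a_star, dim=0)
|           return torch.exp(log_h)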
| trott wrote:
| My feeling is that the answer is "no", in the sense that these
| RNNs wouldn't be able to universally replace Transformers in
| LLMs, even though they might be good enough in some cases and
| beat them in others.
|
| Here's why.
|
| A user of an LLM _might_ give the model some long text and then
| say "Translate this into German please". A Transformer can look
| back at its whole history. But what is an RNN to do? While the
| length of its context is unlimited, the amount of information the
| model retains about it is bounded by whatever is in its hidden
| state at any given time.
|
| Relevant: https://arxiv.org/abs/2402.01032
| mkaic wrote:
| The counterargument here is that you can just scale the size of
| the hidden state sufficiently such that it can hold compressed
| representations of whatever-length sequence you like.
| Ultimately, what I care about is whether RNNs could compete
| with transformers if FLOPs are held constant--something TFA
| doesn't really investigate.
| psb217 wrote:
| Well, that's what Transformer already does... One problem
| with the scaling you're describing is that there would be a
| massive amount of redundant information stored in hidden
| activations when training the RNN. The hidden state at each
| time step t in the sequence would need to contain all info
| that (i) could be useful for predicting the token at time t
| and (ii) that could be useful for predicting tokens at times
| >t. (i) is obvious and (ii) holds since all information about
| the past is transferred to future predictions through the
| current hidden state. In principle, Transformers can avoid
| storing redundant info in multiple hidden states at the cost
| of having to maintain and access (via attention) a larger
| hidden state at test/eval time.
| mkaic wrote:
| > there would be a massive amount of redundant information
| stored in hidden activations
|
| Is there a way to prove this? One potential caveat that
| comes to mind for me is that perhaps the action of lerping
| between the old state and the new could be used by the
| model to perform semantically meaningful transformations on
| the old state. I guess in my mind it just doesn't seem
| obvious that the hidden state is necessarily a collection
| of "redundant information" -- perhaps the information is
| culled/distilled the further along in the sequence you go?
| There will always be _some_ redundancy, sure, but I don't
| think that such redundancy necessarily means we _have_ to
| use superlinear methods like attention.
| phkahler wrote:
| >> A user of an LLM might give the model some long text and
| then say "Translate this into German please". A Transformer can
| look back at its whole history.
|
| Which isn't necessary if you instead say "translate the
| following to German." Then all it needs is to remember the task
| at hand and a much smaller amount of recent input. Well, that
| and the ability to output in parallel with processing input.
| og_kalu wrote:
| It's necessary for arbitrary information processing if you
| can forget and have no way to "unforget".
|
| A model can decide to forget something that turns out to be
| important for some future prediction. A human can go back and
| re-read/listen, etc. A transformer is always re-reading, but an
| RNN can't and is fucked.
| magicalhippo wrote:
| That's just because we twisted its arm. One could for example
| feed the reversed input afterwards, i.e. abc|cba where | is a
| special token. That would allow it to react to any part of the
| message.
| trott wrote:
| People did something similar to what you are describing 10
| years ago: https://arxiv.org/abs/1409.0473
|
| But it's trained on translations, rather than the whole
| Internet.
| DoctorOetker wrote:
| Also, a lightweight network could do a first pass to identify
| tasks, instructions, constraints etc, and then a second pass
| could use the RNN.
|
| Consider the flood fill algorithm or union-find algorithm,
| which feels magical upon first exposure.
|
| https://en.wikipedia.org/wiki/Hoshen%E2%80%93Kopelman_algori.
| ..
|
| Having 2 passes can enable so much more than a single pass.
|
| Another alternative could be to have a first pass make notes
| in a separate buffer while parsing the input. The bandwidth
| of the note taking and reading can be much much lower than
| that required for fetching the billions of parameters.
| slashdave wrote:
| > the amount of information the model retains about it is
| bounded by whatever is in its hidden state
|
| This is no different than a transformer, which, after all, is
| bound by a finite state, just organized in a different manner.
| fhdsgbbcaA wrote:
| We really need a [preprint] flag for unreviewed papers.
| limapedro wrote:
| This is such an interesting paper. Sadly they don't have big
| models; I'd like to see a model trained on TinyStories or even
| C4, since it should be faster than the transformer variant, and
| see how it compares.
| charlescurt123 wrote:
| I find the entire field lacking when it comes to long-horizon
| problems. Our current, widely used solution is to scale, but
| we're nowhere near achieving the horizon scales even small mammal
| brains can handle. Our models can have trillions of parameters,
| yet a mouse brain would still outperform them on long-horizon
| tasks and efficiency. It's something small, simple, and elegant--
| an incredible search algorithm that not only finds near-optimal
| routes but also continuously learns on a fixed computational
| budget.
|
| I'm honestly a bit envious of future engineers who will be
| tackling these kinds of problems with a 100-line Jupyter notebook
| on a laptop years from now. If we discovered the right method or
| algorithm for these long-horizon problems, a 2B-parameter model
| might even outperform current models on everything except short,
| extreme reasoning problems.
|
| The only solution I've ever considered for this is expanding a
| model's dimensionality over time, rather than focusing on perfect
| weights. The higher the dimensionality you can provide to a model,
| the greater its theoretical storage capacity. This could resemble
| a two-layer model--one layer acting as a superposition of
| multiple ideal points, and the other layer knowing how to use
| them.
|
| When you think about the loss landscape, imagine it with many
| minima for a given task. If we could create a method that
| navigates these minima by reconfiguring the model when needed, we
| could theoretically develop a single model with near-infinite
| local minima--and therefore, higher-dimensional memory. This may
| sound wild, but consider the fact that the human brain
| potentially creates and disconnects thousands of new connections
| in a single day. Could it be that these connections steer our
| internal loss landscape between different minima we need
| throughout the day?
___________________________________________________________________
(page generated 2024-10-03 23:00 UTC)