[HN Gopher] Were RNNs all we needed?
       ___________________________________________________________________
        
       Were RNNs all we needed?
        
       Author : beefman
       Score  : 212 points
       Date   : 2024-10-03 17:31 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | tehsauce wrote:
       | I haven't gone through the paper in detail yet but maybe someone
        | can answer. If you remove the hidden state from an RNN as they
        | say they've done, what's left? An MLP predicting from a single
        | token?
        
         | statusfailed wrote:
         | I only had a quick look, but it looks like they tweaked the
         | state update so the model can be run with parallel scan instead
         | of having to do it sequentially.
        
         | jfcoa wrote:
          | It doesn't remove it completely; it removes certain
          | dependencies on it so that it can be computed by parallel scan.
          | There is still a hidden state. It bears some similarity to what
          | was done with Mamba.
        
         | bunderbunder wrote:
          | They didn't remove the hidden state entirely; they just removed
         | it from the input, forget and update gates. I haven't digested
         | the paper either, but I think that in the case of a GRU this
         | means that the hidden state update masking (z_t and r_t in the
         | paper's formulas) only depends on the new input, not the input
         | plus the prior hidden state.
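          | 
          | A rough sketch of the difference as I read it (illustrative
          | names, not the paper's code):
          | 
          |     import torch
          |     from torch import nn
          | 
          |     d_in, d_h = 16, 32
          |     x_t, h_prev = torch.randn(d_in), torch.randn(d_h)
          | 
          |     # standard GRU update gate: depends on the token AND the
          |     # prior hidden state, so it must be computed sequentially
          |     W_z, U_z = nn.Linear(d_in, d_h), nn.Linear(d_h, d_h)
          |     z_gru = torch.sigmoid(W_z(x_t) + U_z(h_prev))
          | 
          |     # minGRU-style gate: depends on the token only, so every
          |     # z_t can be precomputed for the whole sequence in parallel
          |     z_min = torch.sigmoid(W_z(x_t))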
        
         | _0ffh wrote:
         | The trick is to make sure the recursive dependency stays
         | linear, that's how you enable parallel training.
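          | 
          | A toy illustration of why the linearity matters (my own
          | sketch, not from the paper): a linear recurrence
          | h_t = a_t * h_{t-1} + b_t unrolls into cumulative products
          | and sums, which have parallel implementations.
          | 
          |     import torch
          | 
          |     T = 6
          |     a, b = torch.rand(T), torch.rand(T)
          | 
          |     # sequential reference, starting from h0 = 0
          |     h = torch.tensor(0.0)
          |     for t in range(T):
          |         h = a[t] * h + b[t]
          | 
          |     # closed form: h_T = sum_k (prod_{j>k} a_j) * b_k, built from
          |     # cumprod/sum; this naive division is numerically fragile,
          |     # presumably why the paper's appendix works in log-space
          |     A = torch.cumprod(a, dim=0)
          |     h_par = torch.sum((A[-1] / A) * b)
          |     assert torch.allclose(h, h_par)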
        
       | hydrolox wrote:
       | Betteridge's law of headlines?
        
         | woah wrote:
         | For paper titles, the law is that the answer is always "yes"
        
           | bunderbunder wrote:
           | Not always, I think?
           | 
           | Opinions probably differ, for example, on John Backus's paper
           | "Can programming be liberated from the Von Neumann style?"
           | Many fans of functional programming would say the answer is
           | yes, but Backus himself expressed less enthusiasm in
           | interviews later in his life.
           | 
           | I think the important point, though, is that academic papers
            | and newspaper articles are _not the same_, and titles in the
           | form of questions function differently in the two domains.
           | Journalists tend to use titles like these to dissemble and
           | sensationalize. When academics use these kinds of titles for
           | peer-reviewed articles, it's because they really are asking
           | an honest question. Backus was doing it in his paper. The
           | authors of this paper are doing the same. They end the paper
            | by reiterating the question before launching into a
           | discussion of the limitations that prevent them from reaching
           | any firm conclusions on the answer to this question.
        
           | nephanth wrote:
           | More like "we aren't sure, but we have good reasons not to
           | exclude the possibility"
        
       | hiddencost wrote:
       | Note Yoshua Bengio in the author list. This shouldn't be taken
       | lightly.
        
         | auggierose wrote:
         | And this is where science breaks down.
        
           | hotspot_one wrote:
            | Not really, because:
            | 
            | 1) Yoshua's reputation would take a hit if this paper were
            | bullshit, so he has extrinsic motivation to make it good.
            | 
            | 2) Yoshua has enough experience to know what is going on in
            | the field; you don't have to ask if he forgot about a
            | certain architecture or the work of a certain research group
            | that would contradict his findings -- if such work exists
            | and is credible, it is very likely to be discussed in the
            | paper.
            | 
            | 3) This test answers something a leader in the field thinks
            | is important enough to work on, else he wouldn't be
            | involved.
            | 
            | Also note, the poster said the paper shouldn't be taken
            | lightly. That doesn't mean we need to take it blindly. It
            | only means we cannot dismiss it out of hand; if we have a
            | different view, we would need substantive arguments to
            | defend it.
           | 
           | I've overturned the field leader several times in science,
           | but that's only because I acknowledged what they got right
           | and that they were indeed the person who got it right.
        
             | DAGdug wrote:
             | " I've overturned the field leader several times in
             | science" Either that makes you a field leader yourself, or
             | you did it for trivial things, or you're BSing. Which one
             | is it?
        
               | exe34 wrote:
               | there's a big space between leader and trivial. it's
               | entirely possible to point out the top leader in your
               | field is wrong on ten things over a career, without
               | becoming the top leader yourself.
        
             | auggierose wrote:
             | > It only means we cannot dismiss it out of hand, if we
             | have a different view we would need substantive arguments
             | to defend our view.
             | 
              | You will need to do that anyway, whether Yoshua is on the
              | paper or not. I understand that people have limited
              | bandwidth, so they need shortcuts, and they need to
              | justify these shortcuts to themselves somehow (of course
              | the justifications are nonsense). Maybe AI will help here.
        
       | imjonse wrote:
       | To their credit, the authors (Y. Bengio among them) end the paper
       | with the question, not suggesting they know the answer. These
       | models are very small even by academic standards so any finding
       | would not necessarily extend to current LLM scales. The main
       | conclusion is that RNN class networks can be trained as
       | efficiently as modern alternatives but the resulting performance
       | is only competitive at small scale.
        
         | phkahler wrote:
         | >> These models are very small even by academic standards so
         | any finding would _not necessarily_ extend to current LLM
         | scales.
         | 
         | Emphasis on not necessarily.
         | 
         | >> The main conclusion is that RNN class networks can be
         | trained as efficiently as modern alternatives but the resulting
         | performance is only competitive at small scale.
         | 
         | Shouldn't the conclusion be "the resulting competitive
         | performance has only been confirmed at small scale"?
        
       | xnx wrote:
        | It's a curse and a blessing that discussion of topics happens in so
       | many different places. I found this comment on Twitter/X
       | interesting: https://x.com/fchollet/status/1841902521717293273
       | 
       | "Interesting work on reviving RNNs.
       | https://arxiv.org/abs/2410.01201 -- in general the fact that
       | there are many recent architectures coming from different
       | directions that roughly match Transformers is proof that
       | architectures aren't fundamentally important in the curve-fitting
       | paradigm (aka deep learning)
       | 
       | Curve-fitting is about embedding a dataset on a curve. The
       | critical factor is the dataset, not the specific hard-coded bells
       | and whistles that constrain the curve's shape. As long as your
       | curve is sufficiently expressive all architectures will converge
       | to the same performance in the large-data regime."
        
         | islewis wrote:
         | > "As long as your curve is sufficiently expressive all
         | architectures will converge to the same performance in the
         | large-data regime."
         | 
         | I haven't fully ingested the paper yet, but it looks like it's
         | focused more on compute optimization than the size of the
         | dataset:
         | 
         | > ... and (2) are fully parallelizable during training (175x
         | faster for a sequence of length 512
         | 
         | Even if many types of architectures converge to the same loss
         | over time, finding the one that converges the fastest is quite
         | valuable given the cost of running GPU's at scale.
        
           | teruakohatu wrote:
           | > Even if many types of architectures converge to the same
           | loss over time, finding the one that converges the fastest is
           | quite valuable given the cost of running GPU's at scale.
           | 
           | This! Not just fastest but with the lowest resources in
           | total.
           | 
            | Fully connected neural networks are universal function
            | approximators. Technically we don't need anything but an
            | FNN, but memory requirements and speed would be abysmal,
            | far beyond the realm of practicality.
        
             | actionfromafar wrote:
             | Unless we could build chips in 3D?
        
           | byearthithatius wrote:
           | > finding the one that converges the fastest is quite
           | valuable given the cost of running GPU's at scale
           | 
            | Not to him; he runs the ARC challenge. He wants a new
            | approach entirely: something capable of few-shot learning on
            | out-of-distribution patterns... somehow.
        
         | acchow wrote:
         | What it will come down to is computational efficiencies. We
         | don't want to retrain once a month - we want to retrain
         | continuously. We don't want one agent talking to 5 LLMs. We
         | want thousands of LLMs all working in concert.
        
           | ActorNightly wrote:
            | This, and also the way models are trained has to be
            | rethought. Backprop is good for figuring out complex
            | function mappings, but not for storing information.
        
         | fsndz wrote:
         | after reading this paper, I am now convinced we will need more
         | than curve fitting to build
          | AGI: https://medium.com/@fsndzomga/there-will-be-no-
         | agi-d9be9af44...
        
           | swolchok wrote:
           | paper is paywalled; just logging into Medium won't do it
        
             | fsndz wrote:
             | sorry for the paywall, you can read the free version here:
             | https://www.lycee.ai/blog/why-no-agi-openai
        
           | xpl wrote:
           | I would like to read it, but it's under a paywall.
        
             | alwa wrote:
             | https://archive.is/nGaiU
        
           | ahzhou wrote:
           | Author: @fandzomga Username: fsndz
           | 
           | Why try to funnel us to your paywalled article?
        
           | josh-sematic wrote:
           | One reason why I'm excited about o1 is that it seems like
           | OpenAI have cracked the nut of effective RL during training
           | time, which takes us out of the domain of just fitting to the
           | curve of "what a human would have said next." I just finished
           | writing a couple blog posts about this; the first [1] covers
           | some problems with that approach and the second [2] talks
           | about what alternatives might look like.
           | 
           | [1] https://www.airtrain.ai/blog/how-openai-o1-changes-the-
           | llm-t... [2] https://www.airtrain.ai/blog/how-
           | openai-o1-changes-the-llm-t...
        
           | vineyardmike wrote:
           | TLDR: "statistically fitting token output is not the same as
           | human intelligence, and human intelligence and AGI are
           | contradictory anyways (because humans make mistakes)"
           | 
           | Saved you the paywall click to the poorly structured medium
           | article :)
        
           | acchow wrote:
           | > After reading this paper, I am now
           | 
           | Is this your paper?
        
         | Lerc wrote:
         | I remember one of the initial transformer people saying in an
         | interview that they didn't think this was the "one true
         | architecture" but a lot of the performance came from people
         | rallying around it and pushing in the one direction.
         | 
          | On the other hand, while _"As long as your curve is
         | sufficiently expressive all architectures will converge to the
         | same performance in the large-data regime."_ is true, a
         | sufficiently expressive mechanism may not be computationally or
         | memory efficient. As both are constraints on what you can
         | actually build, it's not whether the architecture can produce
         | the result, but whether a feasible/practical instantiation of
         | that architecture can produce the result.
        
         | ants_everywhere wrote:
         | > is proof that architectures aren't fundamentally important in
         | the curve-fitting paradigm (aka deep learning)
         | 
         | (Somewhat) fun and (somewhat) related fact: there's a whole
         | cottage industry of "is all you need" papers
         | https://arxiv.org/search/?query=%22is+all+you+need%22&search...
        
           | TaurenHunter wrote:
           | Reminds me of the "Considered Harmful" articles:
           | 
           | https://meyerweb.com/eric/comment/chech.html
        
             | jprete wrote:
             | I wonder if there's something about tech culture - or tech
             | people - that encourages them to really, really like
             | snowclones.
        
               | observationist wrote:
               | Yes. Do stuff that other people have been successful
               | doing. Monkey see, monkey do - it's not a tech people
               | thing, it's a human thing.
               | 
               | Tech just happens to be most on display at the moment -
               | because tech people are building the tools and the
               | parameters and the infrastructure handling all our
               | interactions.
        
         | wongarsu wrote:
         | One big thing that bells and whistles do is limit the training
         | space.
         | 
         | For example when CNNs took over computer vision that wasn't
         | because they were doing something that dense networks couldn't
         | do. It was because they removed a lot of edges that didn't
         | really matter, allowing us to spend our training budget on
          | deeper networks. Similarly, transformers are great because
          | they allow us to train gigantic networks somewhat efficiently.
          | And this paper finds that if we make RNNs a lot faster to
          | train, they are actually pretty good. Training speed and
          | efficiency remain the big bottleneck, not the actual
          | expressiveness of the architecture.
        
         | dheera wrote:
         | I mean, transformer-based LLMs are RNNs, just really really
         | really big ones with very wide inputs that maintain large
         | amounts of context.
        
           | immibis wrote:
           | No. An RNN has an arbitrarily-long path from old inputs to
           | new outputs, even if in practice it can't exploit that path.
           | Transformers have fixed-size input windows.
        
             | og_kalu wrote:
              | You can't have a fixed state and an arbitrarily long path
              | from the input. Well, you can, but it's meaningless,
              | because you fundamentally cannot keep stuffing information
              | of arbitrary length into a fixed state. RNNs effectively
              | have fixed-size input windows.
        
               | immibis wrote:
                | The _path_ is arbitrarily _long_, not wide. It is
                | _possible_ for an RNN to be made that remembers the first
                | word of the input, no matter how long the input is. This
                | is not possible with a transformer, so we know they are
                | fundamentally different.
        
               | quotemstr wrote:
               | But an RNN isn't _going_ to remember the first token of
                | input. It won't know until it sees the last token
               | whether that first token was relevant after all, so it
               | has to learn token-specific update rules that let it
               | guess how long to hold what kinds of information. (In
               | multi-layer systems, the network uses ineffable
               | abstractions rather than tokens, but the same idea
               | applies.)
               | 
               | What the RNN must be doing reminds me of "sliding window
               | attention" --- the model learns how to partition its
               | state between short- and long-range memories to minimize
               | overall loss. The two approaches seem related, perhaps
               | even equivalent up to implementation details.
        
               | OkayPhysicist wrote:
                | The most popular RNNs (the ones that were successful
                | enough for Google Translate and the like) actually had
                | this behavior baked into the architecture: LSTMs,
                | "Long Short-Term Memory" networks.
        
             | dheera wrote:
             | A chunk of the output still goes into the transformer
             | input, so the arbitrarily-long path still exists, it just
             | goes through a decoding/encoding step.
        
         | quantadev wrote:
         | Most LLMs aren't even using a "curve" yet at all, right? All
         | they're using is a series of linear equations because the model
         | weights are a simple multiply and add (i.e. basic NN
         | Perceptron). Sure there's a squashing function on the output to
         | keep it in a range from 0 to 1 but that's done BECAUSE we're
         | just adding up stuff.
         | 
         | I think probably future NNs will be maybe more adaptive than
         | this perhaps where some Perceptrons use sine wave functions, or
         | other kinds of math functions, beyond just linear "y=mx+b"
         | 
         | It's astounding that we DID get the emergent intelligence from
         | just doing this "curve fitting" onto "lines" rather than actual
         | "curves".
        
           | OkayPhysicist wrote:
           | The "squashing function" necessarily is nonlinear in
           | multilayer nueral networks. A single layer of a neural
           | network can be quite simply written a weight matrix, times an
           | input vector, equalling an output vector, like so
           | 
           | Ax = y
           | 
           | Adding another layer is just multiplying a different set of
           | weights times the output of the first, so
           | 
            | B(Ax) = y
           | 
           | If you remember your linear algebra course, you might see the
           | problem: that can be simplified
           | 
           | (BA)x = y
           | 
           | Cx = y
           | 
           | Completely indistinguishable from a single layer, thus only
           | capable of modeling linear relationships.
           | 
           | To prevent this collapse, a non linear function must be
           | introduced between each layer.
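            | 
            | A quick numeric check of the collapse (illustration only):
            | 
            |     import torch
            | 
            |     A = torch.randn(5, 3)   # first "layer"
            |     B = torch.randn(4, 5)   # second "layer"
            |     x = torch.randn(3)
            | 
            |     # stacking two linear maps is exactly one linear map C = B @ A
            |     assert torch.allclose(B @ (A @ x), (B @ A) @ x, atol=1e-5)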
        
             | quantadev wrote:
             | Right. All the squashing is doing is keeping the output of
             | any neuron in a range of below 1.
             | 
             | But the entire NN itself (Perceptron ones, which most LLMs
             | are) is still completely using nothing but linearity to
             | store all the knowledge from the training process. All the
             | weights are just an 'm' in the basic line equation
             | 'y=m*x+b'. The entire training process does nothing but
             | adjust a bunch of slopes of a bunch of lines. It's totally
             | linear. No non-linearity at all.
        
               | nazgul17 wrote:
                | The nonlinearities are fundamental. Without them, any
                | arbitrarily deep NN is equivalent to a shallow NN (easily
                | computable, as GP was saying), and we know those can't
                | even solve the XOR problem.
                | 
                | > nothing but linearity
                | 
                | No, if you have nonlinearities, the NN itself is _not_
                | linear. The nonlinearities are not there primarily to
                | keep the outputs in a given range, though that's
                | important, too.
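                | 
                | For instance, a tiny fixed two-layer net with a ReLU in
                | between computes XOR, which no purely linear model can
                | (the weights below are the classic hand-built
                | construction, not learned):
                | 
                |     import torch
                | 
                |     def xor_net(x):  # x: float tensor of shape (2,)
                |         W1 = torch.tensor([[1., 1.], [1., 1.]])
                |         b1 = torch.tensor([0., -1.])
                |         w2 = torch.tensor([1., -2.])
                |         h = torch.relu(W1 @ x + b1)   # the nonlinearity
                |         return w2 @ h
                | 
                |     for a in (0., 1.):
                |         for b in (0., 1.):
                |             print(a, b, xor_net(torch.tensor([a, b])).item())
                |     # prints 0, 1, 1, 0 for the four input pairs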
        
         | sakras wrote:
         | I figured this was pretty obvious given that MLPs are universal
         | function approximators. A giant MLP could achieve the same
         | results as a transformer. The problem is the scale - we can't
         | train a big enough MLP. Transformers are a performance
         | optimization, and that's why they're useful.
        
       | m11a wrote:
       | It'd be nice to see more of how this compares to Mamba. Looks
       | like, in performance, they're not leagues apart and it's just a
       | _different_ architecture, not necessarily better or worse?
        
       | dsamarin wrote:
       | The name of the paper contrasts with the paper that spawned
       | Transformer architecture, which itself is a reference to the song
       | "All You Need Is Love" by the Beatles.
       | https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
        
         | vundercind wrote:
         | I eagerly await the backlash to suggesting any one thing is all
         | you need, the first shot of which shall surely be titled: "'All
         | you need' Considered Harmful"
        
           | ants_everywhere wrote:
           | Surely the universe is all you need though
        
       | marcosdumay wrote:
       | R == Recurrent
       | 
        | From theory, the answer to the question should be "yes": they
        | are Turing complete.
       | 
       | The real question is about how to train them, and the paper is
       | about that.
        
         | baanist wrote:
         | Why aren't AI researchers automating the search for efficient
         | architectures?
        
           | ks2048 wrote:
           | https://en.wikipedia.org/wiki/Neural_architecture_search
        
           | kelseyfrog wrote:
            | The search space is far too wide and difficult to
            | parameterize, and there is a wide gap between effective and
            | ineffective architectures - i.e., a very small change can
            | make a network effectively DOA.
        
             | hedgehog wrote:
             | Notably architecture search was popular for small vision
             | nets where the cost of many training runs was low enough. I
             | suspect some of the train-then-prune approaches will come
             | back, but even there only by the best funded teams.
        
           | ActorNightly wrote:
            | There has been some work, but the problem is that it's such
            | a massive search space. Philosophically speaking, if you
            | look at how humans came into existence, you could make an
            | argument that the process of evolution from basic lifeforms
            | can be represented as one giant computation per minute
            | across all of Earth, where genetic selection happens and
            | computation proceeds to the next minute. That's a fuckload
            | of compute.
           | 
           | In more practical terms, you would imagine that an advanced
           | model contains some semblance of a CPU to be able to truly
           | reason. Given that CPUs can be all NAND gates (which take 2
           | neurons to represent), and are structured in a recurrent way,
           | you fundamentally have to rethink how to train such a
           | network, because backprop obviously won't work to capture
           | things like binary decision points.
        
             | baanist wrote:
             | I thought the whole point of neural networks was that they
             | were good at searching through these spaces. I'm pretty
             | sure OpenAI is pruning their models behind the scenes to
             | reduce their costs because that's the only way they can
             | keep reducing the cost per token. So their secret sauce at
             | this point is whatever pruning AI they're using to whittle
             | the large computation graphs into more cost efficient
             | consumer products.
        
           | slashdave wrote:
           | Because no one knows how to iterate over all possible
           | architectures? Heck, there are certainly entire classes of
           | architectures that no one has even considered.
        
         | jjtheblunt wrote:
         | What are you saying is Turing-complete?
        
           | baanist wrote:
            | Neural networks are Turing complete, i.e. there is a
            | universal neural network that can compute any effectively
            | computable function [1]. Incidentally, when this is combined
            | with Rice's theorem [2], it means that safety research is
            | essentially an unsolvable problem, because any sufficiently
            | complex neural network, e.g. one that can simulate a Turing
            | machine, will have non-trivial properties which cannot be
            | predicted with finite computation.
           | 
           | 1: https://www.sciencedirect.com/science/article/pii/08939659
           | 91...
           | 
           | 2: https://en.wikipedia.org/wiki/Rice%27s_theorem?useskin=vec
           | to...
        
       | logicchains wrote:
        | The model in the paper isn't a "real" RNN, due to the changes
        | that make it parallelizable, for the same reasons described in
        | https://arxiv.org/abs/2404.08819 , and hence is theoretically
        | less powerful than a "real" RNN (it struggles at some classes of
        | problems that RNNs traditionally excel at). On the other hand,
        | https://arxiv.org/abs/2405.04517 contains a "real" RNN component,
        | which demonstrates a significant improvement on the kind of
        | state-tracking problems that transformers struggle with.
        
         | robertsdionne wrote:
          | These are real RNNs: they still depend on the prior hidden
          | state; it's just that the gating does not. The basic RNN
          | update can then be parallelized with parallel prefix scan
          | algorithms.
        
       | bob1029 wrote:
       | > Transformers required ~2.5x more training steps to achieve
       | comparable performance, overfitting eventually.
       | 
       | > RNNs are particularly suitable for sequence modelling settings
       | such as those involving time series, natural language processing,
       | and other sequential tasks where context from previous steps
       | informs the current prediction.
       | 
       | I would like to draw an analogy to digital signal processing. If
       | you think of the recurrent-style architectures as IIR filters and
       | feedforward-only architectures as FIR filters, you will likely
       | find many parallels.
       | 
       | The most obvious to me being that IIR filters typically require
       | far fewer elements to produce the same response as an equivalent
       | FIR filter. Granted, the FIR filter is often easier to
       | implement/control/measure in practical terms (fixed-point
       | arithmetic hardware == ML architectures that can run on GPUs).
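        | 
        | A toy version of that analogy (my own sketch, not from the
        | paper): a one-pole IIR smoother needs a single coefficient and
        | one state value, while an FIR filter needs many taps to
        | approximate the same exponentially decaying impulse response.
        | 
        |     import numpy as np
        | 
        |     a = 0.9
        |     x = np.random.randn(256)
        | 
        |     # IIR: y[n] = a * y[n-1] + (1 - a) * x[n]
        |     y_iir = np.zeros_like(x)
        |     for n in range(len(x)):
        |         y_iir[n] = a * (y_iir[n - 1] if n else 0.0) + (1 - a) * x[n]
        | 
        |     # FIR: truncate the impulse response h[k] = (1 - a) * a**k to K taps
        |     K = 64
        |     h = (1 - a) * a ** np.arange(K)
        |     y_fir = np.convolve(x, h)[:len(x)]
        | 
        |     print(np.max(np.abs(y_iir - y_fir)))  # small only because K is large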
       | 
       | I don't think we get to the exponential scary part of AI without
       | some fundamentally recurrent architecture. I think things like
       | LSTM are kind of an in-between hack in this DSP analogy - You
       | could look at it as FIR with dynamic coefficients. Neuromorphic
       | approaches seem like the best long term bet to me in terms of
       | efficiency.
        
       | PunchTornado wrote:
        | To me this is further evidence that these LLMs learn only to
        | speak English, but that there is no reasoning at all in them.
        | If you can simplify this much and obtain the same results,
        | while we know how complex the brain is, that suggests something
        | other than reasoning is going on.
        
         | quantadev wrote:
         | Every LLM expert on the planet agrees LLMs are doing
         | "reasoning". No one says they have feelings or qualia, but we
         | all know there's definitely genuinely artificial reasoning
         | happening.
         | 
         | What LLMs have shown both Neuroscience and Computer Science is
         | that reasoning is a mechanical process (or can be simulated by
         | mechanical processes) and is not purely associated only with
         | consciousness.
        
       | adamnemecek wrote:
       | Yes, all machine learning can be interpreted in terms of
       | approximating the partition function.
       | 
       | This is obvious when one considers the connections between
       | Transformers, RNNs, Hopfield networks and the Ising model, a
       | model from statistical mechanics which is solved by calculating
       | the partition function.
       | 
       | This interpretation provides us with some very powerful tools
       | that are commonplace in math and physics but which are not talked
       | about in CS & ML.
       | 
       | I'm working on a startup http://traceoid.ai which takes this
       | exact view. Our approach enables faster training and inference,
       | interpretability and also scalable energy-based models, the Holy
       | Grail of machine learning.
       | 
       | Join the discord https://discord.com/invite/mr9TAhpyBW or follow
       | me on twitter https://twitter.com/adamnemecek1
        
       | mkaic wrote:
       | I strongly enjoy the simplicity of their "minGRU" architecture.
        | It's basically just:
        | 
        |     import torch
        |     from torch import nn
        | 
        |     class MinGRU(nn.Module):
        |         def __init__(self, token_size, hidden_state_size):
        |             super().__init__()
        |             self.token_to_proposal = nn.Linear(token_size, hidden_state_size)
        |             self.token_to_mix_factors = nn.Linear(token_size, hidden_state_size)
        | 
        |         def forward(self, previous_hidden_state, current_token):
        |             proposed_hidden_state = self.token_to_proposal(current_token)
        |             mix_factors = torch.sigmoid(self.token_to_mix_factors(current_token))
        |             # convex mix of the proposal and the previous state:
        |             # lerp(a, b, w) = a + w * (b - a)
        |             return torch.lerp(proposed_hidden_state, previous_hidden_state, mix_factors)
       | 
       | And since the proposed hidden states and mix factors for each
       | layer are both only dependent on the current token, you can
       | compute all of them in parallel if you know the whole sequence
       | ahead of time (like during training), and then combine them in
       | linear time using parallel scan.
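        | 
        | A minimal sketch of why that works (not the paper's
        | implementation): each step is the affine map h -> a*h + b, with
        | a = mix_factors and b = (1 - mix_factors) * proposed_hidden_state,
        | and composing affine maps is associative, so the combine can be
        | done as a balanced tree (or a full prefix scan) instead of a
        | left-to-right loop.
        | 
        |     import torch
        | 
        |     def combine(earlier, later):
        |         # compose h -> a1*h + b1 (earlier) with h -> a2*h + b2 (later)
        |         a1, b1 = earlier
        |         a2, b2 = later
        |         return a1 * a2, a2 * b1 + b2
        | 
        |     T, d = 8, 4
        |     a, b = torch.rand(T, d), torch.rand(T, d)
        | 
        |     # sequential reference, starting from h0 = 0
        |     h = torch.zeros(d)
        |     for t in range(T):
        |         h = a[t] * h + b[t]
        | 
        |     # associativity lets us reduce in a balanced tree (O(log T) depth);
        |     # a full parallel scan additionally yields every intermediate h_t
        |     pairs = [(a[t], b[t]) for t in range(T)]
        |     while len(pairs) > 1:
        |         nxt = [combine(pairs[i], pairs[i + 1])
        |                for i in range(0, len(pairs) - 1, 2)]
        |         if len(pairs) % 2:
        |             nxt.append(pairs[-1])
        |         pairs = nxt
        |     A, B = pairs[0]
        |     assert torch.allclose(A * torch.zeros(d) + B, h)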
       | 
       | The fact that this is competitive with transformers and state-
       | space models in their small-scale experiments is gratifying to
       | the "best PRs are the ones that delete code" side of me. That
       | said, we won't know for sure if this is a capital-B Breakthrough
       | until someone tries scaling it up to parameter and data counts
       | comparable to SOTA models.
       | 
       | One detail I found really interesting is that they seem to do all
       | their calculations in log-space, according to the Appendix. They
       | say it's for numerical stability, which is curious to me--I'm not
       | sure I have a good intuition for why running everything in log-
       | space makes the model more stable. Is it because they removed the
       | tanh from the output, making it possible for values to explode if
       | calculations are done in linear space?
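        | 
        | One guess at the intuition: the scan multiplies long chains of
        | gate values in (0, 1), and in float32 that product underflows
        | to zero, while the equivalent sum of logs stays finite. For
        | example:
        | 
        |     import torch
        | 
        |     gates = torch.full((2000,), 0.9)
        |     print(torch.cumprod(gates, dim=0)[-1])        # 0.0 (underflow)
        |     print(torch.cumsum(gates.log(), dim=0)[-1])   # approx -210.7, still finite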
       | 
       | EDIT: Another thought--it's kind of fascinating that this sort of
       | sequence modeling works at all. It's like if I gave you all the
       | pages of a book individually torn out and in a random order, and
       | asked you to try to make a vector representation for each page as
       | well as instructions for how to mix that vector with the vector
       | representing all previous pages -- except you have zero knowledge
       | of those previous pages. Then, I take all your page vectors,
       | sequentially mix them together in-order, and grade you based on
       | how good of a whole-book summary the final vector represents.
       | Wild stuff.
       | 
        | FURTHER EDIT: Yet _another_ thought--right now, they're just
       | using two dense linear layers to transform the token into the
       | proposed hidden state and the lerp mix factors. I'm curious what
       | would happen if you made those transforms MLPs instead of
       | singular linear layers.
        
         | immibis wrote:
         | This architecture, on the surface, seems to preclude the basic
         | function of recognizing sequences of tokens. At the very least,
         | it seems like it should suffer from something like the pumping
         | lemma: if [the ][cat ][is ][black ] results in the output
         | getting close to a certain vector, [the ][cat ][is ][black
         | ][the ][cat ][is ][black ][the ][cat ][is ][black ] should get
         | even closer to that vector and nowhere close to a "why did you
         | just repeat the same sentence three times" vector? Without non-
         | linear mixing between input token and hidden state, there will
         | be a lot of linear similarities between similar token
         | sequences...
        
           | mkaic wrote:
           | Counterpoint: the hidden state at the beginning of
           | ([the][cat][is][black]) x 3 is (probably) initialized to all
           | zeros, but after seeing those first 4 tokens, it will _not_
           | be all zeros. Thus, going into the second repetition of the
           | sentence, the model has a different initial hidden state, and
           | should exhibit different behavior. I think this makes it
           | possible for the model to learn to recognize repeated
           | sequences and avoid your proposed pitfall.
        
         | slashdave wrote:
         | Log space is important if the token probabilities span a large
         | range of values (powers). There is a reason that maximum
         | likelihood fitting is always performed with log likelihoods.
        
       | trott wrote:
       | My feeling is that the answer is "no", in the sense that these
       | RNNs wouldn't be able to universally replace Transformers in
       | LLMs, even though they might be good enough in some cases and
       | beat them in others.
       | 
       | Here's why.
       | 
       | A user of an LLM _might_ give the model some long text and then
       | say  "Translate this into German please". A Transformer can look
       | back at its whole history. But what is an RNN to do? While the
       | length of its context is unlimited, the amount of information the
       | model retains about it is bounded by whatever is in its hidden
       | state at any given time.
       | 
       | Relevant: https://arxiv.org/abs/2402.01032
        
         | mkaic wrote:
         | The counterargument here is that you can just scale the size of
         | the hidden state sufficiently such that it can hold compressed
         | representations of whatever-length sequence you like.
         | Ultimately, what I care about is whether RNNs could compete
         | with transformers if FLOPs are held constant--something TFA
         | doesn't really investigate.
        
           | psb217 wrote:
           | Well, that's what Transformer already does... One problem
           | with the scaling you're describing is that there would be a
           | massive amount of redundant information stored in hidden
           | activations during training the RNN. The hidden state at each
           | time step t in the sequence would need to contain all info
           | that (i) could be useful for predicting the token at time t
           | and (ii) that could be useful for predicting tokens at times
            | >t. (i) is obvious and (ii) holds because all information about
           | the past is transferred to future predictions through the
           | current hidden state. In principle, Transformers can avoid
           | storing redundant info in multiple hidden states at the cost
           | of having to maintain and access (via attention) a larger
           | hidden state at test/eval time.
        
             | mkaic wrote:
             | > there would be a massive amount of redundant information
             | stored in hidden activations
             | 
             | Is there a way to prove this? One potential caveat that
             | comes to mind for me is that perhaps the action of lerping
             | between the old state and the new could be used by the
             | model to perform semantically meaningful transformations on
             | the old state. I guess in my mind it just doesn't seem
             | obvious that the hidden state is necessarily a collection
             | of "redundant information" -- perhaps the information is
             | culled/distilled the further along in the sequence you go?
             | There will always be _some_ redundancy, sure, but I don 't
             | think that such redundancy necessarily means we _have_ to
             | use superlinear methods like attention.
        
         | phkahler wrote:
         | >> A user of an LLM might give the model some long text and
         | then say "Translate this into German please". A Transformer can
         | look back at its whole history.
         | 
          | Which isn't necessary. If you say "translate the following to
          | German" instead, all it needs to do is remember the task at
          | hand and a much smaller amount of recent input. Well, that and
          | the ability to output in parallel with processing input.
        
           | og_kalu wrote:
           | It's necessary for arbitrary information processing if you
           | can forget and have no way to "unforget".
           | 
           | A model can decide to forget something that turns out to be
              | important for some future prediction. A human can go back
              | and re-read/listen, etc. A transformer is always
              | re-reading, but an RNN can't and is fucked.
        
             | magicalhippo wrote:
              | That's just because we twisted its arm. One could for
              | example feed the reversed input after, e.g. abc|cba, where
              | | is a special token. That would allow it to react to any
              | part of the message.
        
           | trott wrote:
           | People did something similar to what you are describing 10
           | years ago: https://arxiv.org/abs/1409.0473
           | 
           | But it's trained on translations, rather than the whole
           | Internet.
        
           | DoctorOetker wrote:
           | Also, a lightweight network could do a first pass to identify
           | tasks, instructions, constraints etc, and then a second pass
           | could use the RNN.
           | 
           | Consider the flood fill algorithm or union-find algorithm,
           | which feels magical upon first exposure.
           | 
           | https://en.wikipedia.org/wiki/Hoshen%E2%80%93Kopelman_algori.
           | ..
           | 
           | Having 2 passes can enable so much more than a single pass.
           | 
           | Another alternative could be to have a first pass make notes
           | in a separate buffer while parsing the input. The bandwidth
           | of the note taking and reading can be much much lower than
           | that required for fetching the billions of parameters.
        
         | slashdave wrote:
         | > the amount of information the model retains about it is
         | bounded by whatever is in its hidden state
         | 
         | This is no different than a transformer, which, after all, is
         | bound by a finite state, just organized in a different manner.
        
       | fhdsgbbcaA wrote:
       | We really need a [preprint] flag for unreviewed papers.
        
       | limapedro wrote:
        | This is such an interesting paper; sadly, they don't have big
        | models. I'd like to see a model trained on TinyStories or even
        | C4, since it should be faster than the transformer variant, and
        | see how it compares.
        
       | charlescurt123 wrote:
       | I find the entire field lacking when it comes to long-horizon
       | problems. Our current, widely used solution is to scale, but
       | we're nowhere near achieving the horizon scales even small mammal
       | brains can handle. Our models can have trillions of parameters,
       | yet a mouse brain would still outperform them on long-horizon
       | tasks and efficiency. It's something small, simple, and elegant--
       | an incredible search algorithm that not only finds near-optimal
       | routes but also continuously learns on a fixed computational
       | budget.
       | 
       | I'm honestly a bit envious of future engineers who will be
       | tackling these kinds of problems with a 100-line Jupyter notebook
       | on a laptop years from now. If we discovered the right method or
       | algorithm for these long-horizon problems, a 2B-parameter model
       | might even outperform current models on everything except short,
       | extreme reasoning problems.
       | 
       | The only solution I've ever considered for this is expanding a
       | model's dimensionality over time, rather than focusing on perfect
       | weights. The higher dimensionality you can provide to a model,
       | the greater its theoretical storage capacity. This could resemble
       | a two-layer model--one layer acting as a superposition of
       | multiple ideal points, and the other layer knowing how to use
       | them.
       | 
       | When you think about the loss landscape, imagine it with many
       | minima for a given task. If we could create a method that
       | navigates these minima by reconfiguring the model when needed, we
       | could theoretically develop a single model with near-infinite
       | local minima--and therefore, higher-dimensional memory. This may
       | sound wild, but consider the fact that the human brain
       | potentially creates and disconnects thousands of new connections
       | in a single day. Could it be that these connections steer our
       | internal loss landscape between different minima we need
       | throughout the day?
        
       ___________________________________________________________________
       (page generated 2024-10-03 23:00 UTC)