[HN Gopher] Were RNNs all we needed?
       ___________________________________________________________________
        
       Were RNNs all we needed?
        
       Author : beefman
       Score  : 470 points
        Date   : 2024-10-03 17:31 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | tehsauce wrote:
       | I haven't gone through the paper in detail yet but maybe someone
       | can answer. If you remove the hidden state from an rnn as they
       | say they've done, what's left? An mlp predicting from a single
       | token?
        
         | statusfailed wrote:
         | I only had a quick look, but it looks like they tweaked the
         | state update so the model can be run with parallel scan instead
         | of having to do it sequentially.
        
         | jfcoa wrote:
         | It doesn't completely remove it, it removes certain
         | dependencies on it so that it can be computed by parallel scan,
         | there is still a hidden state. It bears some similarity to what
         | was done with Mamba.
        
         | bunderbunder wrote:
         | They didn't remove the hidden state entirely, they just removed
         | it from the input, forget and update gates. I haven't digested
         | the paper either, but I think that in the case of a GRU this
         | means that the hidden state update masking (z_t and r_t in the
         | paper's formulas) only depends on the new input, not the input
         | plus the prior hidden state.
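          | 
          | If I'm reading it right, the per-step update is roughly the
          | following (my own numpy paraphrase, not the paper's code;
          | the names are made up, and I've dropped the reset gate for
          | brevity):
          | 
          |     import numpy as np
          | 
          |     def sigmoid(x):
          |         return 1.0 / (1.0 + np.exp(-x))
          | 
          |     # minGRU-style step: the gate and the candidate state
          |     # are computed from x_t alone, not from h_prev, unlike
          |     # a standard GRU.
          |     def min_gru_step(x_t, h_prev, Wz, bz, Wh, bh):
          |         z = sigmoid(Wz @ x_t + bz)    # update gate
          |         h_tilde = Wh @ x_t + bh       # candidate state
          |         return (1.0 - z) * h_prev + z * h_tilde
          | 
          | Since h_prev only enters that last line, and only linearly,
          | the whole sequence of updates can be computed with a
          | parallel scan instead of a strictly sequential loop.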
        
         | _0ffh wrote:
         | The trick is to make sure the recursive dependency stays
         | linear, that's how you enable parallel training.
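          | 
          | Concretely, a rough sketch of why linearity helps (toy
          | Python, not the paper's implementation): if each step is
          | h_t = a_t * h_{t-1} + b_t with a_t and b_t computed from
          | the inputs alone, then composing two steps is associative,
          | so a parallel prefix scan can replace the sequential loop.
          | 
          |     import numpy as np
          |     from functools import reduce
          | 
          |     # (a1, b1) then (a2, b2) composes to one affine step:
          |     # h2 = a2*(a1*h0 + b1) + b2 = (a1*a2)*h0 + (a2*b1 + b2)
          |     def combine(s1, s2):
          |         a1, b1 = s1
          |         a2, b2 = s2
          |         return a1 * a2, a2 * b1 + b2
          | 
          |     a = np.random.rand(8)
          |     b = np.random.rand(8)
          | 
          |     h = 0.0                      # sequential reference
          |     for a_t, b_t in zip(a, b):
          |         h = a_t * h + b_t
          | 
          |     A, B = reduce(combine, zip(a, b))
          |     assert np.isclose(h, A * 0.0 + B)
          | 
          | reduce() is still sequential here, but because combine() is
          | associative the same merges can be arranged as a balanced
          | tree, i.e. a parallel scan, which is where the training
          | speedup comes from.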
        
       | hydrolox wrote:
       | Betteridge's law of headlines?
        
         | woah wrote:
         | For paper titles, the law is that the answer is always "yes"
        
           | bunderbunder wrote:
           | Not always, I think?
           | 
           | Opinions probably differ, for example, on John Backus's paper
           | "Can programming be liberated from the Von Neumann style?"
           | Many fans of functional programming would say the answer is
           | yes, but Backus himself expressed less enthusiasm in
           | interviews later in his life.
           | 
           | I think the important point, though, is that academic papers
            | and newspaper articles are _not the same_, and titles in the
           | form of questions function differently in the two domains.
           | Journalists tend to use titles like these to dissemble and
           | sensationalize. When academics use these kinds of titles for
           | peer-reviewed articles, it's because they really are asking
           | an honest question. Backus was doing it in his paper. The
           | authors of this paper are doing the same. They end the paper
           | by re-iterating the question before launching into a
           | discussion of the limitations that prevent them from reaching
           | any firm conclusions on the answer to this question.
        
           | nephanth wrote:
           | More like "we aren't sure, but we have good reasons not to
           | exclude the possibility"
        
       | hiddencost wrote:
       | Note Yoshua Bengio in the author list. This shouldn't be taken
       | lightly.
        
         | auggierose wrote:
         | And this is where science breaks down.
        
           | hotspot_one wrote:
            | Not really, because:
            | 
            | 1) Yoshua's reputation would take a hit if this paper were
            | bullshit, so he has extrinsic motivation to make it good.
            | 
            | 2) Yoshua has enough experience to know what is going on in
            | the field; you don't have to ask whether he forgot about a
            | certain architecture or the work of a certain research
            | group that would contradict his findings -- if such work
            | exists and is credible, it is very likely to be discussed
            | in the paper.
            | 
            | 3) This test answers something a leader in the field thinks
            | is important enough to work on, else he wouldn't be
            | involved.
           | 
           | Also note, the poster said the paper shouldn't be taken
           | lightly. That doesn't mean we need to take it blindly. It
           | only means we cannot dismiss it out of hand, if we have a
           | different view we would need substantive arguments to defend
           | our view.
           | 
           | I've overturned the field leader several times in science,
           | but that's only because I acknowledged what they got right
           | and that they were indeed the person who got it right.
        
             | DAGdug wrote:
             | " I've overturned the field leader several times in
             | science" Either that makes you a field leader yourself, or
             | you did it for trivial things, or you're BSing. Which one
             | is it?
        
               | exe34 wrote:
               | there's a big space between leader and trivial. it's
               | entirely possible to point out the top leader in your
               | field is wrong on ten things over a career, without
               | becoming the top leader yourself.
        
               | DAGdug wrote:
               | On speculative things or trivial things, sure! On
               | substantive matters (recall: the choice of words is
               | "overturned"), in empirical realms or theory (physics,
               | CS) or math, it's rather doubtful. Anonymous, self-
               | declared geniuses aren't to be taken at face value.
        
               | exe34 wrote:
               | > Anonymous, self-declared geniuses aren't to be taken at
               | face value.
               | 
               | no, that would be a grievous mistake on an anonymous
               | site.
        
             | auggierose wrote:
             | > It only means we cannot dismiss it out of hand, if we
             | have a different view we would need substantive arguments
             | to defend our view.
             | 
              | You will need to do that anyway, whether Yoshua is on the
              | paper or not. I understand that people have limited
             | bandwidth, and so they need shortcuts, and they need to
             | justify these shortcuts to themselves somehow (of course
             | the justifications are nonsense). Maybe AI will help here.
        
         | _giorgio_ wrote:
          | Who cares. Look at Geoffrey Hinton right now. Do you trust
          | him? :-D
        
       | imjonse wrote:
       | To their credit, the authors (Y. Bengio among them) end the paper
       | with the question, not suggesting they know the answer. These
       | models are very small even by academic standards so any finding
       | would not necessarily extend to current LLM scales. The main
       | conclusion is that RNN class networks can be trained as
       | efficiently as modern alternatives but the resulting performance
       | is only competitive at small scale.
        
         | phkahler wrote:
         | >> These models are very small even by academic standards so
         | any finding would _not necessarily_ extend to current LLM
         | scales.
         | 
         | Emphasis on not necessarily.
         | 
         | >> The main conclusion is that RNN class networks can be
         | trained as efficiently as modern alternatives but the resulting
         | performance is only competitive at small scale.
         | 
         | Shouldn't the conclusion be "the resulting competitive
         | performance has only been confirmed at small scale"?
        
           | imjonse wrote:
           | yes, that is clearer indeed. However S4 and Mamba class
           | models have also performed well at small scale and started
           | lagging with larger models and larger context sizes, or at
           | particular tasks.
        
       | xnx wrote:
        | It's a curse and a blessing that discussion of topics happens
        | in so many different places. I found this comment on Twitter/X
       | interesting: https://x.com/fchollet/status/1841902521717293273
       | 
       | "Interesting work on reviving RNNs.
       | https://arxiv.org/abs/2410.01201 -- in general the fact that
       | there are many recent architectures coming from different
       | directions that roughly match Transformers is proof that
       | architectures aren't fundamentally important in the curve-fitting
       | paradigm (aka deep learning)
       | 
       | Curve-fitting is about embedding a dataset on a curve. The
       | critical factor is the dataset, not the specific hard-coded bells
       | and whistles that constrain the curve's shape. As long as your
       | curve is sufficiently expressive all architectures will converge
       | to the same performance in the large-data regime."
        
         | islewis wrote:
         | > "As long as your curve is sufficiently expressive all
         | architectures will converge to the same performance in the
         | large-data regime."
         | 
         | I haven't fully ingested the paper yet, but it looks like it's
         | focused more on compute optimization than the size of the
         | dataset:
         | 
         | > ... and (2) are fully parallelizable during training (175x
         | faster for a sequence of length 512
         | 
         | Even if many types of architectures converge to the same loss
         | over time, finding the one that converges the fastest is quite
         | valuable given the cost of running GPU's at scale.
        
           | teruakohatu wrote:
           | > Even if many types of architectures converge to the same
           | loss over time, finding the one that converges the fastest is
           | quite valuable given the cost of running GPU's at scale.
           | 
            | This! Not just fastest, but with the lowest resources in
            | total.
            | 
            | Fully connected neural networks are universal function
            | approximators. Technically we don't need anything but an
            | FNN, but the memory requirements and speed would be
            | abysmal, far beyond the realm of practicality.
        
             | actionfromafar wrote:
             | Unless we could build chips in 3D?
        
               | foota wrote:
               | Not even then, a truly fully connected network would have
               | super exponential runtime (it would take N^N time to
               | evaluate)
        
               | ivan_gammel wrote:
               | We need quantum computing there. I remember seeing a
               | recent article about quantum processes in the brain. If
               | that's true, QC may be the missing part.
        
               | eru wrote:
                | Compare and contrast
                | https://www.smbc-comics.com/comic/the-talk-3
               | 
               | (Summary: quantum computing is unlikely to help.)
        
               | tsimionescu wrote:
               | This is just word salad.
               | 
               | There is no known quantum algorithm that can compute the
               | result of a fully-connected neural network exponentially
               | faster than classical computers can. QCs have a known
               | exponential advantage over classical computers only for a
               | very limited class of problems, mostly related to the
               | Quantum Fourier Transform.
               | 
               | Animal brains have little to nothing in common to
               | artifical neural networks. There is no reason whatsoever
               | to think that there is any relation between the
               | complexity class of brain functions and ANN inference.
               | 
               | And the hypothesized (and still wildly speculative)
               | quantum behaviors happening in the animal brain are at
               | the level of the behavior of individual neurons, not of
               | the network connections between neurons. So even if there
               | is some kind of quantum computation happening, it's
               | happening in individual neurons, not at the network
               | level, and that would only go to show even more that
               | animal brains are profoundly different from ANNs.
        
               | mvkel wrote:
               | Wetware is the future.
        
               | fennecfoxy wrote:
               | Can't wait to see this defiantly spray painted across a
               | torn up brick wall while computronium brained super
               | intelligences slowly disassemble our planet to make
               | paperclips.
        
               | ComputerGuru wrote:
               | Heat extraction.
        
               | bob1029 wrote:
               | We are already doing this.
        
           | byearthithatius wrote:
           | > finding the one that converges the fastest is quite
           | valuable given the cost of running GPU's at scale
           | 
           | Not to him, he runs the ARC challenge. He wants a new
           | approach entirely. Something capable of few-shot learning out
           | of distribution patterns .... somehow
        
         | acchow wrote:
         | What it will come down to is computational efficiencies. We
         | don't want to retrain once a month - we want to retrain
         | continuously. We don't want one agent talking to 5 LLMs. We
         | want thousands of LLMs all working in concert.
        
           | ActorNightly wrote:
            | This, and also the way models are trained has to be
            | rethought. Backprop is good for figuring out complex
            | function mappings, but not for storing information.
        
           | pbhjpbhj wrote:
           | Sounds like something that has unsustainable energy costs.
        
         | Lerc wrote:
         | I remember one of the initial transformer people saying in an
         | interview that they didn't think this was the "one true
         | architecture" but a lot of the performance came from people
         | rallying around it and pushing in the one direction.
         | 
          | On the other hand, while _"As long as your curve is
          | sufficiently expressive all architectures will converge to the
          | same performance in the large-data regime."_ is true, a
         | sufficiently expressive mechanism may not be computationally or
         | memory efficient. As both are constraints on what you can
         | actually build, it's not whether the architecture can produce
         | the result, but whether a feasible/practical instantiation of
         | that architecture can produce the result.
        
           | viktor_von wrote:
           | > I remember one of the initial transformer people saying in
           | an interview that they didn't think this was the "one true
           | architecture" but a lot of the performance came from people
           | rallying around it and pushing in the one direction.
           | 
           | You may be referring to Aidan Gomez (CEO of Cohere and
           | contributor to the transformer architecture) during his
           | Machine Learning Street Talk podcast interview. I agree, if
           | as much attention had been put towards the RNN during the
           | initial transformer hype, we may have very well seen these
           | advancements earlier.
        
         | ants_everywhere wrote:
         | > is proof that architectures aren't fundamentally important in
         | the curve-fitting paradigm (aka deep learning)
         | 
         | (Somewhat) fun and (somewhat) related fact: there's a whole
         | cottage industry of "is all you need" papers
         | https://arxiv.org/search/?query=%22is+all+you+need%22&search...
        
           | TaurenHunter wrote:
           | Reminds me of the "Considered Harmful" articles:
           | 
           | https://meyerweb.com/eric/comment/chech.html
        
             | jprete wrote:
             | I wonder if there's something about tech culture - or tech
             | people - that encourages them to really, really like
             | snowclones.
        
               | observationist wrote:
               | Yes. Do stuff that other people have been successful
               | doing. Monkey see, monkey do - it's not a tech people
               | thing, it's a human thing.
               | 
               | Tech just happens to be most on display at the moment -
               | because tech people are building the tools and the
               | parameters and the infrastructure handling all our
               | interactions.
        
               | fennecfoxy wrote:
               | Not sure why people are surprised about this when it's
               | the modus operandi of all life on the planet.
               | 
               | I could spam we are the stochastic parrots after all, yet
               | one more time.
        
             | bee_rider wrote:
             | Quick, somebody write "All you need Considered Harmful" and
             | "Considered Harmful all you need."
             | 
             | Which seems closer to true?
        
               | cozzyd wrote:
               | All you need is all you need.
        
           | tsimionescu wrote:
           | Starting of course with the classic paper from Lennon and
           | McCartney, 1967.
        
         | wongarsu wrote:
         | One big thing that bells and whistles do is limit the training
         | space.
         | 
         | For example when CNNs took over computer vision that wasn't
         | because they were doing something that dense networks couldn't
         | do. It was because they removed a lot of edges that didn't
         | really matter, allowing us to spend our training budget on
         | deeper networks. Similarly transformers are great because they
         | allow us to train gigantic networks somewhat efficiently. And
         | this paper finds that if we make RNNs a lot faster to train
          | they are actually pretty good. Training speed and efficiency
          | remain the big bottleneck, not the actual expressiveness of
          | the architecture.
        
           | nutanc wrote:
            | This is true. This is the reason that, in many of our
            | experiments using a new algorithm, KESieve, we find the
            | separating planes much faster than traditional deep
            | learning training approaches do. The premise: a neural
            | network builds planes that separate the data and adjusts
            | these planes through an iterative learning process. What if
            | we could find a non-iterative method that draws these same
            | planes? We have been trying this, and so far we have been
            | able to replace most network layers using this approach. We
            | haven't tried it for transformers yet, though.
           | 
           | Some links if interested:
           | 
           | [1] https://gpt3experiments.substack.com/p/understanding-
           | neural-...
           | 
           | [2] https://gpt3experiments.substack.com/p/building-a-vector-
           | dat...
        
         | dheera wrote:
         | I mean, transformer-based LLMs are RNNs, just really really
         | really big ones with very wide inputs that maintain large
         | amounts of context.
        
           | immibis wrote:
           | No. An RNN has an arbitrarily-long path from old inputs to
           | new outputs, even if in practice it can't exploit that path.
           | Transformers have fixed-size input windows.
        
             | og_kalu wrote:
             | You can't have a fixed state and have arbitrarily-long path
             | from input. Well you can but then it's just meaningless
             | because you fundamentally cannot keep stuffing information
             | of arbitrary length into a fixed state. RNNs effectively
             | have fixed-size input windows.
        
               | immibis wrote:
                | The _path_ is arbitrarily _long_, not wide. It is
                | _possible_ for an RNN to be made that remembers the
                | first word of the input, no matter how long the input
                | is. This is not possible with a transformer, so we know
                | they are fundamentally different.
        
               | quotemstr wrote:
               | But an RNN isn't _going_ to remember the first token of
                | input. It won't know until it sees the last token
               | whether that first token was relevant after all, so it
               | has to learn token-specific update rules that let it
               | guess how long to hold what kinds of information. (In
               | multi-layer systems, the network uses ineffable
               | abstractions rather than tokens, but the same idea
               | applies.)
               | 
               | What the RNN must be doing reminds me of "sliding window
               | attention" --- the model learns how to partition its
               | state between short- and long-range memories to minimize
               | overall loss. The two approaches seem related, perhaps
               | even equivalent up to implementation details.
        
               | OkayPhysicist wrote:
                | The most popular RNNs (the ones that were successful
                | enough for Google Translate and the like) actually had
                | this behavior baked into the architecture; they were
                | called LSTMs, for "Long Short-Term Memory".
        
             | dheera wrote:
             | A chunk of the output still goes into the transformer
             | input, so the arbitrarily-long path still exists, it just
             | goes through a decoding/encoding step.
        
             | WithinReason wrote:
             | no, you can give as much context to a transformer as you
             | want, you just run out of memory
        
               | immibis wrote:
               | An RNN doesn't run out of memory from that, so they are
               | still fundamentally different.
               | 
               | How do you encode arbitrarily long positions, anyway?
        
               | WithinReason wrote:
               | They are different but transformers don't have fixed
               | windows, you can extend the context or make it smaller. I
               | think you can extend a positional encoding if it's not a
               | learned encoding.
        
         | quantadev wrote:
         | Most LLMs aren't even using a "curve" yet at all, right? All
         | they're using is a series of linear equations because the model
         | weights are a simple multiply and add (i.e. basic NN
         | Perceptron). Sure there's a squashing function on the output to
         | keep it in a range from 0 to 1 but that's done BECAUSE we're
         | just adding up stuff.
         | 
          | I think future NNs will perhaps be more adaptive than this,
          | with some Perceptrons using sine wave functions or other
          | kinds of math functions beyond just the linear "y=mx+b".
         | 
         | It's astounding that we DID get the emergent intelligence from
         | just doing this "curve fitting" onto "lines" rather than actual
         | "curves".
        
           | OkayPhysicist wrote:
           | The "squashing function" necessarily is nonlinear in
           | multilayer nueral networks. A single layer of a neural
           | network can be quite simply written a weight matrix, times an
           | input vector, equalling an output vector, like so
           | 
           | Ax = y
           | 
           | Adding another layer is just multiplying a different set of
           | weights times the output of the first, so
           | 
           | B(Ax)= y
           | 
           | If you remember your linear algebra course, you might see the
           | problem: that can be simplified
           | 
           | (BA)x = y
           | 
           | Cx = y
           | 
           | Completely indistinguishable from a single layer, thus only
           | capable of modeling linear relationships.
           | 
           | To prevent this collapse, a non linear function must be
           | introduced between each layer.
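            | 
            | A quick numerical sanity check of that collapse (toy
            | sizes, random weights, numpy):
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     A = rng.standard_normal((4, 3))  # layer 1 weights
            |     B = rng.standard_normal((2, 4))  # layer 2 weights
            |     x = rng.standard_normal(3)
            | 
            |     two_layers = B @ (A @ x)   # B(Ax)
            |     C = B @ A                  # fold the layers together
            |     one_layer = C @ x          # Cx
            | 
            |     assert np.allclose(two_layers, one_layer)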
        
             | quantadev wrote:
             | Right. All the squashing is doing is keeping the output of
             | any neuron in a range of below 1.
             | 
             | But the entire NN itself (Perceptron ones, which most LLMs
             | are) is still completely using nothing but linearity to
             | store all the knowledge from the training process. All the
             | weights are just an 'm' in the basic line equation
             | 'y=m*x+b'. The entire training process does nothing but
             | adjust a bunch of slopes of a bunch of lines. It's totally
             | linear. No non-linearity at all.
        
               | nazgul17 wrote:
               | The non linearities are fundamental. Without them, any
               | arbitrarily deep NN is equivalent to a shallow NN (easily
               | computable, as GP was saying), and we know those can't
               | even solve the XOR problem.
               | 
               | > nothing but linearity
               | 
               | No, if you have non linearities, the NN itself is _not_
               | linear. The non linearities are not there primarily to
                | keep the outputs in a given range, though that's
               | important, too.
        
               | quantadev wrote:
               | > The non linearities are not there primarily to keep the
               | outputs in a given range
               | 
               | Precisely what the `Activation Function` does is to
               | squash an output into a range (normally below one, like
               | tanh). That's the only non-linearity I'm aware of. What
               | other non-linearities are there?
               | 
               | All the training does is adjust linear weights tho, like
               | I said. All the training is doing is adjusting the slopes
               | of lines.
        
               | jcparkyn wrote:
               | > squash an output into a range
               | 
               | This isn't the primary purpose of the activation
               | function, and in fact it's not even necessary. For
               | example see ReLU (probably the most common activation
               | function), leaky ReLU, or for a sillier example:
               | https://youtu.be/Ae9EKCyI1xU?si=KgjhMrOsFEVo2yCe
        
               | quantadev wrote:
               | You can change the subject by bringing up as many
               | different NN architectures, Activation Functions, etc. as
               | you want. I'm telling you the basic NN Perceptron design
               | (what everyone means when they refer to Perceptrons in
                | general), has something like a `tanh`, and not only is
                | its PRIMARY function to squash a number, that's its
                | ONLY function.
        
               | beckhamc wrote:
               | How was that person derailing the convo? Nothing says an
               | activation function has to "squash" a number to be in
               | some range. Leaky ReLUs for instance do `f(x) = x if x >
               | 0 else ax` (for some coefficient `a != 0`), that doesn't
               | squash `x` to be in any range (unless you want to be
               | peculiar about your precise definition of what it means
               | to squash a number). The function takes a real in `[-inf,
               | inf]` and produces a number in `[-inf, inf]`.
               | 
               | > Sure there's a squashing function on the output to keep
               | it in a range from 0 to 1 but that's done BECAUSE we're
               | just adding up stuff.
               | 
                | It's not because you're "adding up stuff"; there is a
                | specific mathematical or statistical reason why it is
               | used. For neural networks it's there to stop your multi
               | layer network collapsing to a single layer one (i.e. a
               | linear algebra reason). You can choose whatever function
               | you want, for hidden layers tanh generally isn't used
               | anymore, it's usually some variant of a ReLU. In fact
               | Leaky ReLUs are very commonly used so OP isn't changing
               | the subject.
               | 
               | If you define a "perceptron" (`g(Wx+b)` and `W` is a
               | `Px1` matrix) and train it as a logistic regression model
               | then you want `g` to be sigmoid. Its purpose is to ensure
               | that the output can be interpreted as a probability
                | (given that you use the correct statistical loss), which
               | means squashing the number. The inverse isn't true, if I
               | take random numbers from the internet and squash them to
               | `[0,1]` I don't go call them probabilities.
               | 
                | > and not only is its PRIMARY function to squash a
                | number, that's its ONLY function.
               | 
               | Squashing the number isn't the reason, it's the side
               | effect. And even then, I just said that not all
               | activation functions squash numbers.
               | 
               | > All the training does is adjust linear weights tho,
               | like I said.
               | 
               | Not sure what your point is. What is a "linear weight"?
               | 
               | We call layers of the form `g(Wx+b)` "linear" layers but
               | that's an abused term, if g() is non-linear then the
               | output is not linear. Who cares if the inner term `Wx +
               | b` is linear? With enough of these layers you can
               | approximate fairly complicated functions. If you're
               | arguing as to whether there is a better fundamental
               | building block then that is another discussion.
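                | 
                | For reference, the functions being discussed, in plain
                | Python, just to make the ranges concrete:
                | 
                |     import math
                | 
                |     def tanh(x):      # squashes to (-1, 1)
                |         return math.tanh(x)
                | 
                |     def sigmoid(x):   # squashes to (0, 1)
                |         return 1.0 / (1.0 + math.exp(-x))
                | 
                |     def relu(x):      # range [0, inf)
                |         return max(0.0, x)
                | 
                |     def leaky_relu(x, a=0.01):
                |         # range (-inf, inf): nonlinear at 0,
                |         # but nothing is squashed
                |         return x if x > 0 else a * x
                | 
                |     print(leaky_relu(1000.0))  # 1000.0, unsquashed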
        
               | quantadev wrote:
               | > What is a "linear weight"?
               | 
                | In the context of discussing linearity vs. non-linearity,
               | adding the word "linear" in front of "weight" is more
               | clear, which is what my top level post on this thread was
               | all about too.
               | 
               | It's astounding to me (and everyone else who's being
               | honest) that LLMs can accomplish what they do when it's
               | only linear "factors" (i.e. weights) that are all that's
               | required to be adjusted during training, to achieve
               | genuine reasoning. During training we're not [normally]
               | adjusting any parameters or weights on any non-linear
               | functions. I include the caveat "normally", because I'm
               | speaking of the basic Perceptron NN using a squashing-
               | type activation function.
        
               | viktor_von wrote:
               | > It's astounding to me (and everyone else who's being
               | honest) that LLMs can accomplish what they do when it's
               | only linear "factors" (i.e. weights) that are all that's
               | required to be adjusted during training, to achieve
               | genuine reasoning.
               | 
               | When such basic perceptrons are scaled enormously, it
               | becomes less surprising that they can achieve some level
               | of 'genuine reasoning' (e.g., accurate next-word
               | prediction), since the goal with such networks at the end
               | of the day is just function approximation. What is more
               | surprising to me is how we found ways to train such
               | models i.e., advances in hardware accelerators, combined
               | with massive data, which are factors just as significant
               | in my opinion.
        
               | quantadev wrote:
               | Yeah, no one is surprised that LLMs do what they're
               | trained to do: predict tokens. The surprise comes from
               | the fact that merely training to predict tokens ends up
               | with model weights that generate emergent reasoning.
               | 
               | If you want to say reasoning and token prediction are
               | just the same thing at scale you can say that, but I
               | don't fall into that camp. I think there's MUCH more to
               | learn, and indeed a new field of math or even physics
               | that we haven't even discovered yet. Like a step change
               | in mathematical understanding analogous to the invention
               | of Calculus.
        
               | mr_toad wrote:
               | You need a non-linear activation function for the
               | universal approximation theorem to hold. Otherwise, as
               | others have said the model just collapses to a single
               | layer.
               | 
               | Technically the output is still what a statistician would
               | call "linear in the parameters", but due to the universal
               | approximation theorem it can _approximate_ any non-linear
               | function.
               | 
               | https://stats.stackexchange.com/questions/275358/why-is-
               | incr...
        
               | quantadev wrote:
               | As you can see in what I just posted about an inch below
               | this, my point is that the process of training a NN does
               | not involve adjusting any parameter to any non-linear
               | functions. What goes into an activation function is a
               | pure sum of linear multiplications and an add, but
               | there's no "tunable" parameter (i.e. adjusted during
               | training) that's fed into the activation function.
        
               | beckhamc wrote:
               | Learnable parameters on activations _do_ exist, look up
               | parametric activation functions.
        
               | quantadev wrote:
                | Of course they exist. A parameterized activation
                | function is the most obvious thing to _try_ in NN
                | design, and has certainly been invented/studied by
                | 1000s of researchers.
        
               | uh_uh wrote:
               | > That's the only non-linearity I'm aware of.
               | 
               | "only" is doing a lot work here because that non-
               | linearity is enough to vastly expand the landscape of
               | functions that an NN can approximate. If the NN was
               | linear, you could greatly simplify the computational
               | needs of the whole thing (as was implied by another
               | commenter above) but you'd also not get a GPT out of it.
        
               | quantadev wrote:
               | All the trainable parameters are just slopes of lines
               | tho. Training NNs doesn't involve adjusting any inputs to
               | non-linear functions. The tanh smashing function just
               | makes sure nothing can blow up into large numbers and all
               | outputs are in a range of less than 1. There's no "magic"
               | or "knowledge" in the tanh smashing. All the magic is
               | 100% in the weights. They're all linear. The amazing
               | thing is that all weights are linear slopes of lines.
        
               | Nevermark wrote:
               | Simply squashing the output of a linear signal would be
               | multiplying by a small value. To avoid large y, you add a
               | step y' = y/1000.
               | 
               | That would still be linear. And the result would be that
               | despite squashing, no matter how many layers a model had,
               | it could only fit linear problems. Which can always be
               | fit with a single layer, i.e. single matrix.
               | 
               | So nobody does that.
               | 
                | The nonlinearity doesn't just squash some inputs; it
                | creates a new, rich feature: decision making. That's
                | because on one side of a threshold y gets converted
                | very differently than on the other. I.e. if y > 0,
                | y' = y; otherwise y' = 0.
               | 
               | Now you have a discontinuity in behavior, you have a
               | decision.
               | 
               | Multiple layers making decisions can do far more than a
               | linear layer. They can fit any continuous function (or
               | any function with a finite number of discontinuities)
               | arbitrarily well.
               | 
               | Non-linearities add a fundamental new feature. You can
               | think of that features as being able to make decisions
               | around the non-linear function's decision points.
               | 
               | ---
               | 
               | If you need to prove this to yourself with a simple
               | example, try to create an XOR gate with this function:
               | y = w1 * x1 + w2 * x2 + b.
               | 
               | Where you can pick w1, w2 and b.
               | 
               | You are welcome to linearly squash the output, i.e. y' =
               | y * w3, for whatever small w3 you like. It won't help.
               | 
               | Layers with non-linear transformations are layers of
               | decision makers.
               | 
               | Layers of linear transforms are just unnecessarily long
               | ways of writing a single linear transform. Even with
               | linear "squashing".
        
               | quantadev wrote:
               | Right, it's obvious that the ReLU is just a gating
               | mechanism, and you can think of that as a decision maker.
               | It's like a "pass thru linearly proportionally" or
               | "block" function.
               | 
               | But I still find it counter-intuitive that it's not
               | common practice in standard LLM NNs to have a trainable
               | parameter that in some way directly "tunes" whatever
               | Activation Function is being applied on EACH output.
               | 
               | For example I almost started experimenting with
               | trigonometric activation functions in a custom NN where
               | the phase angle would be adjusted, inspired by Fourier
               | Series. I can envision a type of NN where every model
               | "weight" is actually a frequency component, because
               | Fourier Series can represent any arbitrary function in
               | this way. There has of course already been similar
               | research done by others along these lines.
        
               | uh_uh wrote:
               | > The tanh smashing function just makes sure nothing can
               | blow up into large numbers and all outputs are in a range
               | of less than 1.
               | 
               | That's not the main point even though it probably helps.
               | As OkayPhysicist said above, without a nonlinearity, you
               | could collapse all the weight matrices into a single
               | matrix. If you have 2 layers (same size, for simplicity)
               | described by weight matrices A and B, you could multiply
               | them and get C, which you could use for inference.
               | 
               | Now, you can do this same trick not only with 2 layers
               | but 100 million, all collapsing into a single matrix
               | after multiplication. If the nonlinearities weren't
               | there, the effective information content of the whole NN
               | would collapse into that of a single-layer NN.
        
               | quantadev wrote:
               | You can explain the "effect" of tanh at any level of
               | abstraction you like, up to including describing things
               | that happen in Semantic Space itself, but my description
               | of what tanh is doing is 100% accurate in the context I
               | used it. All it's doing is squashing a number down to
               | below one. My understanding of how the Perceptron works
               | is fully correct, and isn't missing any details. I've
               | implemented many of them.
        
               | beckhamc wrote:
               | Your description of tanh isn't even correct, it squashes
               | a real number to `(-1, 1)`, not "less than one".
               | 
               | You're curious about whether there is gain in
               | parameterising activation functions and learning them
               | instead, or rather, why it's not used much in practice.
               | That's an interesting and curious academic question, and
               | it seems like you're already experimenting with trying
               | out your own kinds of activation functions. However,
               | people in this thread (including myself) wanted to
               | clarify some perceived misunderstandings you had about
               | nonlinearities and "why" they are used in DNNs. Or how
               | "squashing functions" is a misnomer because `g(x) =
               | x/1000` doesn't introduce any nonlinearities. Yet you
               | continue to fixate and double down on your knowledge of
               | "what" a tanh is, and even that is incorrect.
        
               | quantadev wrote:
               | When discussing `tanh squashing` among other AI experts
               | it's generally assumed that even the most pedantic and
               | uncharitable parsing of words won't be able to
               | misinterpret "smashing to less than one" as an
                | _incorrect_ sentence fragment, because the "one", in
               | that context, obviously refers to distance from zero.
        
               | wrs wrote:
               | With a ReLU activation function, rather than a simple
               | linear function of the inputs, you get a _piecewise
               | linear approximation_ of a nonlinear function.
               | 
               | ReLU enables this by being nonlinear in a simple way,
               | specifically by outputting zero for negative inputs, so
               | each linear unit can then limit its contribution to a
               | portion of the output curve.
               | 
               | (This is a lot easier to see on a whiteboard!)
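                | 
                | Since I can't draw the whiteboard here, a rough numpy
                | sketch of the same picture, with knots and slope
                | increments picked by hand rather than learned:
                | 
                |     import numpy as np
                | 
                |     relu = lambda z: np.maximum(z, 0.0)
                |     xs = np.linspace(0.0, 1.0, 101)
                | 
                |     # each ReLU "turns on" at a knot and adds slope,
                |     # so the sum traces the chords of y = x^2
                |     knots  = np.array([0.0, 0.25, 0.5, 0.75])
                |     slopes = np.array([0.25, 0.5, 0.5, 0.5])
                | 
                |     approx = np.zeros_like(xs)
                |     for c, k in zip(slopes, knots):
                |         approx += c * relu(xs - k)
                | 
                |     print(np.max(np.abs(approx - xs**2)))  # ~0.016
                | 
                | A trained layer does the same thing, except the knots
                | and slopes come out of gradient descent instead of
                | being chosen by hand.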
        
               | quantadev wrote:
               | ReLU technically has a non-linearity at zero, but in some
               | sense it's still even MORE linear than tanh or sigmoid,
               | so it just demonstrates even better than tanh-type
               | squashing that all this LLM stuff is being done
               | ultimately with straight line math. All a ReLU function
               | does is choose which line to use, a sloped one or a zero
               | one.
        
               | wrs wrote:
               | Well. The word "linear" the way you use it doesn't seem
               | to have any particular meaning, certainly not the
               | standard mathematical meaning, so I'm not sure we can
               | make further progress on this explanation.
               | 
               | I'll just reiterate that the single "technical" (whatever
               | that means) nonlinearity in ReLU is exactly what lets a
               | layer approximate any continuous[*] function.
               | 
               | [*] May have forgotten some more adjectives here needed
               | for full precision.
        
               | quantadev wrote:
               | If you're confused just show a tanh graph and a ReLU
               | graph to a 7 year old child and ask which one is linear.
               | They'll all get it right. So you're not confused in the
               | slightest bit about anything I've said. There's nothing
               | even slightly confusing about saying a ReLU is made of
               | two lines.
        
               | mickg10 wrote:
                | I.e. ReLU is _piecewise_ linear. The kink where the two
                | pieces meet is precisely what makes it nonlinear. Which
                | is what enables the actual universal approximation.
        
               | quantadev wrote:
               | Which is what I said two replies ago.
               | 
               | Followed by "in some sense it's [ReLU] still even MORE
               | linear than tanh or sigmoid functions are". There's no
               | way you misunderstood that sentence, or took it as my
               | "definition" of linearity...so I guess you just wanted to
               | reaffirm I was correct, again, so thanks.
        
               | scarmig wrote:
               | Nonlinearity somewhere is fundamental, but it doesn't
               | need to be between each layer. You can, for instance,
               | project each input to a higher dimensional space with a
               | nonlinearity, and the problem becomes linearly separable
               | with high probability (cf Cover's Theorem).
               | 
               | So, for XOR, (x, y) -> (x, y, xy), and it becomes trivial
               | for a linear NN to solve.
               | 
               | Architectures like Mamba have a linear recurrent state
               | space system as their core, so even though you need a
               | nonlinearity somewhere, it doesn't need to be pervasive.
               | And linear recurrent networks are surprisingly powerful
               | (https://arxiv.org/abs/2303.06349,
               | https://arxiv.org/abs/1802.03308).
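                | 
                | A quick sketch of that XOR lift in numpy (toy code,
                | names made up):
                | 
                |     import numpy as np
                | 
                |     X = np.array([[0, 0], [0, 1],
                |                   [1, 0], [1, 1]], float)
                |     y = np.array([0, 1, 1, 0], float)
                | 
                |     # lift (x1, x2) -> (x1, x2, x1*x2), plus a bias
                |     xy = (X[:, 0] * X[:, 1]).reshape(-1, 1)
                |     Z = np.hstack([X, xy, np.ones((4, 1))])
                | 
                |     w, *_ = np.linalg.lstsq(Z, y, rcond=None)
                |     print(np.round(Z @ w, 6))  # ~[0, 1, 1, 0]
                |     print(np.round(w, 6))      # x1 + x2 - 2*x1*x2
                | 
                | With the product feature added, plain linear least
                | squares recovers XOR exactly.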
        
           | mr_toad wrote:
           | > It's astounding that we DID get the emergent intelligence
           | from just doing this "curve fitting" onto "lines" rather than
           | actual "curves".
           | 
            | In Ye Olden days (the 90's) we used to approximate non-
            | linear models using splines or separate-slopes models, fit
            | by hand.
           | They were still linear, but with the right choice of splines
           | you could approximate a non-linear model to whatever degree
           | of accuracy you wanted.
           | 
           | Neural networks "just" do this automatically, and faster.
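            | 
            | For the curious, the hand-rolled version looked roughly
            | like this (a toy hinge/spline basis with hand-picked
            | knots; the fit is still linear in the coefficients):
            | 
            |     import numpy as np
            | 
            |     x = np.linspace(0, 2 * np.pi, 200)
            |     y = np.sin(x)          # the "non-linear" target
            | 
            |     # basis: intercept, x, and hinges at chosen knots
            |     cols = [np.ones_like(x), x]
            |     for k in (1.5, 3.0, 4.5):
            |         cols.append(np.maximum(x - k, 0.0))
            |     B = np.column_stack(cols)
            | 
            |     coef, *_ = np.linalg.lstsq(B, y, rcond=None)
            |     err = np.max(np.abs(B @ coef - y))
            |     print(err)   # a passable fit from 5 coefficients
            | 
            | Swap the hand-picked knots for learned ones and you are
            | most of the way to a one-hidden-layer ReLU network.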
        
             | quantadev wrote:
             | In college (BSME) I wrote a computer program to generate
             | cam profiles from Bezier curves. It's just a programming
             | trick to generate curves from straight lines at any level
             | of accuracy you want just by letting the computer take
             | smaller and smaller steps.
             | 
              | It's an interesting concept to think about how NNs
              | might be able to exploit this effect in some way based
              | on straight lines in the weights, because a very small
              | number of points can identify a very precise and smooth
              | curve, where directions on the curve might equate to
              | Semantic Space Vectors.
        
               | quantadev wrote:
               | In fact now that I think about it, for any 3 or more
               | points in Semantic Space, there would necessarily be a
               | "Bezier Path" which would have genuine meaning at every
               | point as a good smooth differentiable path thru higher
               | dimensional space to get from one point to another point
               | while "visiting" all intermediate other points. This has
               | to have a direct use in LLMs in terms of reasoning.
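                | 
                | (For anyone curious, the "curves from straight lines"
                | trick is just nested linear interpolation, De
                | Casteljau style -- a rough sketch, not my original cam
                | program:)
                | 
                |     import numpy as np
                | 
                |     def lerp(p, q, t):
                |         return (1 - t) * p + t * q
                | 
                |     # a quadratic Bezier point is nothing but two
                |     # straight-line interpolations, then one more
                |     def bezier3(p0, p1, p2, t):
                |         a = lerp(p0, p1, t)
                |         b = lerp(p1, p2, t)
                |         return lerp(a, b, t)
                | 
                |     p0 = np.array([0.0, 0.0])
                |     p1 = np.array([1.0, 2.0])
                |     p2 = np.array([2.0, 0.0])
                |     ts = np.linspace(0.0, 1.0, 9)
                |     pts = [bezier3(p0, p1, p2, t) for t in ts]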
        
         | sakras wrote:
         | I figured this was pretty obvious given that MLPs are universal
         | function approximators. A giant MLP could achieve the same
         | results as a transformer. The problem is the scale - we can't
         | train a big enough MLP. Transformers are a performance
         | optimization, and that's why they're useful.
        
         | ctur wrote:
         | Architecture matters because while deep learning can
         | conceivably fit a curve with a single, huge layer (in theory...
         | Universal approximation theorem), the amount of compute and
         | data needed to get there is prohibitive. Having a good
         | architecture means the theoretical possibility of deep learning
         | finding the right N dimensional curve becomes a practical
         | reality.
         | 
         | Another thing about the architecture is we inherently bias it
         | with the way we structure the data. For instance, take a
         | dataset of (car) traffic patterns. If you only track the date
         | as a feature, you miss that some events follow not just the
         | day-of-year pattern but also holiday patterns. You could learn
          | this with deep learning given enough data, but if we bake it
          | into the dataset, you can build a _much_ simpler and faster
          | model on it.
         | 
         | So, architecture matters. Data/feature representation matters.
        
           | mr_toad wrote:
           | > can conceivably fit a curve with a single, huge layer
           | 
           | I think you need a hidden layer. I've never seen a universal
           | approximation theorem for a single layer network.
        
             | dongecko wrote:
             | I second that thought. There is a pretty well cited paper
             | from the late eighties called "Multilayer Feedforward
             | Networks are Universal Approximators". It shows that a
             | feedforward network with a single hidden layer containing a
             | finite number of neurons can approximate any continuous
              | function. For non-continuous functions, additional layers
              | are needed.
        
         | drodgers wrote:
         | > The critical factor is the dataset, not the specific hard-
         | coded bells and whistles that constrain the curve's shape
         | 
         | I have almost the opposite take. We've had a lot of datasets
         | for ages, but all the progress in the last decade has come from
          | advances in how curves are architected and fit to the dataset
         | (including applying more computing power).
         | 
         | Maybe there's some theoretical sense in which older models
         | could have solved newer problems just as well if only we
         | applied 1000000x the computing power, so the new models are
         | 'just' an optimisation, but that's like dismissing the
         | importance of complexity analysis in algorithm design, and thus
         | insisting that bogosort and quicksort are equivalent.
         | 
         | When you start layering in normalisation techniques to minimise
         | overfitting, and especially once you start thinking about more
         | agentic architectures (eg. Deep Q Learning, some of the search
         | space design going into OpenAI's o1), then I don't think the
         | just-an-optimisation perspective can hold much water at all -
         | more computing power simply couldn't solve those problems with
         | older architectures.
        
           | eru wrote:
           | I see what you are saying, and I made a similar comment.
           | 
           | However it's still an interesting observation that many
           | architectures can arrive at the same performance (even though
           | the training requirements are different).
           | 
           | Naively, you wouldn't expect eg 'x -> a * x + b' to fit the
           | same data as 'x -> a * sin x + b' about equally well. But
           | that's an observation from low dimensions. It seems once you
           | add enough parameters, the exact model doesn't matter too
           | much for practical expressiveness.
           | 
           | I'm faintly reminded of the Church-Turing Thesis; the
           | differences between different computing architectures are
           | both 'real' but also 'just an optimisation'.
           | 
           | > When you start layering in normalisation techniques to
           | minimise overfitting, and especially once you start thinking
           | about more agentic architectures (eg. Deep Q Learning, some
           | of the search space design going into OpenAI's o1), then I
           | don't think the just-an-optimisation perspective can hold
           | much water at all - more computing power simply couldn't
           | solve those problems with older architectures.
           | 
           | You are right, these normalisation techniques help you
           | economise on training data, not just on compute. Some of
           | these techniques can be done independent of the model, eg
           | augmenting your training data with noise. But some others are
           | very model dependent.
           | 
           | I'm not sure how the 'agentic' approaches fit here.
        
             | refulgentis wrote:
              | > _Naively, you wouldn't expect_
              | 
              | I, a naïve, expected this.
             | 
             | Is multiplication versus sine in the analogy hiding it,
             | perhaps?
             | 
             | I've always pictured it as just "needing to learn" the
             | function terms and the function guts are an abstraction
             | that is learned.
             | 
             | Might just be because I'm a physics dropout with a bunch of
              | whacky half-remembered probably-wrong stuff about how any
              | function can be approximated by e.g. Fourier series.
        
               | eru wrote:
               | So (most) neural nets can be seen as a function of a
               | _fixed_ form with some inputs and lots and lots of
               | parameters.
               | 
               | In my example, a and b were the parameters. The kinds of
               | data you can approximate well with a simple sine wave and
               | the kinds of data you can approximate with a straight
               | line are rather different.
               | 
               | Training your neural net only fiddles with the parameters
               | like a and b. It doesn't do anything about the shape of
               | the function. It doesn't change sine into multiplication
               | etc.
               | 
                | > [...] about how any function can be approximated by
                | e.g. Fourier series.
               | 
               | Fourier series are an interesting example to bring up! I
               | think I see what you mean.
               | 
               | In theory they work well to approximate any function over
               | either a periodic domain or some finite interval. But
               | unless you take special care, when you apply Fourier
               | analysis naively it becomes extremely sensitive to errors
               | in the phase parameters.
               | 
               | (Special care could eg be done by hacking up your input
               | domain into 'boxes'. That works well for eg audio or
               | video compression, but gives up on any model
               | generalisation between 'boxes', especially for what would
               | happen in a later box.)
               | 
               | Another interesting example is Taylor series. For many
               | simple functions Taylor series are great, but for even
               | moderately complicated ones you need to be careful. See
                | eg how the Taylor series for the logarithm around x=1
               | works well, but if you tried it around x=0, you are in
               | for a bad time.
               | 
               | The interesting observation isn't just that there are
               | multiple universal approximators, but that at high enough
               | parameter count, they seem to perform about equally well
               | in how good they are at approximating in practice (but
               | differ in how well they can be trained).
        
               | leereeves wrote:
               | > Training your neural net only fiddles with the
               | parameters like a and b. It doesn't do anything about the
               | shape of the function. It doesn't change sine into
               | multiplication etc.
               | 
               | It definitely can. The output will always be piecewise
               | linear (with ReLU), but the overall shape can change
               | completely.
        
               | ziofill wrote:
               | You can fit any data with enough parameters. What's
               | tricky is to constrain a model so that it approximates
               | the ground truth well where there are no data points. If
               | a family of functions is extremely flexible and can fit
               | all kinds of data very efficiently I would argue it makes
               | it harder for those functions to have correct values out
               | of distribution.
        
               | leereeves wrote:
               | Definitely. That's a fundamental observation called the
               | bias-variance tradeoff. More flexible models are prone to
               | overfitting, hitting each training point exactly with
               | wild gyrations in between.
               | 
               | Big AI minimizes that problem by using more data. So much
               | data that the model often only sees each data point once
               | and overfitting is unlikely.
        
               | ziofill wrote:
               | But while keeping the data constant, adding more and more
               | parameters is a strategy that works, so what gives? Are
               | the functions getting somehow regularized during training
               | so effectively you could get away with fewer parameters,
               | it's just that we don't have the right model just yet?
        
               | eru wrote:
               | Sorry, when I meant 'shape' of the function, I meant the
               | shape of the abstract syntax tree (or something like
               | that).
               | 
               | Not the shape of its graph when you draw it.
        
               | refulgentis wrote:
               | More directly than my first attempt: you're continuing
                | the error here. The naïve's approach of "it's
               | approximating some function" both maps to reality and
               | makes accurate predictions. The more we couple ourselves
               | to "no no no, it's modeling a precise function", the more
               | we end up wrong, both on how it works in theory and in
               | practice.
        
             | dboreham wrote:
             | This reminds me of control systems theory where provided
             | there's feedback, the forward transfer function doesn't
             | matter beyond very basic properties around the origin.
        
           | mirekrusin wrote:
            | Isn't the transformer the bogosort here, and the proposed
            | modified RNN the quicksort (175 times faster training for
            | 500-length sequences)?
        
           | f1shy wrote:
           | Wait! We certainly did NOT have huge datasets (like current
           | internet) for ages. Not even decades. I've seen a lecture by
            | an MIT professor (which I cannot find now) where he asserted
            | categorically that the advances in AI are mostly because of
            | the huge data that we now have and didn't have before. And that
           | was an _old_ video.
        
             | yosefk wrote:
              | Whichever sense it's true in, it's not true in the sense
              | that, eg, you can approximate any curve with a single-
              | layer neural net: you're not actually going to be able to
              | do that for problems CNNs or transformers work decently
              | on. And Google indexed all of the public Internet way
              | before its researchers came up with transformers.
              | 
              | Another way to look at it: as you say, it was an old
              | video, but there has been progress since, even though by
              | its own definition we already had large datasets when it
              | came out.
        
           | tsimionescu wrote:
           | I think by far the biggest advances are related to compute
           | power. The amount of processing needed to run training
           | algorithms on the amounts of data needed for the latest
           | models was just not possible even five years ago, and
           | definitely not ten years ago.
           | 
           | I'm sure there are optimizations from the model shape as
           | well, but I don't think that running the best algorithms we
           | have today with hardware from five-ten years ago would have
           | worked in any reasonable amount of time/money.
        
             | freeqaz wrote:
             | A 30bn param model, hell even a 7bn param model, is still
             | incredibly useful and I feel like that could have been
             | doable a decade ago!
             | 
             | We have GPT-4 (or at least 3.5) tier performance in these
              | much smaller models now. If we teleported back in time,
              | it may have been possible to build something like them
              | back then.
        
               | tsimionescu wrote:
               | I think the size of the model is only one part of it.
               | They're still training these 7bn parameter models on the
               | whole data set, and just crunching through that takes
               | enormous compute, that people just didn't have at the
               | current price points until now.
               | 
               | I should also mention that the idea itself of using GPUs
               | for compute and then specifically for AI training was an
               | innovation. And the idea that simply scaling up was going
               | to be worth the investment is another major innovation.
               | It's not just the existence of the compute power, it's
               | the application to NN training tasks that got us here.
               | 
               | Here[0] is an older OpenAI post about this very topic.
               | They estimate that between 2012 and 2018, the compute
               | power used for training the SotA models at those times
               | increased roughly 300,000 times, doubling every ~3.5
               | months.
               | 
               | [0] https://openai.com/index/ai-and-compute/
        
         | _giorgio_ wrote:
         | Chollet is just a philosopher. He also thinks that keras and
          | tensorflow are important, when nobody uses those. And he
          | published false data about their usage.
        
         | eru wrote:
         | Well, you also need an approach to 'curve fitting' where it's
         | actually computationally feasible to fit the curve. The
         | approach of mixing layers of matrix multiplication with a
         | simple non-linearity like max(0, x) (ReLU) works really well
         | for that. Earlier on they tried more complicated non-
         | linearities, like sigmoids, or you could try an arbitrary curve
         | that's not split into layers at all, you would probably find it
         | harder. (But I'm fairly sure in the end you might end up in the
         | same place, just after lots more computation spent on fitting.)
        
         | tippytippytango wrote:
         | Inductive bias matters. A lot.
        
         | avereveard wrote:
          | Well yes, but actually no, I guess: the transformers' benefit
          | at the time was that they were more stable while learning,
          | enabling larger and larger networks and datasets to be learnt.
        
         | WithinReason wrote:
          | If you've spent some time actually training networks you know
          | that's not true; that's why batch norm, dropout, and
          | regularization are so successful. They don't increase the
          | network's capacity (parameter count) but they do increase its
          | ability to learn.
        
       | m11a wrote:
       | It'd be nice to see more of how this compares to Mamba. Looks
       | like, in performance, they're not leagues apart and it's just a
       | _different_ architecture, not necessarily better or worse?
        
         | yazzku wrote:
         | Look at the memory consumption diagram on page 6. It looks like
         | you're basically getting the same running time for less memory
         | usage.
        
       | dsamarin wrote:
        | The name of the paper contrasts with the paper that spawned
        | the Transformer architecture, which itself is a reference to
        | the song
       | "All You Need Is Love" by the Beatles.
       | https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
        
         | vundercind wrote:
         | I eagerly await the backlash to suggesting any one thing is all
         | you need, the first shot of which shall surely be titled: "'All
         | you need' Considered Harmful"
        
           | ants_everywhere wrote:
           | Surely the universe is all you need though
        
             | radarsat1 wrote:
             | Interstellar taught me that love transcends the universe.
             | Ergo..
        
       | marcosdumay wrote:
       | R == Recurrent
       | 
        | In theory the answer to the question should be "yes": they are
        | Turing complete.
       | 
       | The real question is about how to train them, and the paper is
       | about that.
        
         | baanist wrote:
         | Why aren't AI researchers automating the search for efficient
         | architectures?
        
           | ks2048 wrote:
           | https://en.wikipedia.org/wiki/Neural_architecture_search
        
           | kelseyfrog wrote:
            | The search space is far too wide, difficult to
           | parameterize, and there is a wide gap between effective and
           | ineffective architectures - ie: a very small change can make
           | a network effectively DOA.
        
             | hedgehog wrote:
             | Notably architecture search was popular for small vision
             | nets where the cost of many training runs was low enough. I
             | suspect some of the train-then-prune approaches will come
             | back, but even there only by the best funded teams.
        
           | ActorNightly wrote:
           | There has been some work, but the problem is that its such a
           | massive search space. Philosophically speaking, if you look
           | at how humans came into existence, you could make an argument
           | that the process of evolution from basic lifeforms can be
            | represented as one giant compute step per minute across all
            | of earth, where genetic selection happens and computation
            | proceeds to the next minute. That's a fuckload of compute.
           | 
           | In more practical terms, you would imagine that an advanced
           | model contains some semblance of a CPU to be able to truly
           | reason. Given that CPUs can be all NAND gates (which take 2
           | neurons to represent), and are structured in a recurrent way,
           | you fundamentally have to rethink how to train such a
           | network, because backprop obviously won't work to capture
           | things like binary decision points.
        
             | baanist wrote:
             | I thought the whole point of neural networks was that they
             | were good at searching through these spaces. I'm pretty
             | sure OpenAI is pruning their models behind the scenes to
             | reduce their costs because that's the only way they can
             | keep reducing the cost per token. So their secret sauce at
             | this point is whatever pruning AI they're using to whittle
             | the large computation graphs into more cost efficient
             | consumer products.
        
               | spencerchubb wrote:
               | When you train a neural network, it is not search, it is
               | descending through a curve.
               | 
               | If you were to search for billions of parameters by brute
               | force, you literally could not do it in the lifespan of
               | the universe.
               | 
               | A neural network is differentiable, meaning you can take
                | the derivative of it. You train the parameters by
                | finding the gradient with respect to each parameter, and
               | going in the opposite direction. Hence the name of the
               | popular algorithm, gradient descent.
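                | 
                | A minimal sketch of that loop (my own toy example, a
                | single parameter, nothing to do with the paper):
                | 
                |     import numpy as np
                | 
                |     # Fit y = w*x by descending the squared error.
                |     rng = np.random.default_rng(0)
                |     x = rng.normal(size=1000)
                |     y = 3.0 * x + rng.normal(scale=0.1, size=1000)
                | 
                |     w, lr = 0.0, 0.1
                |     for step in range(100):
                |         # derivative of mean((w*x - y)**2) w.r.t. w
                |         grad = np.mean(2 * (w * x - y) * x)
                |         w -= lr * grad   # step against the gradient
                |     print(w)   # ~3.0, found without any brute force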
        
               | bob1029 wrote:
               | A biological neural network is certainly not
               | differentiable. If the thing we want to build is not
               | realizable with this technique, why can't we move on from
               | it?
               | 
               | Gradient descent isn't the only way to do this.
               | Evolutionary techniques can explore impossibly large,
               | non-linear problem spaces.
               | 
               | Being able to define any kind of fitness function you
               | want is sort of like a super power. You don't have to
               | think in such constrained ways down this path.
        
               | og_kalu wrote:
               | >A biological neural network is certainly not
                | differentiable
               | 
               | Biology is biology and has its constraints. Doesn't
               | necessarily mean a biologically plausible optimizer would
               | be the most efficient or correct way in silicon.
               | 
               | >If the thing we want to build is not realizable with
               | this technique, why can't we move on from it?
               | 
               | All the biologically plausible optimizers we've fiddled
               | with (and we've fiddled with quite a lot) just work
               | (results wise) like gradient descent but worse. We've not
               | "moved on" because gradient descent is and continues to
               | be better.
               | 
               | >Evolutionary techniques can explore impossibly large,
               | non-linear problem spaces.
               | 
               | Sure, with billions of years (and millions of concurrent
               | experiments) on the table.
        
           | xpe wrote:
           | Program synthesis is a generalization of this. I'm not sure
           | that many ML researchers have thought about the connections
           | yet.
        
         | jjtheblunt wrote:
         | What are you saying is Turing-complete?
        
           | baanist wrote:
            | Neural networks are Turing complete, i.e. there is a
            | universal neural network that can compute any effectively
            | computable function [1]. Incidentally, when this is combined
            | with Rice's theorem [2] it means that safety research is
            | essentially an unsolvable problem, because a sufficiently
            | complex neural network, e.g. one that can simulate a Turing
            | machine, will have non-trivial properties which cannot be
            | predicted with finite computation.
           | 
           | 1: https://www.sciencedirect.com/science/article/pii/08939659
           | 91...
           | 
           | 2: https://en.wikipedia.org/wiki/Rice%27s_theorem?useskin=vec
           | to...
        
             | jjtheblunt wrote:
             | super interesting, and i'd not seen either reference.
             | thanks very much.
        
       | logicchains wrote:
        | The model in the paper isn't a "real" RNN, due to the changes
        | that make it parallelizable, for the same reasons described in
        | https://arxiv.org/abs/2404.08819 , and hence is theoretically
        | less powerful than a "real" RNN (it struggles at some classes
        | of problems that RNNs traditionally excel at). On the other
        | hand,
       | https://arxiv.org/abs/2405.04517 contains a "real" RNN component,
       | which demonstrates a significant improvement on the kind of
       | state-tracking problems that transformers struggle with.
        
         | robertsdionne wrote:
         | These are real RNNs, they still depend upon the prior hidden
         | state, it's just that the gating does not. The basic RNN
         | equation can be parallelized with parallel prefix scan
         | algorithms.
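          | 
          | A rough sketch of why that works (my own illustration, not
          | the paper's code): the update h[t] = a[t]*h[t-1] + b[t] is an
          | affine map, and affine maps compose associatively, so the
          | sequence can be reduced with a scan.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     T = 8
          |     a = rng.uniform(0.1, 0.9, T)
          |     b = rng.normal(size=T)
          | 
          |     # Sequential reference: h[t] = a[t]*h[t-1] + b[t].
          |     h, ref = 0.0, []
          |     for t in range(T):
          |         h = a[t] * h + b[t]
          |         ref.append(h)
          | 
          |     # Combine two affine maps (a1,b1) then (a2,b2):
          |     # the composition is (a1*a2, a2*b1 + b2), and this
          |     # combine operation is associative.
          |     def combine(l, r):
          |         return (l[0] * r[0], r[0] * l[1] + r[1])
          | 
          |     # Shown here as a left-to-right fold; because combine
          |     # is associative, a real implementation can do it as a
          |     # parallel (prefix-scan) tree instead.
          |     acc, out = (1.0, 0.0), []
          |     for t in range(T):
          |         acc = combine(acc, (a[t], b[t]))
          |         out.append(acc[1])
          | 
          |     print(np.allclose(ref, out))   # True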
        
       | bob1029 wrote:
       | > Transformers required ~2.5x more training steps to achieve
       | comparable performance, overfitting eventually.
       | 
       | > RNNs are particularly suitable for sequence modelling settings
       | such as those involving time series, natural language processing,
       | and other sequential tasks where context from previous steps
       | informs the current prediction.
       | 
       | I would like to draw an analogy to digital signal processing. If
       | you think of the recurrent-style architectures as IIR filters and
       | feedforward-only architectures as FIR filters, you will likely
       | find many parallels.
       | 
       | The most obvious to me being that IIR filters typically require
       | far fewer elements to produce the same response as an equivalent
       | FIR filter. Granted, the FIR filter is often easier to
       | implement/control/measure in practical terms (fixed-point
       | arithmetic hardware == ML architectures that can run on GPUs).
       | 
       | I don't think we get to the exponential scary part of AI without
       | some fundamentally recurrent architecture. I think things like
       | LSTM are kind of an in-between hack in this DSP analogy - You
       | could look at it as FIR with dynamic coefficients. Neuromorphic
       | approaches seem like the best long term bet to me in terms of
       | efficiency.
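        | 
        | To make the "far fewer elements" point concrete, a toy numpy
        | sketch (mine, not from the paper): a one-coefficient IIR
        | filter versus the long FIR filter needed to approximate the
        | same exponentially-decaying response.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     x = rng.normal(size=500)      # arbitrary input signal
        | 
        |     # IIR: y[n] = 0.95*y[n-1] + x[n], one feedback coefficient.
        |     y, y_iir = 0.0, []
        |     for xn in x:
        |         y = 0.95 * y + xn
        |         y_iir.append(y)
        | 
        |     # FIR needs ~200 taps of 0.95**k to match it closely.
        |     taps = 0.95 ** np.arange(200)
        |     y_fir = np.convolve(x, taps)[:len(x)]
        | 
        |     # small, and only because we spent 200 taps on it
        |     print(np.max(np.abs(np.array(y_iir) - y_fir)))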
        
         | wslh wrote:
         | ELI5: Could you explain what neuromorphic approaches mean, and
         | how they contribute to AI/AGI? My first impression as a
         | layperson (probably wrong) is that this approach resembles
         | ideas from the book "The Society of the Mind", where the system
         | isn't just simulating neurons but involves a variety of methods
         | and interactions across "agents" or sub-systems.
        
           | bob1029 wrote:
           | Neuromorphic mostly just means "like how the brain works". It
           | encompasses a variety of software & hardware approaches.
           | 
           | The most compelling and obvious one to me is hardware
           | purpose-built to simulate spiking neural networks. In the
           | happy case, SNNs are extremely efficient. Basically consuming
           | no energy. You could fool yourself into thinking we can just
           | do this on the CPU due to the sparsity of activations. I
           | think there is even a set of problems this works well for.
           | But, in the unhappy cases SNNs are impossible to simulate on
           | existing hardware. Neuronal avalanches follow power law
           | distribution and meaningfully-large ones would require very
           | clever techniques to simulate with any reasonable fidelity.
           | 
           | > the system isn't just simulating neurons but involves a
           | variety of methods and interactions across "agents" or sub-
           | systems.
           | 
           | I think the line between "neuron" and "agent" starts to get
           | blurry in this arena.
        
             | seanhunter wrote:
             | We somehow want a network that is neuromorphic in structure
             | but we don't want it to be like the brain and take 20 years
             | or more to train?
             | 
             | Secondly how do we get to claim that a particular thing is
             | neuromorphic when we have such a rudimentary understanding
             | of how a biological brain works or how it generates things
             | like a model of the world, understanding of self etc etc.
        
               | planetpluta wrote:
               | Something to consider is that it really could take 20+
               | years to train like a brain. But once you've trained it,
               | you can replicate at ~0 cost, unlike a brain.
        
               | kybernetikos wrote:
               | > we don't want it to be like the brain and take 20 years
               | or more to train?
               | 
                | Estimates put training of GPT-4 at something like 2500
                | GPU-years, spread over about 10000 GPUs. 20 years would
                | be a big improvement.
        
               | seanhunter wrote:
               | 1 GPU year is in no way comparable to 1 chronological
               | year of learning for a human brain though.
        
               | kybernetikos wrote:
               | Yes, but the underlying point is that in this case you
               | can train the AI in parallel, and there's a decent chance
               | this or something like it will be true for future AI
               | architectures too. What does it matter that the AI needs
               | to be trained on 20 years of experiences if all of those
               | 20 years can be experienced in 6 months given the right
               | hardware?
        
             | wslh wrote:
             | My take, for pragmatic reasons rather than how the brain
             | actually works, is that an agent-based architecture is
             | great because some tasks can be solved more effectively by
             | specific algorithms or workflows rather than operating at
             | the low level of neural networks (NN).
        
           | mafribe wrote:
           | Neuromorphic has been an ongoing failure (for general purpose
           | processors or even AI accelerators), ever since Carver Mead
            | introduced (and quickly abandoned) them nearly half a
           | century ago. Bill Dally (NVidia CTO) concurs: _" I keep
           | getting those calls from those people who claim they are
           | doing neuromorphic computing and they claim there is
           | something magical about it because it's the way that the
           | brain works ... but it's truly more like building an airplane
           | by putting feathers on it and flapping with the wings!"_
           | From: Hardware for Deep Learning, HotChips 2023 keynote.
           | 
           | We have NO idea how the brain produces intelligence, and as
           | long as that doesn't change, "neuromorphic" is merely a
           | marketing term, like Neurotypical, Neurodivergent,
           | Neurodiverse, Neuroethics, Neuroeconomics, Neuromarketing,
           | Neurolaw, Neurosecurity, Neurotheology, Neuro-Linguistic
           | Programming: the "neuro-" prefix is suggesting a deep
           | scientific insight to fool the audience. There is no hope of
           | us cracking the question of how the human brain produces
           | high-level intelligence in the next decade or so.
           | 
           | Neuromorphic does work for some special purpose applications.
        
             | chasd00 wrote:
             | I like the feather analogy. Early on all humans knew about
             | flight was from biology (watching birds fly) but trying to
             | make a flying machine modeled after a bird would never
             | work. We can fly today but plane designs are nothing like
             | biological flying machines. In the same way, all we know
             | about intelligence comes from biology and trying to invent
             | an AGI modeled on biological intelligence may be just as
             | impossible as a plane designed around how birds fly.
             | 
             | /way out of my area of expertise here
        
               | quotemstr wrote:
               | And it's only now, having built our own different kind of
               | flying machine, that we understand the principles of
               | avian flight well enough to build our own ornithopters.
               | (We don't use ornithopters because they're not practical,
               | but we've known how to build them since the 1960s.) We
               | would have never gotten here had we just continued to try
               | to blindly copy birds.
        
           | fennecfoxy wrote:
           | I love this book and have it sitting on my shelf right now!
           | Read it when I was a kid and was amazed at the ideas in it,
           | nowadays it's clearer to me that the author only had a grasp
           | of how things like that would be built but still cool
           | nonetheless.
           | 
           | I would highly recommend it to people who love a good "near
           | future" scifi book.
        
             | bwanab wrote:
             | I'm sure you know this, but I think "the author" Marvin
             | Minsky should be mentioned by name since he was one of the
             | foundational theorists in the field of AI in general, but
             | particularly in NNs.
        
         | manjunaths wrote:
          | Can we even implement IIR filters with good performance and
          | scaling at large scale on current architectures like GPUs?
        
           | bob1029 wrote:
           | I don't think so. FIR filters can be unrolled and
           | parallelized over the data. These are definitely possible to
           | do on GPU to great effect. But, IIR filters constantly depend
           | on the output of the prior time step, so you can't unroll
           | anything. These would probably be faster to simulate on the
           | CPU.
        
         | x3haloed wrote:
         | > I don't think we get to the exponential scary part of AI
         | without some fundamentally recurrent architecture
         | 
         | I've been thinking the same for a while, but I'm starting to
         | wonder if giant context windows are good enough to get us
         | there. I think recurrency is more neuromorphic, and possibly
         | important in the longer run, but maybe not required for SI.
         | 
         | I'm also just a layman with just a surface level understanding
         | of these things, so I may be completely ignorant and wrong.
        
         | lr1970 wrote:
          | Again from signal processing: depending on the position of
          | the poles of the z-transformed filter transfer function, the
          | output of an IIR filter has a narrow stability region that is
          | typically carefully designed for. Otherwise IIR filters
          | either exponentially decay to zero or exponentially grow to
          | infinity. RNN cells like LSTM are "decaying filters" with
          | non-linear gates introduced to stop decay and to "remember"
          | things.
         | 
         | FIR filters are way simpler to design and can capture memory
         | without hacks.
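          | 
          | A tiny illustration of that stability knife-edge (my own
          | sketch): the same one-pole recurrence with the pole just
          | inside versus just outside the unit circle.
          | 
          |     import numpy as np
          | 
          |     def impulse_response(pole, steps=60):
          |         # y[n] = pole * y[n-1], unit impulse at n = 0
          |         y, out = 1.0, []
          |         for _ in range(steps):
          |             out.append(y)
          |             y = pole * y
          |         return np.array(out)
          | 
          |     print(impulse_response(0.9)[-1])   # ~0.002, decays
          |     print(impulse_response(1.1)[-1])   # ~277, blows up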
        
       | PunchTornado wrote:
       | To me this is further evidence that these LLMs learn only to
        | speak English, but there is no reasoning at all in them. If you
        | can simplify the architecture this much and obtain the same
        | results, while we know how complex the brain is, that points in
        | the same direction.
        
         | quantadev wrote:
         | Every LLM expert on the planet agrees LLMs are doing
         | "reasoning". No one says they have feelings or qualia, but we
         | all know there's definitely genuinely artificial reasoning
         | happening.
         | 
         | What LLMs have shown both Neuroscience and Computer Science is
         | that reasoning is a mechanical process (or can be simulated by
         | mechanical processes) and is not purely associated only with
         | consciousness.
        
           | roboboffin wrote:
           | I'm not sure that's true at all. There are several well known
           | researchers that say LLMs are in fact not doing reasoning.
        
             | quantadev wrote:
             | Those are all the people that have not yet decoupled
             | "reasoning" from "consciousness" in their own way of
             | thinking. It's admittedly hyperbolic to say "everyone". I
             | love hyperbole on HN. :)
        
               | roboboffin wrote:
               | For example, papers like this call into question whether
               | or not a LLM can plan:
               | 
               | https://arxiv.org/html/2409.13373v1
               | 
               | This is a basic form of reasoning, to plan out the steps
               | needed to execute something.
        
               | quantadev wrote:
               | Planning, by definition, takes multiple reasoning steps.
               | A single LLM inference is a fundamental single reasoning
               | step, but it's a reasoning step nonetheless.
               | 
               | It's like I'm saying a house is made of bricks. You can
               | build a house of any shape out of bricks. But once bricks
               | have been invented you can build houses. The LLM
               | "reasoning" that even existed as early as GPT3.5 was the
               | "brick" with which highly intelligent agents can be built
               | out of, with no further "breakthroughs" being required.
               | 
               | The basic Transformer Architecture was enough and already
               | has the magical ingredient of reasoning. The rest is just
               | a matter of prompt engineering.
        
               | roboboffin wrote:
                | It's not reasoning, it's retrieval of a pattern, and that
               | pattern may contain reasoning.
               | 
               | The prompt engineering is the real reasoning, provided by
               | the human.
        
               | quantadev wrote:
               | Yeah, these kinds of discussions always devolve purely
               | into debates about what's the proper definition of words.
               | Especially on HN where everyone has their "Pedantic Knob"
               | dialed up to 11.
        
               | roboboffin wrote:
                | I understand your point. I apologise if I am coming
                | across as pedantic.
               | 
               | My point is computers already follow algorithms, and
               | algorithms contain reasoning; but the computers are not
               | reasoning themselves. At least, not yet!
        
               | arolihas wrote:
               | You're not being pedantic at all. It's a crucial
               | distinction that people try to wave away in favor of
               | hype. Especially since we are so vulnerable to
               | anthropomorphizing.
        
               | quantadev wrote:
               | You weren't being pedantic yourself. My point is that
               | this discussion is ultimately about the definition of
               | words, and that all by itself, makes the discussion
               | meaningless.
               | 
               | I think a "granule" of "reasoning" happens at each
               | inference, and you think there is no reasoning in a
               | single inference. To discuss it further would be a game
               | of whose definition of any given word is correct.
        
       | adamnemecek wrote:
       | Yes, all machine learning can be interpreted in terms of
       | approximating the partition function.
       | 
       | This is obvious when one considers the connections between
       | Transformers, RNNs, Hopfield networks and the Ising model, a
       | model from statistical mechanics which is solved by calculating
       | the partition function.
       | 
       | This interpretation provides us with some very powerful tools
       | that are commonplace in math and physics but which are not talked
       | about in CS & ML.
       | 
       | I'm working on a startup http://traceoid.ai which takes this
       | exact view. Our approach enables faster training and inference,
       | interpretability and also scalable energy-based models, the Holy
       | Grail of machine learning.
       | 
       | Join the discord https://discord.com/invite/mr9TAhpyBW or follow
       | me on twitter https://twitter.com/adamnemecek1
        
       | mkaic wrote:
       | I strongly enjoy the simplicity of their "minGRU" architecture.
        | It's basically just:
        | 
        |   import torch
        |   from torch import nn
        | 
        |   class MinGRU(nn.Module):
        |       def __init__(self, token_size, hidden_size):
        |           super().__init__()
        |           self.token_to_proposal = nn.Linear(
        |               token_size, hidden_size)
        |           self.token_to_mix_factors = nn.Linear(
        |               token_size, hidden_size)
        | 
        |       def forward(self, previous_hidden_state, current_token):
        |           proposed_hidden_state = self.token_to_proposal(
        |               current_token)
        |           mix_factors = torch.sigmoid(
        |               self.token_to_mix_factors(current_token))
        |           return torch.lerp(proposed_hidden_state,
        |                             previous_hidden_state, mix_factors)
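        | 
        | Hypothetical usage, just to show the recurrent rollout (the
        | sizes here are made up):
        | 
        |   layer = MinGRU(token_size=16, hidden_size=32)
        |   tokens = torch.randn(10, 16)   # a length-10 sequence
        |   h = torch.zeros(32)
        |   for t in range(10):            # step-by-step rollout
        |       h = layer(h, tokens[t])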
       | 
       | And since the proposed hidden states and mix factors for each
       | layer are both only dependent on the current token, you can
       | compute all of them in parallel if you know the whole sequence
       | ahead of time (like during training), and then combine them in
       | linear time using parallel scan.
       | 
       | The fact that this is competitive with transformers and state-
       | space models in their small-scale experiments is gratifying to
       | the "best PRs are the ones that delete code" side of me. That
       | said, we won't know for sure if this is a capital-B Breakthrough
       | until someone tries scaling it up to parameter and data counts
       | comparable to SOTA models.
       | 
       | One detail I found really interesting is that they seem to do all
       | their calculations in log-space, according to the Appendix. They
       | say it's for numerical stability, which is curious to me--I'm not
       | sure I have a good intuition for why running everything in log-
       | space makes the model more stable. Is it because they removed the
       | tanh from the output, making it possible for values to explode if
       | calculations are done in linear space?
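        | 
        | My best guess at the intuition (a toy illustration of my own,
        | not from the paper): the unrolled recurrence multiplies long
        | chains of gate-like values together, and long products
        | underflow quickly in float32, while the equivalent sums of
        | logs stay well-behaved.
        | 
        |   import numpy as np
        | 
        |   gates = np.full(2000, 0.9, dtype=np.float32)
        |   print(np.cumprod(gates)[-1])   # 0.0, underflowed long ago
        |   print(np.sum(np.log(gates)))   # about -210.7, still fine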
       | 
       | EDIT: Another thought--it's kind of fascinating that this sort of
       | sequence modeling works at all. It's like if I gave you all the
       | pages of a book individually torn out and in a random order, and
       | asked you to try to make a vector representation for each page as
       | well as instructions for how to mix that vector with the vector
       | representing all previous pages -- except you have zero knowledge
       | of those previous pages. Then, I take all your page vectors,
       | sequentially mix them together in-order, and grade you based on
       | how good of a whole-book summary the final vector represents.
       | Wild stuff.
       | 
       | FURTHER EDIT: Yet _another_ thought--right now, they 're just
       | using two dense linear layers to transform the token into the
       | proposed hidden state and the lerp mix factors. I'm curious what
       | would happen if you made those transforms MLPs instead of
       | singular linear layers.
        
         | immibis wrote:
         | This architecture, on the surface, seems to preclude the basic
         | function of recognizing sequences of tokens. At the very least,
         | it seems like it should suffer from something like the pumping
         | lemma: if [the ][cat ][is ][black ] results in the output
         | getting close to a certain vector, [the ][cat ][is ][black
         | ][the ][cat ][is ][black ][the ][cat ][is ][black ] should get
         | even closer to that vector and nowhere close to a "why did you
         | just repeat the same sentence three times" vector? Without non-
         | linear mixing between input token and hidden state, there will
         | be a lot of linear similarities between similar token
         | sequences...
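          | 
          | Toy numbers for what I mean, using the lerp-style update
          | with made-up per-token values (my own sketch):
          | 
          |     import numpy as np
          | 
          |     # proposals p and mix factors z depend only on the
          |     # token, so a repeated sentence re-applies the same
          |     # affine updates.
          |     p = np.array([[1., 0.], [0., 1.], [2., 2.], [1., 3.]])
          |     z = np.full(4, 0.5)
          | 
          |     def run(h, repeats):
          |         for _ in range(repeats):
          |             for pt, zt in zip(p, z):
          |                 h = (1 - zt) * h + zt * pt
          |         return h
          | 
          |     h1 = run(np.zeros(2), 1)
          |     h2 = run(np.zeros(2), 2)
          |     h3 = run(np.zeros(2), 3)
          |     print(np.linalg.norm(h2 - h1))   # ~0.15
          |     print(np.linalg.norm(h3 - h2))   # ~0.009
          | 
          | Each repetition just pulls the state closer to the same
          | fixed point, rather than producing anything that marks the
          | text as "the same sentence three times".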
        
           | mkaic wrote:
           | Counterpoint: the hidden state at the beginning of
           | ([the][cat][is][black]) x 3 is (probably) initialized to all
           | zeros, but after seeing those first 4 tokens, it will _not_
           | be all zeros. Thus, going into the second repetition of the
           | sentence, the model has a different initial hidden state, and
           | should exhibit different behavior. I think this makes it
           | possible for the model to learn to recognize repeated
           | sequences and avoid your proposed pitfall.
        
             | immibis wrote:
             | The new hidden state after the first repetition will just
             | be a linear combination between zero and what the non-
             | recurring network outputs. After more repetitions, it will
             | be closer to what the network outputs.
        
         | slashdave wrote:
         | Log space is important if the token probabilities span a large
         | range of values (powers). There is a reason that maximum
         | likelihood fitting is always performed with log likelihoods.
        
         | aDyslecticCrow wrote:
         | I don't think it's a capital-B Breakthrough, but recurrent
          | networks are everywhere, and a simplification that improves
          | training and performance clears the stage to build complexity
          | back up again to even higher heights.
        
       | trott wrote:
       | My feeling is that the answer is "no", in the sense that these
       | RNNs wouldn't be able to universally replace Transformers in
       | LLMs, even though they might be good enough in some cases and
       | beat them in others.
       | 
       | Here's why.
       | 
       | A user of an LLM _might_ give the model some long text and then
       | say  "Translate this into German please". A Transformer can look
       | back at its whole history. But what is an RNN to do? While the
       | length of its context is unlimited, the amount of information the
       | model retains about it is bounded by whatever is in its hidden
       | state at any given time.
       | 
       | Relevant: https://arxiv.org/abs/2402.01032
        
         | mkaic wrote:
         | The counterargument here is that you can just scale the size of
         | the hidden state sufficiently such that it can hold compressed
         | representations of whatever-length sequence you like.
         | Ultimately, what I care about is whether RNNs could compete
         | with transformers if FLOPs are held constant--something TFA
         | doesn't really investigate.
        
           | psb217 wrote:
           | Well, that's what Transformer already does... One problem
           | with the scaling you're describing is that there would be a
           | massive amount of redundant information stored in hidden
           | activations during training the RNN. The hidden state at each
           | time step t in the sequence would need to contain all info
           | that (i) could be useful for predicting the token at time t
           | and (ii) that could be useful for predicting tokens at times
            | >t. (i) is obvious and (ii) follows since all information about
           | the past is transferred to future predictions through the
           | current hidden state. In principle, Transformers can avoid
           | storing redundant info in multiple hidden states at the cost
           | of having to maintain and access (via attention) a larger
           | hidden state at test/eval time.
        
             | mkaic wrote:
             | > there would be a massive amount of redundant information
             | stored in hidden activations
             | 
             | Is there a way to prove this? One potential caveat that
             | comes to mind for me is that perhaps the action of lerping
             | between the old state and the new could be used by the
             | model to perform semantically meaningful transformations on
             | the old state. I guess in my mind it just doesn't seem
             | obvious that the hidden state is necessarily a collection
             | of "redundant information" -- perhaps the information is
             | culled/distilled the further along in the sequence you go?
             | There will always be _some_ redundancy, sure, but I don 't
             | think that such redundancy necessarily means we _have_ to
             | use superlinear methods like attention.
        
               | psb217 wrote:
               | All information about the past which will be available
               | for predicting future tokens must be stored in the
               | present state. So, if some bits of info about some past
               | tokens at times less than t_p will be used for predicting
               | some future token at time t_f, those bits must be passed
               | through all states at times from t_p to t_f. The bits are
               | passed through the recurrence. Once information about
               | past tokens is lost from the hidden state it is gone
               | forever, so it must be stored and carried across many
               | steps up until it finally becomes useful.
               | 
               | The information cost of making the RNN state way bigger
               | is high when done naively, but maybe someone can figure
               | out a clever way to avoid storing full hidden states in
               | memory during training or big improvements in hardware
               | could make memory use less of a bottleneck.
        
         | phkahler wrote:
         | >> A user of an LLM might give the model some long text and
         | then say "Translate this into German please". A Transformer can
         | look back at its whole history.
         | 
          | Which isn't necessary if you say "translate the following to
          | German" instead; then all it needs is to remember the task at
          | hand and a much smaller amount of recent input. Well, that
          | and the ability to output in parallel with processing input.
        
           | og_kalu wrote:
           | It's necessary for arbitrary information processing if you
           | can forget and have no way to "unforget".
           | 
           | A model can decide to forget something that turns out to be
           | important for some future prediction. A human can go back and
            | re-read/listen etc. A transformer is always re-reading, but
            | an RNN can't and is fucked.
        
             | magicalhippo wrote:
              | That's just because we twisted its arm. One could for
             | example feed the reversed input after, ie abc|cba where |
             | is a special token. That would allow it to react to any
             | part of the message.
        
               | ebalit wrote:
               | I think this might be key, in addition to some landmark
               | tokens to quickly backtrack to. The big question is how
                | to train such a model.
               | 
                | There is a recent paper from Meta that proposes a way to
               | train a model to backtrack its generation to improve
               | generation alignment [0].
               | 
               | [0] https://arxiv.org/html/2409.14586v1
        
             | tsimionescu wrote:
              | If these networks are ever to be a path to something
              | closer to general intelligence, they will anyway need to
              | be able to
             | ask for context to be repeated, or to have separate storage
             | where they can "choose" to replay it themselves. So this
             | problem likely has to be solved another way anyway, both
             | for transformers and for RNNs.
        
               | og_kalu wrote:
               | For a transformer, context is already always being
               | repeated every token. They can fetch information that
               | _became_ useful anytime they want. I don 't see what
               | problem there is to solve here.
        
               | tsimionescu wrote:
               | For a transformer, context is limited, so the same kind
               | of problem applies after you exceed some size.
        
           | trott wrote:
           | People did something similar to what you are describing 10
           | years ago: https://arxiv.org/abs/1409.0473
           | 
           | But it's trained on translations, rather than the whole
           | Internet.
        
           | DoctorOetker wrote:
           | Also, a lightweight network could do a first pass to identify
           | tasks, instructions, constraints etc, and then a second pass
           | could use the RNN.
           | 
           | Consider the flood fill algorithm or union-find algorithm,
           | which feels magical upon first exposure.
           | 
           | https://en.wikipedia.org/wiki/Hoshen%E2%80%93Kopelman_algori.
           | ..
           | 
           | Having 2 passes can enable so much more than a single pass.
           | 
           | Another alternative could be to have a first pass make notes
           | in a separate buffer while parsing the input. The bandwidth
           | of the note taking and reading can be much much lower than
           | that required for fetching the billions of parameters.
        
         | slashdave wrote:
         | > the amount of information the model retains about it is
         | bounded by whatever is in its hidden state
         | 
         | This is no different than a transformer, which, after all, is
         | bound by a finite state, just organized in a different manner.
        
           | trott wrote:
           | > This is no different than a transformer, which, after all,
           | is bound by a finite state, just organized in a different
           | manner.
           | 
           | It's not just a matter of organizing things differently.
           | Suppose your network dimension and sequence length are both
           | X.
           | 
           | Then your memory usage (per layer) will be O(X^2), while your
           | training update cost will be O(X^3). That's for both
           | Transformers and RNNs.
           | 
           | However, at the end of the sequence, a Transformer layer can
           | look back see O(X^2) numbers, while an RNN can only see O(X)
           | numbers.
        
             | slashdave wrote:
             | Simplistic thinking. An RNN hidden parameter space of high
             | dimension provides plenty of room for linear projections of
             | token histories. I think people just do not realize just
             | how huge R^N can be.
        
               | trott wrote:
               | > Simplistic thinking. An RNN hidden parameter space of
               | high dimension provides plenty of room for linear
               | projections of token histories. I think people just do
               | not realize just how huge R^N can be.
               | 
               | 16N bits as hard limit, but more realistically, about 2N
               | bits or less of useful information probably.
               | 
               | You'd need to grow the network dimension in proportion to
               | the maximum sequence length just to avoid the information
               | theoretical limit.
        
             | f_devd wrote:
              | Transformers actually have a quantifiable state size (see 
             | https://hazyresearch.stanford.edu/static/posts/2024-06-22-a
             | c...) although it's anywhere between 200k and 2M floats
             | (for 360M and 1.33B respectively iinm). So a sufficiently
             | sized RNN could have the same state capacity as a
             | transformer.
             | 
             | (this is from the Based paper:
             | https://arxiv.org/pdf/2402.18668)
        
               | trott wrote:
               | > Transformers actually have an quantifiable state size
               | 
               | Are you griping about my writing O(X^2) above instead of
               | precisely 2X^2, like this paper? The latter implies the
               | former.
               | 
               | > So a sufficiently sized RNN could have the same state
               | capacity as a transformer.
               | 
               | Does this contradict anything I've said? If you increase
               | the size of the RNN, while keeping the Transformer fixed,
               | you can match their recurrent state sizes (if you don't
               | run out of RAM or funding)
        
               | f_devd wrote:
               | I was responding to
               | 
               | > a Transformer layer can look back see O(X^2) numbers,
               | while an RNN can only see O(X) numbers
               | 
                | The thing is an RNN can look back arbitrarily far if
                | you don't exceed the state capacity. For transformers
                | the state is defined semi-implicitly by the number of
                | tokens (you can change the hidden dims but you cannot
                | extend the look-back; ignoring transformer-xl et al.),
                | while for an RNN it's defined explicitly by the state
                | size.
               | 
               | The big-O here is irrelevant for the architectures since
               | it's all in the configuration & implementation of the
               | model; i.e. there is no relevant asymptote to compare.
               | 
               | As an aside this was what was shown in the based paper,
               | the fact that you can have a continuity of state (as with
                | RNN) while having the same associative recall capability as
               | a transformer (the main downfall of recurrent methods at
               | that point).
        
               | trott wrote:
               | > The big-O here is irrelevant for the architectures
               | since it's all in the configuration & implementation of
               | the model; i.e. there is no relevant asymptote to
               | compare.
               | 
               | ?!
               | 
               | NNs are like any other algorithm in this regard. Heck,
               | look at the bottom of page 2 of the Were RNNs All We
               | Needed paper. It has big-O notation there and elsewhere.
               | 
               | > I was responding to
               | 
               | >> a Transformer layer can look back see O(X^2) numbers,
               | while an RNN can only see O(X) numbers
               | 
               | In the BASED paper, in Eq. 10, sizeof(s) = 2dN. But I
               | defined d = N = X above. Ergo, sizeof(s) = 2X^2 = O(X^2).
               | 
               | For minGRU, sizeof(s) = d. Ergo, sizeof(s) = X = O(X).
        
               | f_devd wrote:
               | That's just the state calculation which would be O(N) and
               | O(1) respectively. The based paper is saying if you made
               | Transformers recurrent you would have a state size of 2Nd
               | -> O(N), while based has a state size of d*d' -> O(1).
               | 
               | Transformers do have O(N^2) time & memory complexity, and
               | Based/RNN/SSM {O(N) time, O(1) mem}, with respect to
               | sequence length if that's what you mean. The point is it
               | doesn't really give an indication of quality.
               | 
               | We can choose our constant arbitrarily so the big-O
               | you've stated only indicates memory/time-complexity not
               | 'look-back' ability relevant to any task. If you input
               | the entire sequence N times into an RNN, you also have
               | perfect recall with O(N^2) but it's not exactly an
               | efficient use of our resources.
               | 
               | Ideally our state memory is maximally utilized, this is
               | the case for RNNs in the limit (although likely
               | oversubscribed) but is not the case for transformers. The
               | holy grail is to have an input-dependent state-size,
               | however that is quite difficult.
        
         | tgv wrote:
         | That problem has plagued RNNs since the 90s: there's an
         | information precision problem (how many bits do you need older
         | states to carry), a decay problem (the oldest information is
         | the weakest) and a mixing problem (it tends to mix/sum
         | representations).
        
       | fhdsgbbcaA wrote:
       | We really need a [preprint] flag for unreviewed papers.
        
         | lgessler wrote:
         | IMHO reviews are almost indistinguishable from noise at the AI
         | conferences I'm familiar with these days anyway, so I don't see
         | much of a value add.
        
           | fhdsgbbcaA wrote:
           | Sad state of affairs, people are incentivized to get more
           | papers and citations at all costs, and quality be damned.
           | 
            | An AI Winter is not a great idea, but an AI Autumn may be
           | beneficial.
           | 
           | Just have no major AI conferences for '25, perhaps only
           | accept really high tier literature reviews.
        
       | limapedro wrote:
        | This is such an interesting paper; sadly they don't have big
        | models. I'd like to see a model trained on TinyStories or even
        | C4, since it should be faster than the transformer variant,
        | and see how it compares.
        
       | charlescurt123 wrote:
       | I find the entire field lacking when it comes to long-horizon
       | problems. Our current, widely used solution is to scale, but
       | we're nowhere near achieving the horizon scales even small mammal
       | brains can handle. Our models can have trillions of parameters,
       | yet a mouse brain would still outperform them on long-horizon
       | tasks and efficiency. It's something small, simple, and elegant--
       | an incredible search algorithm that not only finds near-optimal
       | routes but also continuously learns on a fixed computational
       | budget.
       | 
       | I'm honestly a bit envious of future engineers who will be
       | tackling these kinds of problems with a 100-line Jupyter notebook
       | on a laptop years from now. If we discovered the right method or
       | algorithm for these long-horizon problems, a 2B-parameter model
       | might even outperform current models on everything except short,
       | extreme reasoning problems.
       | 
       | The only solution I've ever considered for this is expanding a
       | model's dimensionality over time, rather than focusing on perfect
       | weights. The higher dimensionality you can provide to a model,
       | the greater its theoretical storage capacity. This could resemble
       | a two-layer model--one layer acting as a superposition of
       | multiple ideal points, and the other layer knowing how to use
       | them.
       | 
       | When you think about the loss landscape, imagine it with many
       | minima for a given task. If we could create a method that
       | navigates these minima by reconfiguring the model when needed, we
       | could theoretically develop a single model with near-infinite
       | local minima--and therefore, higher-dimensional memory. This may
       | sound wild, but consider the fact that the human brain
       | potentially creates and disconnects thousands of new connections
       | in a single day. Could it be that these connections steer our
       | internal loss landscape between different minima we need
       | throughout the day?
        
         | aDyslecticCrow wrote:
         | Yes... The field lacks the HOLY GRAIL (long-horizon problems).
         | But we don't need a mouse brain to sort spam emails. The Hail
         | Mary models of 2B+ parameters are still a niche use of these
         | algorithms (too heavy to run practically). There is plenty of
         | room for clever, small models running on limited hardware and
         | datasets that solve useful problems and nothing more.
         | 
         | Models that change size as needed have been experimented with,
         | but they are either too inefficient or difficult to optimize at
         | a limited power budget. However, I agree that they are likely
         | what is needed if we want to continue to scale upward in size.
         | 
         | I suspect the real bottleneck is a breakthrough in training
         | itself. Backpropagating a loss is too simplistic to optimize
         | our current models perfectly, let alone future, larger ones.
         | But there is no guarantee that a better alternative exists,
         | which may put a hard ceiling on current ML approaches.
        
       | kgbcia wrote:
       | Decision trees is all we needed
        
       | vandahm wrote:
       | I made an RNN for a college project because I was interested in
       | obsolete historical technology and thought I needed to seize the
       | opportunity while it lasted: once I was out of school, I'd never
       | hear about neural networks ever again.
       | 
       | Mine worked, but it was very simple and dog slow, running on my
       | old laptop. Nothing was ever going to run fast on that thing, but
       | I remember my RNN being substantially slower than a feed-forward
       | network would have been.
       | 
       | I was _so confident_ that this was dead technology -- an academic
       | curiosity from the 1980s and 1990s. It was bizarre to see how
       | quickly that changed.
        
         | alkonaut wrote:
         | I feel old. I wrote my master's thesis on RNNs for learning
         | dynamic systems, e.g. for control purposes (quite a novelty at
         | the time, around 2000). We wrote the backprop in C++ and ran it
         | overnight. Yes, it was slow as hell with the tiny gradients.
         | The network architectures were e.g. 5 or 10 neurons in a single
         | hidden layer. NNs were a tiny subject that you were lucky to
         | find courses in. Then I closed my eyes for two seconds and
         | looked at the subject again in 2015. Wow.
        
       | gdiamos wrote:
       | RNNs always had better scaling law curves than transformers.
       | 
       | BPTT (backpropagation through time) was their problem.
        
       | Smerity wrote:
       | Excited to see more people working on RNNs but wish their
       | citations were better.
       | 
       | In 2016 my team from Salesforce Research published our work on
       | the Quasi-Recurrent Neural Network[1] (QRNN). The QRNN variants
       | we describe are near identical (minGRU) or highly similar
       | (minLSTM) to the work here.
       | 
       | The QRNN was used, many years ago now, in the first version of
       | Baidu's speech recognition system (Deep Voice [6]) and as part of
       | Google's handwriting recognition system in Gboard[5] (2019).
       | 
       | Even if there are expressivity trade-offs when using
       | parallelizable RNNs they've shown historically they can work well
       | and are low resource and incredibly fast. Very few of the
       | possibilities regarding distillation, hardware optimization, etc,
       | have been explored.
       | 
       | Even if you need "exact" recall, various works have shown that
       | even a single layer of attention with a parallelizable RNN can
       | yield strong results. Distillation down to such a model is quite
       | promising.
       | 
       | Other recent fast RNN variants such as the RWKV, S4, Mamba et al.
       | include citations to QRNN (2016) and SRU (2017) for a richer
       | history + better context.
       | 
       | The SRU work has also had additions in recent years (SRU++),
       | doing well in speech recognition and LM tasks where they found
       | similar speed benefits over Transformers.
       | 
       | I note this primarily because the more data points, especially
       | strongly relevant ones, the better positioned the research is. A
       | number of the "new" findings from this paper have been previously
       | explored - and certainly do show promise! Better citations make
       | sure we're asking new questions with new insights (with all the
       | benefit of additional research from ~8 years ago) rather than
       | missing that earlier work.
       | 
       | [1] QRNN paper: https://arxiv.org/abs/1611.01576
       | 
       | [2] SRU paper: https://arxiv.org/abs/1709.02755
       | 
       | [3]: SRU++ for speech recognition:
       | https://arxiv.org/abs/2110.05571
       | 
       | [4]: SRU++ for language modeling:
       | https://arxiv.org/abs/2102.12459
       | 
       | [5]: https://research.google/blog/rnn-based-handwriting-
       | recogniti...
       | 
       | [6]: https://arxiv.org/abs/1702.07825
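       | 
       | For readers who want the gist of the shared idea: these layers
       | reduce to a gated linear recurrence whose coefficients depend
       | only on the current input, which is what makes a parallel
       | prefix scan possible. A rough sketch in Python (illustrative
       | shapes and names, not code from either paper; real layers use
       | learned linear maps of x_t for the gate and candidate):
       | 
       |     import numpy as np
       | 
       |     def combine(f, g):
       |         # compose two maps h -> a*h + b; associativity of
       |         # this operator is what enables a parallel scan
       |         (a1, b1), (a2, b2) = f, g
       |         return a1 * a2, a2 * b1 + b2
       | 
       |     rng = np.random.default_rng(0)
       |     T, d = 16, 4
       |     x = rng.normal(size=(T, d))
       |     z = 1.0 / (1.0 + np.exp(-x))  # update gate from x_t only
       |     h_tilde = np.tanh(x)          # candidate from x_t only
       |     a, b = 1.0 - z, z * h_tilde   # h_t = a_t*h_{t-1} + b_t
       | 
       |     # reference: plain sequential evaluation
       |     h, seq = np.zeros(d), []
       |     for t in range(T):
       |         h = a[t] * h + b[t]
       |         seq.append(h)
       | 
       |     # same result via a prefix scan over `combine` (written
       |     # sequentially here; the point is that it parallelizes)
       |     acc, scan = (np.ones(d), np.zeros(d)), []
       |     for t in range(T):
       |         acc = combine(acc, (a[t], b[t]))
       |         scan.append(acc[1])       # h_t, assuming h_0 = 0
       | 
       |     assert np.allclose(seq, scan)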
        
       | hdivider wrote:
       | I still find it remarkable how we need such an extreme amount of
       | electrical energy to power large modern AI models.
       | 
       | Compare with one human brain. Far more sophisticated, even beyond
       | our knowledge. What does it take to power it for a day? Some
       | vegetables and rice. Still fine for a while if you supply pure
       | junk food -- it'll still perform.
       | 
       | Clearly we have a long, long way to go in terms of the energy
       | efficiency of AI approaches. Our so-called _neural_ nets clearly
       | don't come anywhere near the energy efficiency of actual
       | biological neurons.
        
         | Arch485 wrote:
         | It's even less! A lot of those vegetables and rice go into
         | powering your heart, muscles, organs, etc. and only a fraction
         | is used for the brain.
         | 
         | Maybe the future of AI is in organic neurons?
        
         | jjmarr wrote:
         | Food is extremely dense in energy. 1 food calorie is about 1.1
         | Watt-hours. A hamburger is about 490 Wh. An AI model requires
         | 0.047 kWh = 47 Wh to generate 1000 text responses.[1] If an LLM
         | could convert hamburgers to energy, it could generate over
         | 10000 prompt completions on a single hamburger.
         | 
         | Based on my own experience, I would struggle to generate that
         | much text without fries and a drink.
         | 
         | [1] https://www.theverge.com/24066646/ai-electricity-energy-
         | watt...
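          | 
          | A back-of-the-envelope check of that arithmetic (figures as
          | quoted above, all approximate):
          | 
          |     burger_wh = 490              # a hamburger, in watt-hours
          |     wh_per_1000_responses = 47   # 0.047 kWh per 1000 [1]
          |     completions = burger_wh / wh_per_1000_responses * 1000
          |     print(round(completions))    # ~10,400 per hamburger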
        
           | hdivider wrote:
           | During that time, your brain would do _far_ more than just
           | that text generation though, beyond what we even know
           | scientifically.
           | 
           | But yes, food energy could be useful for AI. A little
           | dystopian potentially too, if you think about it. Like
           | DARPA's EATR robot, able to run on plant biomass (although
           | potentially animal biomass too, including human remains):
           | 
           | https://en.wikipedia.org/wiki/Energetically_Autonomous_Tacti.
           | ..
        
             | jjmarr wrote:
              | My point is that AI is more energy-efficient than a human
              | doing the same language-generation task.
        
         | Legend2440 wrote:
         | This is more likely to be a hardware issue than an algorithms
         | issue. The brain physically is a neural network, as opposed to
         | a software simulation of one.
        
       | lettergram wrote:
       | In 2016 & 2017 my team at Capital One built several >1B parameter
       | models combining LSTMs with a few other tricks.
       | 
       | We were able to build generators that could replicate any dataset
       | they were trained on, and would produce unique deviations, but
       | match the statistical underpinnings of the original datasets.
       | 
       | https://medium.com/capital-one-tech/why-you-dont-necessarily...
       | 
       | We built several text generators for bots that similarly had very
       | good results. The introduction of the transformer improved the
       | speed and reduced the training / data requirements, but honestly
       | the accuracy changed minimally.
        
       | moi2388 wrote:
       | Yes, and it's hardly surprising, since the Chinese room thought
       | experiment is completely wrong; that is in fact exactly how you
       | learn something.
        
       | theanonymousone wrote:
       | I remember that, the way I understood it, transformers solved two
       | major "issues" of RNNs that enabled the later boom: vanishing
       | gradients limiting the context (and model?) size, and difficulty
       | in parallelisation limiting the size of the training data.
       | 
       | Do we have solutions for these two problems now?
        
         | ebalit wrote:
          | Transformers can also fetch, at any moment, any previous
          | information that _becomes useful_.
          | 
          | RNNs are constantly updating and overwriting their memory. That
          | means they need to be able to predict what is going to be
          | useful in order to store it for later.
         | 
         | This is a massive advantage for Transformers in interactive use
         | cases like in ChatGPT. You give it context and ask questions in
         | multiple turns. Which part of the context was important for a
         | given question only becomes known later in the token sequence.
         | 
         | To be more precise, I should say it's an advantage of
         | Attention-based models, because there are also hybrid models
         | successfully mixing both approaches, like Jamba.
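          | 
          | A toy way to picture the difference (illustrative only, not
          | real model code): an attention-style cache grows with the
          | sequence and can be queried later, while a fixed-size
          | RNN-style state has to be overwritten.
          | 
          |     kv_cache = []       # grows; any past token retrievable
          |     state = [0.0] * 8   # fixed size; must choose what to keep
          |     for token in range(1000):
          |         kv_cache.append(float(token))    # O(T) memory
          |         state[token % 8] = float(token)  # O(1), overwrites
          |     print(len(kv_cache), len(state))     # 1000 vs 8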
        
           | visarga wrote:
           | You could theoretically run the input twice, allowing the
           | model to correlate later tokens with previous ones. It would
           | fix the problem with not knowing what information to retain.
           | A more complicated approach would train the RNN to request
           | replaying some earlier data when needed.
           | 
            | A great thing about RNNs is that they can easily fork the
            | state and generate trees; it would be possible to backtrack
            | and work on combinatorial search problems.
            | 
            | It's also easier to cache demonstrations for free in the
            | initial state: a model that has seen lots of data uses no
            | more memory than a model starting from scratch.
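            | 
            | For instance, forking a recurrent state is just copying a
            | fixed-size vector (a sketch; the hidden size is an assumed
            | placeholder):
            | 
            |     import copy
            | 
            |     state = {"h": [0.0] * 256}   # assumed hidden size
            |     branch_a = copy.deepcopy(state)
            |     branch_b = copy.deepcopy(state)
            |     # each branch can now be advanced independently,
            |     # token by token, without re-encoding the shared
            |     # prefix; the cost per fork is O(hidden_size)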
        
             | imjonse wrote:
             | Something like this?
             | 
             | https://hazyresearch.stanford.edu/blog/2024-07-01-jrt
        
               | visarga wrote:
               | Yes, that's the paper.
        
         | YeGoblynQueenne wrote:
         | Vanishing (or exploding) gradients affected all deep
         | architectures, not just RNNs. They were solved by LSTMs first
         | proposed in 1997. See:
         | 
         | https://www.semanticscholar.org/paper/Long-Short-Term-Memory...
         | 
         | I find it interesting that this knowledge seems to be all but
         | forgotten now. Back in the day, ca. 2014, LSTMs were all the
         | rage, e.g. see:
         | 
         | https://karpathy.github.io/2015/05/21/rnn-effectiveness/
         | 
         | https://colah.github.io/posts/2015-08-Understanding-LSTMs/
        
           | aDyslecticCrow wrote:
            | LSTMs and GRUs did not quite solve the issue, but they made
            | it less bad. Overall, recurrent units are notoriously prone
            | to vanishing and exploding gradients.
            | 
            | I don't want to downplay the value of these models. Some
            | people seem to be under the impression that transformers
            | replaced them or made them obsolete, which is far from the
            | truth.
        
           | jszymborski wrote:
           | > They were solved by LSTMs first proposed in 1997.
           | 
           | I see this stuff everywhere online and it's often taught this
           | way so I don't blame folks for repeating it, but I think it's
           | likely promulgated by folks who don't train LSTMs with long
           | contexts.
           | 
           | LSTMs do add something like a "skip-connection" (before that
           | term was a thing) which helps deal with the catastrophic
           | vanishing gradients you get from e.g. Jordan RNNs right from
           | the jump.
           | 
            | However (!), while this stops us from seeing vanishing
            | gradients after e.g. 10s or 100s of time-steps, when you get
            | into multiple 1000s of tokens, the wheels start falling off.
            | I saw this in my own research: training on amino acid
            | sequences of length 3,000 led to a huge amount of
            | instability. It was only after tokenizing the amino acid
            | sequences (which was uncommon at the time), which got us
            | down to ~1500 timesteps on average, that we started seeing
            | stable losses during training. Check out the ablation at
            | [0].
           | 
            | You can think of ResNets by analogy. ResNets didn't "solve"
            | vanishing gradients - there's still a practical limit to the
            | depth of networks - but they did go a long way towards
            | dealing with the problem.
           | 
            | EDIT: I wanted to add that, while I was trying to
            | troubleshoot this for myself, it was super hard to find
            | evidence online of why I was seeing instability. Everything
            | pertaining to "vanishing gradients" and LSTMs was blog posts
            | and pre-prints which just merrily repeated "LSTMs solve the
            | problem of vanishing gradients". That made it hard for me, a
            | junior PhD student at the time, to suss out the fact that
            | LSTMs do demonstrably and reliably suffer from vanishing
            | gradients at longer contexts.
           | 
           | [0] https://academic.oup.com/bioinformatics/article/38/16/395
           | 8/6...
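            | 
            | A toy illustration of why context length matters here
            | (numbers are made up, not from the paper above): if the
            | effective per-step gradient gain sits slightly below 1,
            | depth in time compounds it.
            | 
            |     gain = 0.995   # assumed per-step gating factor
            |     for steps in (100, 1500, 3000):
            |         print(steps, gain ** steps)
            |     # 100  -> ~0.61
            |     # 1500 -> ~5.4e-4
            |     # 3000 -> ~2.9e-7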
        
             | jph00 wrote:
             | Highway networks add a skip connection, but LSTMs don't.
             | Btw you might be interested in truncated backprop thru
             | time, which we introduced in our ULMFiT paper.
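              | 
              | A minimal sketch of truncated BPTT for anyone curious
              | (PyTorch; the sizes and the loss are illustrative
              | placeholders, not the ULMFiT code):
              | 
              |     import torch, torch.nn as nn
              | 
              |     rnn = nn.LSTM(input_size=32, hidden_size=64,
              |                   batch_first=True)
              |     opt = torch.optim.Adam(rnn.parameters())
              |     x = torch.randn(8, 3000, 32)  # long sequences
              |     state = None
              |     # gradients only flow inside each 200-step window
              |     for chunk in x.split(200, dim=1):
              |         out, state = rnn(chunk, state)
              |         loss = out.pow(2).mean()  # stand-in loss
              |         opt.zero_grad()
              |         loss.backward()
              |         opt.step()
              |         # detach to cut the graph between windows
              |         state = tuple(s.detach() for s in state)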
        
               | jszymborski wrote:
                | I was referring to how the context vectors help avoid
                | vanishing gradients by behaving very similarly to skip
                | connections, but yes, they aren't skip connections as
                | such. That's been my understanding, at least.
               | 
               | We haven't tried truncated BPTT, but we certainly should.
               | 
               | Funnily enough, we adopted AWD-LSTMs, Ranger21, and Mish
               | in the paper I linked after I heard about them through
               | the fast.ai community (we also trialled QRNNs for a bit
               | too). fast.ai has been hugely influential in my work.
        
           | twobitshifter wrote:
            | Agreed. Ilya Sutskever himself spent a long time with LSTMs
            | and published papers like this one while working at Google:
            | http://proceedings.mlr.press/v37/jozefowicz15.pdf
            | 
            | In recent comments he has said that any architecture can
            | achieve transformer-level accuracy and recall, but we have
            | devoted our energy to refining transformers due to the early
            | successes.
        
         | aDyslecticCrow wrote:
          | From my (admittedly loose) reading, the paper particularly
          | targets parallelization and fast training, not "vanishing
          | gradients." However, by simplifying the recurrent units, they
          | managed to improve both!
          | 
          | This is very clever and very interesting. The paper repeatedly
          | calls it a "decade-old architecture," but in practice it's
          | still used massively, thanks to how simply it adapts to
          | different domains. Framing it as a "competitor" to transformers
          | is also not entirely fair, as transformers and RNNs are not
          | mutually exclusive, and there are many methods that merge them.
          | 
          | An improvement in RNNs is an improvement in a lot of other,
          | surprising places. A very interesting read.
        
       | lccerina wrote:
       | "Was all along a scheme by Google to sell more tensor processing
       | units that didn't run RNNs well?"
        
       | scotty79 wrote:
       | The only strength of transformers is that they can run once for
       | each token and pass intermediate state to themselves as they
       | solve your problems. They have to conceal it in tokens that look
       | to humans like part of the response.
       | 
       | It's obvious why the newest toy from OpenAI can solve problems
       | better mostly by just being allowed to "talk to itself" for a
       | moment before starting the answer the human sees.
       | 
       | Given that, a modern incarnation of RNNs could be vastly cheaper
       | than transformers, provided they can be trained.
       | 
       | Convolutional neural networks get more visual understanding by
       | "reusing" their capacity across the area of the image. RNNs and
       | transformers can get a better understanding of a given problem by
       | "reusing" their capacity to learn and infer across time (across
       | steps of an iterative process, really).
       | 
       | When it comes to the transformer architecture, attention is a red
       | herring. It's just a more or less arbitrary way to partition the
       | network so it can be parallelized. The only bit of potential
       | magic is in the "shortcut" links between non-adjacent layers that
       | help propagate learning back through many layers.
       | 
       | Basically, the optimal network is deep and dense (every neuron
       | connects to all neurons in all preceding layers) and is run in
       | some form of recurrence.
       | 
       | But we don't have enough compute to train that. So we need to
       | arbitrarily sever some connections so the whole thing is easier
       | to parallelize. It really doesn't matter which, unless we do it
       | in some obviously stupid way.
       | 
       | The actual inventive, magic part of LLMs possibly happens in the
       | token and positional encoders.
        
       | tadala wrote:
       | Everyone wants to use less compute to fit more in, but
       | (obviously?) the solution will be to use more compute and fit
       | less. Attention isn't (topologically) attentive enough. All these
       | RNN-lite approaches are doomed beyond saving costs; they're going
       | to get cooked by some other arch -- one even more expensive than
       | transformers.
        
         | falcor84 wrote:
         | Would you mind expanding upon your thesis? If that compute and
         | all those parameters aren't "fitting" the training examples,
         | what is it that the model is learning, and how should that be
         | analyzed?
        
           | ithkuil wrote:
           | I think there are two distinct areas. One is the building of
           | the representations, which is achieved by fitting. The other
           | area is loosely defined as "computing" which is some kind of
           | searching for a path through representation space. All of
           | that is wrapped in a translation layer that can turn those
           | representations into stuff we humans can understand and
           | interact with. All of that is achieved to some extent by
           | current transformer architectures, but I guess some believe
           | that they are not quite as effective at the
           | "computation/search" stage.
        
             | falcor84 wrote:
             | But how does it get good at "computing"? The way I see it,
             | we either program them to do so manually, or we use ML, at
             | which case the model "fits" the computation based on
             | training examples or environmental feedback, no? What am I
             | missing?
        
               | ithkuil wrote:
                | The distinction is fuzzy indeed, especially if anything
                | that you "program in manually" has some parameters that
                | are learned.
               | 
               | Conceptually we already have parts of the model that are
               | not learned: the architecture of the model itself.
        
       | Sysreq2 wrote:
       | Guys, I'm gonna stop this before it gets out of hand: All we need
       | is love and a shit ton of compute.
       | 
       | Everything else is just details.
        
       ___________________________________________________________________
       (page generated 2024-10-04 23:01 UTC)