[HN Gopher] Were RNNs all we needed?
___________________________________________________________________
Were RNNs all we needed?
Author : beefman
Score : 470 points
Date : 2024-10-03 17:31 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| tehsauce wrote:
| I haven't gone through the paper in detail yet but maybe someone
| can answer. If you remove the hidden state from an rnn as they
| say they've done, what's left? An mlp predicting from a single
| token?
| statusfailed wrote:
| I only had a quick look, but it looks like they tweaked the
| state update so the model can be run with parallel scan instead
| of having to do it sequentially.
| jfcoa wrote:
| It doesn't completely remove it; it removes certain
| dependencies on it so that the recurrence can be computed by
| parallel scan. There is still a hidden state. It bears some
| similarity to what was done with Mamba.
| bunderbunder wrote:
| They didn't remove the hidden state entirely, they just removed
| it from the input, forget and update gates. I haven't digested
| the paper either, but I think that in the case of a GRU this
| means that the hidden state update masking (z_t and r_t in the
| paper's formulas) only depends on the new input, not the input
| plus the prior hidden state.
| _0ffh wrote:
| The trick is to make sure the recursive dependency stays
| linear; that's how you enable parallel training.
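| A minimal sketch of that point (illustrative Python/numpy, not
| the paper's code; the function names are mine): once the
| recurrence is h_t = (1 - z_t) * h_{t-1} + z_t * htilde_t with
| z_t and htilde_t computed from x_t alone, each step is an affine
| map of h, and composing affine maps is associative, so a
| parallel prefix scan can replace the sequential loop.
|
| import numpy as np
|
| def combine(f, g):
|     # Compose two affine maps h -> a*h + b (f applied first). This
|     # operator is associative, which is what a parallel scan needs.
|     (a1, b1), (a2, b2) = f, g
|     return (a2 * a1, a2 * b1 + b2)
|
| def prefix_scan(ops):
|     # Naive divide-and-conquer inclusive scan. It does O(T log T)
|     # work here, but every level could run in parallel on a GPU.
|     if len(ops) == 1:
|         return ops
|     half = len(ops) // 2
|     left, right = prefix_scan(ops[:half]), prefix_scan(ops[half:])
|     return left + [combine(left[-1], r) for r in right]
|
| T, h0 = 8, 0.0
| z = np.random.rand(T)           # gate, a function of x_t only
| htilde = np.random.randn(T)     # candidate state, also x_t only
| ops = [(1.0 - zt, zt * ht) for zt, ht in zip(z, htilde)]
|
| h_par = np.array([a * h0 + b for a, b in prefix_scan(ops)])
|
| # Reference: the plain sequential recurrence.
| h, h_seq = h0, []
| for zt, ht in zip(z, htilde):
|     h = (1.0 - zt) * h + zt * ht
|     h_seq.append(h)
| assert np.allclose(h_par, np.array(h_seq))
|
| (I believe the paper's minGRU/minLSTM implementation does this
| scan in log-space for numerical stability, but the associativity
| trick is the same.)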
| hydrolox wrote:
| Betteridge's law of headlines?
| woah wrote:
| For paper titles, the law is that the answer is always "yes"
| bunderbunder wrote:
| Not always, I think?
|
| Opinions probably differ, for example, on John Backus's paper
| "Can programming be liberated from the Von Neumann style?"
| Many fans of functional programming would say the answer is
| yes, but Backus himself expressed less enthusiasm in
| interviews later in his life.
|
| I think the important point, though, is that academic papers
| and newspaper articles are _not the same_, and titles in the
| form of questions function differently in the two domains.
| Journalists tend to use titles like these to dissemble and
| sensationalize. When academics use these kinds of titles for
| peer-reviewed articles, it's because they really are asking
| an honest question. Backus was doing it in his paper. The
| authors of this paper are doing the same. They end the paper
| by re-iterating the question before launching into a
| discussion of the limitations that prevent them from reaching
| any firm conclusions on the answer to this question.
| nephanth wrote:
| More like "we aren't sure, but we have good reasons not to
| exclude the possibility"
| hiddencost wrote:
| Note Yoshua Bengio in the author list. This shouldn't be taken
| lightly.
| auggierose wrote:
| And this is where science breaks down.
| hotspot_one wrote:
| Not really, because
|
| 1) Yoshua's reputation would take a hit if this paper were
| bullshit, so he has extrinsic motivation to make it good.
|
| 2) Yoshua has enough experience to know what is going on in
| the field; you don't have to ask whether he forgot about a
| certain architecture or the work of a certain research group
| that would contradict his findings -- if such work exists and
| is credible, it is very likely to be discussed in the paper.
|
| 3) This test answers something a leader in the field thinks
| is important enough to work on, else he wouldn't be involved.
|
| Also note, the poster said the paper shouldn't be taken
| lightly. That doesn't mean we need to take it blindly. It
| only means we cannot dismiss it out of hand, if we have a
| different view we would need substantive arguments to defend
| our view.
|
| I've overturned the field leader several times in science,
| but that's only because I acknowledged what they got right
| and that they were indeed the person who got it right.
| DAGdug wrote:
| " I've overturned the field leader several times in
| science" Either that makes you a field leader yourself, or
| you did it for trivial things, or you're BSing. Which one
| is it?
| exe34 wrote:
| there's a big space between leader and trivial. it's
| entirely possible to point out the top leader in your
| field is wrong on ten things over a career, without
| becoming the top leader yourself.
| DAGdug wrote:
| On speculative things or trivial things, sure! On
| substantive matters (recall: the choice of words is
| "overturned"), in empirical realms or theory (physics,
| CS) or math, it's rather doubtful. Anonymous, self-
| declared geniuses aren't to be taken at face value.
| exe34 wrote:
| > Anonymous, self-declared geniuses aren't to be taken at
| face value.
|
| no, that would be a grievous mistake on an anonymous
| site.
| auggierose wrote:
| > It only means we cannot dismiss it out of hand, if we
| have a different view we would need substantive arguments
| to defend our view.
|
| You will need to do that anyway, no matter if Yoshua is on
| the paper, or not. I understand that people have limited
| bandwidth, and so they need shortcuts, and they need to
| justify these shortcuts to themselves somehow (of course
| the justifications are nonsense). Maybe AI will help here.
| _giorgio_ wrote:
| Who cares. Look at Geoffrey Hinton right now. Do you trust him?
| :-D
| imjonse wrote:
| To their credit, the authors (Y. Bengio among them) end the paper
| with the question, not suggesting they know the answer. These
| models are very small even by academic standards so any finding
| would not necessarily extend to current LLM scales. The main
| conclusion is that RNN class networks can be trained as
| efficiently as modern alternatives but the resulting performance
| is only competitive at small scale.
| phkahler wrote:
| >> These models are very small even by academic standards so
| any finding would _not necessarily_ extend to current LLM
| scales.
|
| Emphasis on not necessarily.
|
| >> The main conclusion is that RNN class networks can be
| trained as efficiently as modern alternatives but the resulting
| performance is only competitive at small scale.
|
| Shouldn't the conclusion be "the resulting competitive
| performance has only been confirmed at small scale"?
| imjonse wrote:
| yes, that is clearer indeed. However S4 and Mamba class
| models have also performed well at small scale and started
| lagging with larger models and larger context sizes, or at
| particular tasks.
| xnx wrote:
| It's a curse and a blessing that discussion of topics happens in so
| many different places. I found this comment on Twitter/X
| interesting: https://x.com/fchollet/status/1841902521717293273
|
| "Interesting work on reviving RNNs.
| https://arxiv.org/abs/2410.01201 -- in general the fact that
| there are many recent architectures coming from different
| directions that roughly match Transformers is proof that
| architectures aren't fundamentally important in the curve-fitting
| paradigm (aka deep learning)
|
| Curve-fitting is about embedding a dataset on a curve. The
| critical factor is the dataset, not the specific hard-coded bells
| and whistles that constrain the curve's shape. As long as your
| curve is sufficiently expressive all architectures will converge
| to the same performance in the large-data regime."
| islewis wrote:
| > "As long as your curve is sufficiently expressive all
| architectures will converge to the same performance in the
| large-data regime."
|
| I haven't fully ingested the paper yet, but it looks like it's
| focused more on compute optimization than the size of the
| dataset:
|
| > ... and (2) are fully parallelizable during training (175x
| faster for a sequence of length 512
|
| Even if many types of architectures converge to the same loss
| over time, finding the one that converges the fastest is quite
| valuable given the cost of running GPU's at scale.
| teruakohatu wrote:
| > Even if many types of architectures converge to the same
| loss over time, finding the one that converges the fastest is
| quite valuable given the cost of running GPU's at scale.
|
| This! Not just fastest but with the lowest resources in
| total.
|
| Fully connected neural networks are universal function
| approximators. Technically we don't need anything but an FNN,
| but memory requirements and speed would be abysmal, far beyond the realm
| of practicality.
| actionfromafar wrote:
| Unless we could build chips in 3D?
| foota wrote:
| Not even then, a truly fully connected network would have
| super exponential runtime (it would take N^N time to
| evaluate)
| ivan_gammel wrote:
| We need quantum computing there. I remember seeing a
| recent article about quantum processes in the brain. If
| that's true, QC may be the missing part.
| eru wrote:
| Compare and contrast https://www.smbc-comics.com/comic/the-talk-3
|
| (Summary: quantum computing is unlikely to help.)
| tsimionescu wrote:
| This is just word salad.
|
| There is no known quantum algorithm that can compute the
| result of a fully-connected neural network exponentially
| faster than classical computers can. QCs have a known
| exponential advantage over classical computers only for a
| very limited class of problems, mostly related to the
| Quantum Fourier Transform.
|
| Animal brains have little to nothing in common with
| artificial neural networks. There is no reason whatsoever
| to think that there is any relation between the
| complexity class of brain functions and ANN inference.
|
| And the hypothesized (and still wildly speculative)
| quantum behaviors happening in the animal brain are at
| the level of the behavior of individual neurons, not of
| the network connections between neurons. So even if there
| is some kind of quantum computation happening, it's
| happening in individual neurons, not at the network
| level, and that would only go to show even more that
| animal brains are profoundly different from ANNs.
| mvkel wrote:
| Wetware is the future.
| fennecfoxy wrote:
| Can't wait to see this defiantly spray painted across a
| torn up brick wall while computronium brained super
| intelligences slowly disassemble our planet to make
| paperclips.
| ComputerGuru wrote:
| Heat extraction.
| bob1029 wrote:
| We are already doing this.
| byearthithatius wrote:
| > finding the one that converges the fastest is quite
| valuable given the cost of running GPU's at scale
|
| Not to him, he runs the ARC challenge. He wants a new
| approach entirely. Something capable of few-shot learning of
| out-of-distribution patterns... somehow
| acchow wrote:
| What it will come down to is computational efficiencies. We
| don't want to retrain once a month - we want to retrain
| continuously. We don't want one agent talking to 5 LLMs. We
| want thousands of LLMs all working in concert.
| ActorNightly wrote:
| This and also the way models are trained has to be rethought.
| Backprop is good for figuring out complex function mappings, but
| not for storing information.
| pbhjpbhj wrote:
| Sounds like something that has unsustainable energy costs.
| Lerc wrote:
| I remember one of the initial transformer people saying in an
| interview that they didn't think this was the "one true
| architecture" but a lot of the performance came from people
| rallying around it and pushing in the one direction.
|
| On the other hand, while _"As long as your curve is
| sufficiently expressive all architectures will converge to the
| same performance in the large-data regime."_ is true, a
| sufficiently expressive mechanism may not be computationally or
| memory efficient. As both are constraints on what you can
| actually build, it's not whether the architecture can produce
| the result, but whether a feasible/practical instantiation of
| that architecture can produce the result.
| viktor_von wrote:
| > I remember one of the initial transformer people saying in
| an interview that they didn't think this was the "one true
| architecture" but a lot of the performance came from people
| rallying around it and pushing in the one direction.
|
| You may be referring to Aidan Gomez (CEO of Cohere and
| contributor to the transformer architecture) during his
| Machine Learning Street Talk podcast interview. I agree, if
| as much attention had been put towards the RNN during the
| initial transformer hype, we might well have seen these
| advancements earlier.
| ants_everywhere wrote:
| > is proof that architectures aren't fundamentally important in
| the curve-fitting paradigm (aka deep learning)
|
| (Somewhat) fun and (somewhat) related fact: there's a whole
| cottage industry of "is all you need" papers
| https://arxiv.org/search/?query=%22is+all+you+need%22&search...
| TaurenHunter wrote:
| Reminds me of the "Considered Harmful" articles:
|
| https://meyerweb.com/eric/comment/chech.html
| jprete wrote:
| I wonder if there's something about tech culture - or tech
| people - that encourages them to really, really like
| snowclones.
| observationist wrote:
| Yes. Do stuff that other people have been successful
| doing. Monkey see, monkey do - it's not a tech people
| thing, it's a human thing.
|
| Tech just happens to be most on display at the moment -
| because tech people are building the tools and the
| parameters and the infrastructure handling all our
| interactions.
| fennecfoxy wrote:
| Not sure why people are surprised about this when it's
| the modus operandi of all life on the planet.
|
| I could spam "we are the stochastic parrots after all" yet
| one more time.
| bee_rider wrote:
| Quick, somebody write "All you need Considered Harmful" and
| "Considered Harmful all you need."
|
| Which seems closer to true?
| cozzyd wrote:
| All you need is all you need.
| tsimionescu wrote:
| Starting of course with the classic paper from Lennon and
| McCartney, 1967.
| wongarsu wrote:
| One big thing that bells and whistles do is limit the training
| space.
|
| For example when CNNs took over computer vision that wasn't
| because they were doing something that dense networks couldn't
| do. It was because they removed a lot of edges that didn't
| really matter, allowing us to spend our training budget on
| deeper networks. Similarly transformers are great because they
| allow us to train gigantic networks somewhat efficiently. And
| this paper finds that if we make RNNs a lot faster to train
| they are actually pretty good. Training speed and efficiency
| remain the big bottleneck, not the actual expressiveness of
| the architecture.
| nutanc wrote:
| This is true. This is the reason that, in many of our
| experiments using a new algorithm, KESieve, we actually find
| the planes much faster than traditional deep learning
| training approaches do. The premise is: a neural network builds
| planes which separate the data and adjusts these planes
| through an iterative learning process. What if we could find a
| non-iterative method which can draw these same planes? We
| have been trying this, and so far we have been able to replace
| most network layers using this approach. We haven't tried it
| for transformers yet, though.
|
| Some links if interested:
|
| [1] https://gpt3experiments.substack.com/p/understanding-
| neural-...
|
| [2] https://gpt3experiments.substack.com/p/building-a-vector-
| dat...
| dheera wrote:
| I mean, transformer-based LLMs are RNNs, just really really
| really big ones with very wide inputs that maintain large
| amounts of context.
| immibis wrote:
| No. An RNN has an arbitrarily-long path from old inputs to
| new outputs, even if in practice it can't exploit that path.
| Transformers have fixed-size input windows.
| og_kalu wrote:
| You can't have a fixed state and have arbitrarily-long path
| from input. Well you can but then it's just meaningless
| because you fundamentally cannot keep stuffing information
| of arbitrary length into a fixed state. RNNs effectively
| have fixed-size input windows.
| immibis wrote:
| The _path_ is arbitrarily _long_, not wide. It is
| _possible_ for an RNN to be made that remembers the first
| word of the input, no matter how long the input is. This
| is not possible with a transformer, so we know they are
| fundamentally different.
| quotemstr wrote:
| But an RNN isn't _going_ to remember the first token of
| input. It won't know until it sees the last token
| whether that first token was relevant after all, so it
| has to learn token-specific update rules that let it
| guess how long to hold what kinds of information. (In
| multi-layer systems, the network uses ineffable
| abstractions rather than tokens, but the same idea
| applies.)
|
| What the RNN must be doing reminds me of "sliding window
| attention" --- the model learns how to partition its
| state between short- and long-range memories to minimize
| overall loss. The two approaches seem related, perhaps
| even equivalent up to implementation details.
| OkayPhysicist wrote:
| The most popular RNNs (the ones that were successful
| enough for Google translate and the like) actually had
| this behavior baked into the architecture; they were called
| "LSTMs", for "Long Short-Term Memory".
| dheera wrote:
| A chunk of the output still goes into the transformer
| input, so the arbitrarily-long path still exists, it just
| goes through a decoding/encoding step.
| WithinReason wrote:
| no, you can give as much context to a transformer as you
| want, you just run out of memory
| immibis wrote:
| An RNN doesn't run out of memory from that, so they are
| still fundamentally different.
|
| How do you encode arbitrarily long positions, anyway?
| WithinReason wrote:
| They are different but transformers don't have fixed
| windows, you can extend the context or make it smaller. I
| think you can extend a positional encoding if it's not a
| learned encoding.
| quantadev wrote:
| Most LLMs aren't even using a "curve" yet at all, right? All
| they're using is a series of linear equations because the model
| weights are a simple multiply and add (i.e. basic NN
| Perceptron). Sure there's a squashing function on the output to
| keep it in a range from 0 to 1 but that's done BECAUSE we're
| just adding up stuff.
|
| I think probably future NNs will be maybe more adaptive than
| this perhaps where some Perceptrons use sine wave functions, or
| other kinds of math functions, beyond just linear "y=mx+b"
|
| It's astounding that we DID get the emergent intelligence from
| just doing this "curve fitting" onto "lines" rather than actual
| "curves".
| OkayPhysicist wrote:
| The "squashing function" necessarily is nonlinear in
| multilayer neural networks. A single layer of a neural
| network can be quite simply written as a weight matrix times an
| input vector, equalling an output vector, like so
|
| Ax = y
|
| Adding another layer is just multiplying a different set of
| weights times the output of the first, so
|
| B(Ax)= y
|
| If you remember your linear algebra course, you might see the
| problem: that can be simplified
|
| (BA)x = y
|
| Cx = y
|
| Completely indistinguishable from a single layer, thus only
| capable of modeling linear relationships.
|
| To prevent this collapse, a non linear function must be
| introduced between each layer.
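| A quick numerical illustration of that collapse (my own sketch
| in Python/numpy, not part of the comment above):
|
| import numpy as np
|
| A = np.array([[1.0, -2.0],
|               [0.5,  1.0]])     # layer 1 weights
| B = np.array([[2.0,  1.0],
|               [-1.0, 3.0]])     # layer 2 weights
| x = np.array([1.0, 1.0])
|
| # Two purely linear layers are exactly one linear layer C = B @ A.
| C = B @ A
| assert np.allclose(B @ (A @ x), C @ x)
|
| # Put a nonlinearity (here ReLU) between them and the collapse fails.
| relu = lambda v: np.maximum(v, 0.0)
| print(B @ relu(A @ x))   # [ 1.5  4.5]
| print(C @ x)             # [-0.5  5.5]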
| quantadev wrote:
| Right. All the squashing is doing is keeping the output of
| any neuron in a range of below 1.
|
| But the entire NN itself (Perceptron ones, which most LLMs
| are) is still completely using nothing but linearity to
| store all the knowledge from the training process. All the
| weights are just an 'm' in the basic line equation
| 'y=m*x+b'. The entire training process does nothing but
| adjust a bunch of slopes of a bunch of lines. It's totally
| linear. No non-linearity at all.
| nazgul17 wrote:
| The non linearities are fundamental. Without them, any
| arbitrarily deep NN is equivalent to a shallow NN (easily
| computable, as GP was saying), and we know those can't
| even solve the XOR problem.
|
| > nothing but linearity
|
| No, if you have non linearities, the NN itself is _not_
| linear. The non linearities are not there primarily to
| keep the outputs in a given range, though that's
| important, too.
| quantadev wrote:
| > The non linearities are not there primarily to keep the
| outputs in a given range
|
| Precisely what the `Activation Function` does is to
| squash an output into a range (normally below one, like
| tanh). That's the only non-linearity I'm aware of. What
| other non-linearities are there?
|
| All the training does is adjust linear weights tho, like
| I said. All the training is doing is adjusting the slopes
| of lines.
| jcparkyn wrote:
| > squash an output into a range
|
| This isn't the primary purpose of the activation
| function, and in fact it's not even necessary. For
| example see ReLU (probably the most common activation
| function), leaky ReLU, or for a sillier example:
| https://youtu.be/Ae9EKCyI1xU?si=KgjhMrOsFEVo2yCe
| quantadev wrote:
| You can change the subject by bringing up as many
| different NN architectures, Activation Functions, etc. as
| you want. I'm telling you the basic NN Perceptron design
| (what everyone means when they refer to Perceptrons in
| general), has something like a `tanh` and not only is
| its PRIMARY function to squash a number, that's its
| ONLY function.
| beckhamc wrote:
| How was that person derailing the convo? Nothing says an
| activation function has to "squash" a number to be in
| some range. Leaky ReLUs for instance do `f(x) = x if x >
| 0 else ax` (for some coefficient `a != 0`), that doesn't
| squash `x` to be in any range (unless you want to be
| peculiar about your precise definition of what it means
| to squash a number). The function takes a real in `[-inf,
| inf]` and produces a number in `[-inf, inf]`.
|
| > Sure there's a squashing function on the output to keep
| it in a range from 0 to 1 but that's done BECAUSE we're
| just adding up stuff.
|
| It's not because you're "adding up stuff", there is
| specific mathematical or statistical reason why it is
| used. For neural networks it's there to stop your multi
| layer network collapsing to a single layer one (i.e. a
| linear algebra reason). You can choose whatever function
| you want, for hidden layers tanh generally isn't used
| anymore, it's usually some variant of a ReLU. In fact
| Leaky ReLUs are very commonly used so OP isn't changing
| the subject.
|
| If you define a "perceptron" (`g(Wx+b)` and `W` is a
| `Px1` matrix) and train it as a logistic regression model
| then you want `g` to be sigmoid. Its purpose is to ensure
| that the output can be interpreted as a probability
| (given that you use the correct statistical loss), which
| means squashing the number. The inverse isn't true, if I
| take random numbers from the internet and squash them to
| `[0,1]` I don't go call them probabilities.
|
| > and not only is its PRIMARY function to squash a
| number, that's its ONLY function.
|
| Squashing the number isn't the reason, it's the side
| effect. And even then, I just said that not all
| activation functions squash numbers.
|
| > All the training does is adjust linear weights tho,
| like I said.
|
| Not sure what your point is. What is a "linear weight"?
|
| We call layers of the form `g(Wx+b)` "linear" layers but
| that's an abused term, if g() is non-linear then the
| output is not linear. Who cares if the inner term `Wx +
| b` is linear? With enough of these layers you can
| approximate fairly complicated functions. If you're
| arguing as to whether there is a better fundamental
| building block then that is another discussion.
| quantadev wrote:
| > What is a "linear weight"?
|
| In the context of discussing linearity v.s. non-linearity
| adding the word "linear" in front of "weight" is more
| clear, which is what my top level post on this thread was
| all about too.
|
| It's astounding to me (and everyone else who's being
| honest) that LLMs can accomplish what they do when it's
| only linear "factors" (i.e. weights) that are all that's
| required to be adjusted during training, to achieve
| genuine reasoning. During training we're not [normally]
| adjusting any parameters or weights on any non-linear
| functions. I include the caveat "normally", because I'm
| speaking of the basic Perceptron NN using a squashing-
| type activation function.
| viktor_von wrote:
| > It's astounding to me (and everyone else who's being
| honest) that LLMs can accomplish what they do when it's
| only linear "factors" (i.e. weights) that are all that's
| required to be adjusted during training, to achieve
| genuine reasoning.
|
| When such basic perceptrons are scaled enormously, it
| becomes less surprising that they can achieve some level
| of 'genuine reasoning' (e.g., accurate next-word
| prediction), since the goal with such networks at the end
| of the day is just function approximation. What is more
| surprising to me is how we found ways to train such
| models i.e., advances in hardware accelerators, combined
| with massive data, which are factors just as significant
| in my opinion.
| quantadev wrote:
| Yeah, no one is surprised that LLMs do what they're
| trained to do: predict tokens. The surprise comes from
| the fact that merely training to predict tokens ends up
| with model weights that generate emergent reasoning.
|
| If you want to say reasoning and token prediction are
| just the same thing at scale you can say that, but I
| don't fall into that camp. I think there's MUCH more to
| learn, and indeed a new field of math or even physics
| that we haven't even discovered yet. Like a step change
| in mathematical understanding analogous to the invention
| of Calculus.
| mr_toad wrote:
| You need a non-linear activation function for the
| universal approximation theorem to hold. Otherwise, as
| others have said the model just collapses to a single
| layer.
|
| Technically the output is still what a statistician would
| call "linear in the parameters", but due to the universal
| approximation theorem it can _approximate_ any non-linear
| function.
|
| https://stats.stackexchange.com/questions/275358/why-is-
| incr...
| quantadev wrote:
| As you can see in what I just posted about an inch below
| this, my point is that the process of training a NN does
| not involve adjusting any parameter to any non-linear
| functions. What goes into an activation function is a
| pure sum of linear multiplications and an add, but
| there's no "tunable" parameter (i.e. adjusted during
| training) that's fed into the activation function.
| beckhamc wrote:
| Learnable parameters on activations _do_ exist, look up
| parametric activation functions.
| quantadev wrote:
| Of course they do exist. A parameterized activation
| function is the most obvious thing to _try_ in NN design,
| and has certainly been invented/studied by 1000s of
| researchers.
| uh_uh wrote:
| > That's the only non-linearity I'm aware of.
|
| "only" is doing a lot work here because that non-
| linearity is enough to vastly expand the landscape of
| functions that an NN can approximate. If the NN was
| linear, you could greatly simplify the computational
| needs of the whole thing (as was implied by another
| commenter above) but you'd also not get a GPT out of it.
| quantadev wrote:
| All the trainable parameters are just slopes of lines
| tho. Training NNs doesn't involve adjusting any inputs to
| non-linear functions. The tanh smashing function just
| makes sure nothing can blow up into large numbers and all
| outputs are in a range of less than 1. There's no "magic"
| or "knowledge" in the tanh smashing. All the magic is
| 100% in the weights. They're all linear. The amazing
| thing is that all weights are linear slopes of lines.
| Nevermark wrote:
| Simply squashing the output of a linear signal would be
| multiplying by a small value. To avoid large y, you add a
| step y' = y/1000.
|
| That would still be linear. And the result would be that
| despite squashing, no matter how many layers a model had,
| it could only fit linear problems. Which can always be
| fit with a single layer, i.e. single matrix.
|
| So nobody does that.
|
| The nonlinearity doesn't just squash some inputs. But
| create a new rich feature, decision making. That's
| because on one side of a threshold y gets converted very
| differently than another. I.e. if y > 0, y' = y, otherwise
| y' = 0.
|
| Now you have a discontinuity in behavior, you have a
| decision.
|
| Multiple layers making decisions can do far more than a
| linear layer. They can fit any continuous function (or
| any function with a finite number of discontinuities)
| arbitrarily well.
|
| Non-linearities add a fundamental new feature. You can
| think of that features as being able to make decisions
| around the non-linear function's decision points.
|
| ---
|
| If you need to prove this to yourself with a simple
| example, try to create an XOR gate with this function:
| y = w1 * x1 + w2 * x2 + b.
|
| Where you can pick w1, w2 and b.
|
| You are welcome to linearly squash the output, i.e. y' =
| y * w3, for whatever small w3 you like. It won't help.
|
| Layers with non-linear transformations are layers of
| decision makers.
|
| Layers of linear transforms are just unnecessarily long
| ways of writing a single linear transform. Even with
| linear "squashing".
| quantadev wrote:
| Right, it's obvious that the ReLU is just a gating
| mechanism, and you can think of that as a decision maker.
| It's like a "pass thru linearly proportionally" or
| "block" function.
|
| But I still find it counter-intuitive that it's not
| common practice in standard LLM NNs to have a trainable
| parameter that in some way directly "tunes" whatever
| Activation Function is being applied on EACH output.
|
| For example I almost started experimenting with
| trigonometric activation functions in a custom NN where
| the phase angle would be adjusted, inspired by Fourier
| Series. I can envision a type of NN where every model
| "weight" is actually a frequency component, because
| Fourier Series can represent any arbitrary function in
| this way. There has of course already been similar
| research done by others along these lines.
| uh_uh wrote:
| > The tanh smashing function just makes sure nothing can
| blow up into large numbers and all outputs are in a range
| of less than 1.
|
| That's not the main point even though it probably helps.
| As OkayPhysicist said above, without a nonlinearity, you
| could collapse all the weight matrices into a single
| matrix. If you have 2 layers (same size, for simplicity)
| described by weight matrices A and B, you could multiply
| them and get C, which you could use for inference.
|
| Now, you can do this same trick not only with 2 layers
| but 100 million, all collapsing into a single matrix
| after multiplication. If the nonlinearities weren't
| there, the effective information content of the whole NN
| would collapse into that of a single-layer NN.
| quantadev wrote:
| You can explain the "effect" of tanh at any level of
| abstraction you like, up to including describing things
| that happen in Semantic Space itself, but my description
| of what tanh is doing is 100% accurate in the context I
| used it. All it's doing is squashing a number down to
| below one. My understanding of how the Perceptron works
| is fully correct, and isn't missing any details. I've
| implemented many of them.
| beckhamc wrote:
| Your description of tanh isn't even correct, it squashes
| a real number to `(-1, 1)`, not "less than one".
|
| You're curious about whether there is gain in
| parameterising activation functions and learning them
| instead, or rather, why it's not used much in practice.
| That's an interesting and curious academic question, and
| it seems like you're already experimenting with trying
| out your own kinds of activation functions. However,
| people in this thread (including myself) wanted to
| clarify some perceived misunderstandings you had about
| nonlinearities and "why" they are used in DNNs. Or how
| "squashing functions" is a misnomer because `g(x) =
| x/1000` doesn't introduce any nonlinearities. Yet you
| continue to fixate and double down on your knowledge of
| "what" a tanh is, and even that is incorrect.
| quantadev wrote:
| When discussing `tanh squashing` among other AI experts
| it's generally assumed that even the most pedantic and
| uncharitable parsing of words won't be able to
| misinterpret "smashing to less than one" as an
| _incorrect_ sentence fragment, because the "one", in
| that context, obviously refers to distance from zero.
| wrs wrote:
| With a ReLU activation function, rather than a simple
| linear function of the inputs, you get a _piecewise
| linear approximation_ of a nonlinear function.
|
| ReLU enables this by being nonlinear in a simple way,
| specifically by outputting zero for negative inputs, so
| each linear unit can then limit its contribution to a
| portion of the output curve.
|
| (This is a lot easier to see on a whiteboard!)
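| A small sketch of that picture (my own illustration in
| Python/numpy, with fixed hinge locations and only the output
| weights fit by least squares): a sum of shifted ReLUs is a
| piecewise-linear curve, and it already tracks a smooth function
| like sin(x) quite closely.
|
| import numpy as np
|
| x = np.linspace(-np.pi, np.pi, 200)
| target = np.sin(x)
|
| kinks = np.linspace(-np.pi, np.pi, 12)            # hinge locations
| H = np.maximum(x[:, None] - kinks[None, :], 0.0)  # ReLU features
| H = np.column_stack([H, np.ones_like(x)])         # plus a bias term
|
| w, *_ = np.linalg.lstsq(H, target, rcond=None)    # fit output weights
| approx = H @ w
| print(np.max(np.abs(approx - target)))            # small max error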
| quantadev wrote:
| ReLU technically has a non-linearity at zero, but in some
| sense it's still even MORE linear than tanh or sigmoid,
| so it just demonstrates even better than tanh-type
| squashing that all this LLM stuff is being done
| ultimately with straight line math. All a ReLU function
| does is choose which line to use, a sloped one or a zero
| one.
| wrs wrote:
| Well. The word "linear" the way you use it doesn't seem
| to have any particular meaning, certainly not the
| standard mathematical meaning, so I'm not sure we can
| make further progress on this explanation.
|
| I'll just reiterate that the single "technical" (whatever
| that means) nonlinearity in ReLU is exactly what lets a
| layer approximate any continuous[*] function.
|
| [*] May have forgotten some more adjectives here needed
| for full precision.
| quantadev wrote:
| If you're confused just show a tanh graph and a ReLU
| graph to a 7 year old child and ask which one is linear.
| They'll all get it right. So you're not confused in the
| slightest bit about anything I've said. There's nothing
| even slightly confusing about saying a ReLU is made of
| two lines.
| mickg10 wrote:
| I.e. ReLU is _piecewise_ linear. That kink where the 2
| pieces meet is precisely what makes it non
| linear. Which is what enables the actual universal
| approximation.
| quantadev wrote:
| Which is what I said two replies ago.
|
| Followed by "in some sense it's [ReLU] still even MORE
| linear than tanh or sigmoid functions are". There's no
| way you misunderstood that sentence, or took it as my
| "definition" of linearity...so I guess you just wanted to
| reaffirm I was correct, again, so thanks.
| scarmig wrote:
| Nonlinearity somewhere is fundamental, but it doesn't
| need to be between each layer. You can, for instance,
| project each input to a higher dimensional space with a
| nonlinearity, and the problem becomes linearly separable
| with high probability (cf Cover's Theorem).
|
| So, for XOR, (x, y) -> (x, y, xy), and it becomes trivial
| for a linear NN to solve.
|
| Architectures like Mamba have a linear recurrent state
| space system as their core, so even though you need a
| nonlinearity somewhere, it doesn't need to be pervasive.
| And linear recurrent networks are surprisingly powerful
| (https://arxiv.org/abs/2303.06349,
| https://arxiv.org/abs/1802.03308).
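| A tiny check of that feature-lift point (my own sketch): with
| the single fixed nonlinearity (x1, x2) -> (x1, x2, x1*x2), a
| purely linear readout solves XOR exactly.
|
| import numpy as np
|
| X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
| xor = np.array([0, 1, 1, 0], dtype=float)
|
| Z = np.column_stack([X, X[:, 0] * X[:, 1]])   # lifted features
| w = np.array([1.0, 1.0, -2.0])                # y = x1 + x2 - 2*x1*x2
| assert np.allclose(Z @ w, xor)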
| mr_toad wrote:
| > It's astounding that we DID get the emergent intelligence
| from just doing this "curve fitting" onto "lines" rather than
| actual "curves".
|
| In Ye Olden days (the 90's) we used to approximate non-linear
| models using splines or separate slopes models - fit by hand.
| They were still linear, but with the right choice of splines
| you could approximate a non-linear model to whatever degree
| of accuracy you wanted.
|
| Neural networks "just" do this automatically, and faster.
| quantadev wrote:
| In college (BSME) I wrote a computer program to generate
| cam profiles from Bezier curves. It's just a programming
| trick to generate curves from straight lines at any level
| of accuracy you want just by letting the computer take
| smaller and smaller steps.
|
| It's an interesting concept to think of how NNs might be
| able to exploit this effect in some way based on straight
| lines in the weights, because a very small number of points
| can identify a very precise and smooth curve, where
| directions on the curve might equate to Semantic Space
| Vectors.
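| For the curious, that "curves from straight lines" trick is
| de Casteljau's algorithm: repeated linear interpolation between
| control points. A minimal sketch (my own illustrative Python):
|
| import numpy as np
|
| def de_casteljau(points, t):
|     # Repeatedly interpolate along straight lines between the
|     # control points until one point of the Bezier curve remains.
|     pts = np.asarray(points, dtype=float)
|     while len(pts) > 1:
|         pts = (1.0 - t) * pts[:-1] + t * pts[1:]
|     return pts[0]
|
| control = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]   # quadratic Bezier
| curve = [de_casteljau(control, t) for t in np.linspace(0.0, 1.0, 11)]
| print(curve[5])   # midpoint of the arc: [0.5 0.5]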
| quantadev wrote:
| In fact now that I think about it, for any 3 or more
| points in Semantic Space, there would necessarily be a
| "Bezier Path" which would have genuine meaning at every
| point as a good smooth differentiable path thru higher
| dimensional space to get from one point to another point
| while "visiting" all intermediate other points. This has
| to have a direct use in LLMs in terms of reasoning.
| sakras wrote:
| I figured this was pretty obvious given that MLPs are universal
| function approximators. A giant MLP could achieve the same
| results as a transformer. The problem is the scale - we can't
| train a big enough MLP. Transformers are a performance
| optimization, and that's why they're useful.
| ctur wrote:
| Architecture matters because while deep learning can
| conceivably fit a curve with a single, huge layer (in theory...
| Universal approximation theorem), the amount of compute and
| data needed to get there is prohibitive. Having a good
| architecture means the theoretical possibility of deep learning
| finding the right N dimensional curve becomes a practical
| reality.
|
| Another thing about the architecture is we inherently bias it
| with the way we structure the data. For instance, take a
| dataset of (car) traffic patterns. If you only track the date
| as a feature, you miss that some events follow not just the
| day-of-year pattern but also holiday patterns. You could learn
| this with deep learning with enough data, but if we bake it
| into the dataset, you can build a model on it _much_ simpler
| and faster.
|
| So, architecture matters. Data/feature representation matters.
| mr_toad wrote:
| > can conceivably fit a curve with a single, huge layer
|
| I think you need a hidden layer. I've never seen a universal
| approximation theorem for a single layer network.
| dongecko wrote:
| I second that thought. There is a pretty well cited paper
| from the late eighties called "Multilayer Feedforward
| Networks are Universal Approximators". It shows that a
| feedforward network with a single hidden layer containing a
| finite number of neurons can approximate any continuous
| function. For non-continuous functions additional layers are
| needed.
| drodgers wrote:
| > The critical factor is the dataset, not the specific hard-
| coded bells and whistles that constrain the curve's shape
|
| I have almost the opposite take. We've had a lot of datasets
| for ages, but all the progress in the last decade has come from
| advances in how curves are architected and fit to the dataset
| (including applying more computing power).
|
| Maybe there's some theoretical sense in which older models
| could have solved newer problems just as well if only we
| applied 1000000x the computing power, so the new models are
| 'just' an optimisation, but that's like dismissing the
| importance of complexity analysis in algorithm design, and thus
| insisting that bogosort and quicksort are equivalent.
|
| When you start layering in normalisation techniques to minimise
| overfitting, and especially once you start thinking about more
| agentic architectures (eg. Deep Q Learning, some of the search
| space design going into OpenAI's o1), then I don't think the
| just-an-optimisation perspective can hold much water at all -
| more computing power simply couldn't solve those problems with
| older architectures.
| eru wrote:
| I see what you are saying, and I made a similar comment.
|
| However it's still an interesting observation that many
| architectures can arrive at the same performance (even though
| the training requirements are different).
|
| Naively, you wouldn't expect eg 'x -> a * x + b' to fit the
| same data as 'x -> a * sin x + b' about equally well. But
| that's an observation from low dimensions. It seems once you
| add enough parameters, the exact model doesn't matter too
| much for practical expressiveness.
|
| I'm faintly reminded of the Church-Turing Thesis; the
| differences between different computing architectures are
| both 'real' but also 'just an optimisation'.
|
| > When you start layering in normalisation techniques to
| minimise overfitting, and especially once you start thinking
| about more agentic architectures (eg. Deep Q Learning, some
| of the search space design going into OpenAI's o1), then I
| don't think the just-an-optimisation perspective can hold
| much water at all - more computing power simply couldn't
| solve those problems with older architectures.
|
| You are right, these normalisation techniques help you
| economise on training data, not just on compute. Some of
| these techniques can be done independent of the model, eg
| augmenting your training data with noise. But some others are
| very model dependent.
|
| I'm not sure how the 'agentic' approaches fit here.
| refulgentis wrote:
| > _Naively, you wouldn't expect_
|
| I, a naïf, expected this.
|
| Is multiplication versus sine in the analogy hiding it,
| perhaps?
|
| I've always pictured it as just "needing to learn" the
| function terms and the function guts are an abstraction
| that is learned.
|
| Might just be because I'm a physics dropout with a bunch of
| whacky half-remembered probably-wrong stuff about how any
| function can be approximated by e.g. Fourier series.
| eru wrote:
| So (most) neural nets can be seen as a function of a
| _fixed_ form with some inputs and lots and lots of
| parameters.
|
| In my example, a and b were the parameters. The kinds of
| data you can approximate well with a simple sine wave and
| the kinds of data you can approximate with a straight
| line are rather different.
|
| Training your neural net only fiddles with the parameters
| like a and b. It doesn't do anything about the shape of
| the function. It doesn't change sine into multiplication
| etc.
|
| > [...] about how any function can be approximated by ex.
| fourier series.
|
| Fourier series are an interesting example to bring up! I
| think I see what you mean.
|
| In theory they work well to approximate any function over
| either a periodic domain or some finite interval. But
| unless you take special care, when you apply Fourier
| analysis naively it becomes extremely sensitive to errors
| in the phase parameters.
|
| (Special care could eg be done by hacking up your input
| domain into 'boxes'. That works well for eg audio or
| video compression, but gives up on any model
| generalisation between 'boxes', especially for what would
| happen in a later box.)
|
| Another interesting example is Taylor series. For many
| simple functions Taylor series are great, but for even
| moderately complicated ones you need to be careful. See
| eg how the Taylor series for the logarithm around x=1
| works well, but if you tried it around x=0, you are in
| for a bad time.
|
| The interesting observation isn't just that there are
| multiple universal approximators, but that at high enough
| parameter count, they seem to perform about equally well
| in how good they are at approximating in practice (but
| differ in how well they can be trained).
| leereeves wrote:
| > Training your neural net only fiddles with the
| parameters like a and b. It doesn't do anything about the
| shape of the function. It doesn't change sine into
| multiplication etc.
|
| It definitely can. The output will always be piecewise
| linear (with ReLU), but the overall shape can change
| completely.
| ziofill wrote:
| You can fit any data with enough parameters. What's
| tricky is to constrain a model so that it approximates
| the ground truth well where there are no data points. If
| a family of functions is extremely flexible and can fit
| all kinds of data very efficiently I would argue it makes
| it harder for those functions to have correct values out
| of distribution.
| leereeves wrote:
| Definitely. That's a fundamental observation called the
| bias-variance tradeoff. More flexible models are prone to
| overfitting, hitting each training point exactly with
| wild gyrations in between.
|
| Big AI minimizes that problem by using more data. So much
| data that the model often only sees each data point once
| and overfitting is unlikely.
| ziofill wrote:
| But while keeping the data constant, adding more and more
| parameters is a strategy that works, so what gives? Are
| the functions getting somehow regularized during training
| so effectively you could get away with fewer parameters,
| it's just that we don't have the right model just yet?
| eru wrote:
| Sorry, when I meant 'shape' of the function, I meant the
| shape of the abstract syntax tree (or something like
| that).
|
| Not the shape of its graph when you draw it.
| refulgentis wrote:
| More directly than my first attempt: you're continuing
| the error here. The naïf's approach of "it's
| approximating some function" both maps to reality and
| makes accurate predictions. The more we couple ourselves
| to "no no no, it's modeling a precise function", the more
| we end up wrong, both on how it works in theory and in
| practice.
| dboreham wrote:
| This reminds me of control systems theory where provided
| there's feedback, the forward transfer function doesn't
| matter beyond very basic properties around the origin.
| mirekrusin wrote:
| Isn't the transformer the bogosort here, and the proposed
| modified RNN (175x faster training at sequence length 512) the
| quicksort?
| f1shy wrote:
| Wait! We certainly did NOT have huge datasets (like current
| internet) for ages. Not even decades. I've seen a lecture by
| an MIT professor (which I cannot find now) where he asserted
| categorically, that the advances in AI are mostly because of
| the huge data that we now have and we didn't before. And that
| was an _old_ video.
| yosefk wrote:
| Whichever way it's true in, it's not true in the sense that
| eg you can approximate any curve with a single-layer neural
| net -- you're not actually going to be able to do that for
| problems CNNs or transformers work decently on. And Google
| indexed all of the public Internet way before its
| researchers came up with transformers.
|
| Another way to look at it is that, like you say, it was an
| old video, but there has been progress since, even though we
| already had large datasets (by its own definition) when it
| came out.
| tsimionescu wrote:
| I think by far the biggest advances are related to compute
| power. The amount of processing needed to run training
| algorithms on the amounts of data needed for the latest
| models was just not possible even five years ago, and
| definitely not ten years ago.
|
| I'm sure there are optimizations from the model shape as
| well, but I don't think that running the best algorithms we
| have today with hardware from five-ten years ago would have
| worked in any reasonable amount of time/money.
| freeqaz wrote:
| A 30bn param model, hell even a 7bn param model, is still
| incredibly useful and I feel like that could have been
| doable a decade ago!
|
| We have GPT-4 (or at least 3.5) tier performance in these
| much smaller models now. If we teleported back in time it
| may have been possible to build
| tsimionescu wrote:
| I think the size of the model is only one part of it.
| They're still training these 7bn parameter models on the
| whole data set, and just crunching through that takes
| enormous compute, that people just didn't have at the
| current price points until now.
|
| I should also mention that the idea itself of using GPUs
| for compute and then specifically for AI training was an
| innovation. And the idea that simply scaling up was going
| to be worth the investment is another major innovation.
| It's not just the existence of the compute power, it's
| the application to NN training tasks that got us here.
|
| Here[0] is an older OpenAI post about this very topic.
| They estimate that between 2012 and 2018, the compute
| power used for training the SotA models at those times
| increased roughly 300,000 times, doubling every ~3.5
| months.
|
| [0] https://openai.com/index/ai-and-compute/
| _giorgio_ wrote:
| Chollet is just a philosopher. He also thinks that keras and
| tensorflow are important, when nobody uses those. And he
| published false data about their usage.
| eru wrote:
| Well, you also need an approach to 'curve fitting' where it's
| actually computationally feasible to fit the curve. The
| approach of mixing layers of matrix multiplication with a
| simple non-linearity like max(0, x) (ReLU) works really well
| for that. Earlier on they tried more complicated non-
| linearities, like sigmoids, or you could try an arbitrary curve
| that's not split into layers at all, you would probably find it
| harder. (But I'm fairly sure in the end you might end up in the
| same place, just after lots more computation spent on fitting.)
| tippytippytango wrote:
| Inductive bias matters. A lot.
| avereveard wrote:
| well yes but actually no I guess: the transformers' benefit at
| the time was that they were more stable while learning,
| enabling larger and larger networks and datasets to be learnt.
| WithinReason wrote:
| If you've spent some time actually training networks you know
| that's not true; that's why batch norm, dropout, and
| regularization are so successful. They don't increase the
| network's capacity (parameter count) but they do increase its
| ability to learn.
| m11a wrote:
| It'd be nice to see more of how this compares to Mamba. Looks
| like, in performance, they're not leagues apart and it's just a
| _different_ architecture, not necessarily better or worse?
| yazzku wrote:
| Look at the memory consumption diagram on page 6. It looks like
| you're basically getting the same running time for less memory
| usage.
| dsamarin wrote:
| The name of the paper contrasts with the paper that spawned
| the Transformer architecture, which itself is a reference to the song
| "All You Need Is Love" by the Beatles.
| https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
| vundercind wrote:
| I eagerly await the backlash to suggesting any one thing is all
| you need, the first shot of which shall surely be titled: "'All
| you need' Considered Harmful"
| ants_everywhere wrote:
| Surely the universe is all you need though
| radarsat1 wrote:
| Interstellar taught me that love transcends the universe.
| Ergo..
| marcosdumay wrote:
| R == Recurrent
|
| In theory the answer to the question should be "yes": they are
| Turing complete.
|
| The real question is about how to train them, and the paper is
| about that.
| baanist wrote:
| Why aren't AI researchers automating the search for efficient
| architectures?
| ks2048 wrote:
| https://en.wikipedia.org/wiki/Neural_architecture_search
| kelseyfrog wrote:
| The search space is far too wide, difficult to
| parameterize, and there is a wide gap between effective and
| ineffective architectures - ie: a very small change can make
| a network effectively DOA.
| hedgehog wrote:
| Notably architecture search was popular for small vision
| nets where the cost of many training runs was low enough. I
| suspect some of the train-then-prune approaches will come
| back, but even there only by the best funded teams.
| ActorNightly wrote:
| There has been some work, but the problem is that it's such a
| massive search space. Philosophically speaking, if you look
| at how humans came into existence, you could make an argument
| that the process of evolution from basic lifeforms can be
| represented as one giant computation per minute across all of
| Earth, where genetic selection happens and computation
| proceeds to the next minute. That's a fuckload of compute.
|
| In more practical terms, you would imagine that an advanced
| model contains some semblance of a CPU to be able to truly
| reason. Given that CPUs can be all NAND gates (which take 2
| neurons to represent), and are structured in a recurrent way,
| you fundamentally have to rethink how to train such a
| network, because backprop obviously won't work to capture
| things like binary decision points.
| baanist wrote:
| I thought the whole point of neural networks was that they
| were good at searching through these spaces. I'm pretty
| sure OpenAI is pruning their models behind the scenes to
| reduce their costs because that's the only way they can
| keep reducing the cost per token. So their secret sauce at
| this point is whatever pruning AI they're using to whittle
| the large computation graphs into more cost efficient
| consumer products.
| spencerchubb wrote:
| When you train a neural network, it is not search, it is
| descending through a curve.
|
| If you were to search for billions of parameters by brute
| force, you literally could not do it in the lifespan of
| the universe.
|
| A neural network is differentiable, meaning you can take
| the derivative of it. You train the parameters by finding
| the gradient with respect to each parameter, and
| going in the opposite direction. Hence the name of the
| popular algorithm, gradient descent.
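| A minimal sketch of the difference (my own illustration: a
| one-parameter "network" y = w * x fit to data generated with
| w = 3): no values of w are enumerated or searched; each step
| just follows the derivative of the loss downhill.
|
| import numpy as np
|
| rng = np.random.default_rng(0)
| x = rng.standard_normal(100)
| y = 3.0 * x                                 # data from the "true" w = 3
|
| w, lr = 0.0, 0.1
| for _ in range(100):
|     grad = np.mean(2.0 * (w * x - y) * x)   # d/dw of mean squared error
|     w -= lr * grad                          # step against the gradient
| print(w)                                    # converges to ~3.0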
| bob1029 wrote:
| A biological neural network is certainly not
| differentiable. If the thing we want to build is not
| realizable with this technique, why can't we move on from
| it?
|
| Gradient descent isn't the only way to do this.
| Evolutionary techniques can explore impossibly large,
| non-linear problem spaces.
|
| Being able to define any kind of fitness function you
| want is sort of like a super power. You don't have to
| think in such constrained ways down this path.
| og_kalu wrote:
| >A biological neural network is certainly not
| differentiable
|
| Biology is biology and has its constraints. Doesn't
| necessarily mean a biologically plausible optimizer would
| be the most efficient or correct way in silicon.
|
| >If the thing we want to build is not realizable with
| this technique, why can't we move on from it?
|
| All the biologically plausible optimizers we've fiddled
| with (and we've fiddled with quite a lot) just work
| (results wise) like gradient descent but worse. We've not
| "moved on" because gradient descent is and continues to
| be better.
|
| >Evolutionary techniques can explore impossibly large,
| non-linear problem spaces.
|
| Sure, with billions of years (and millions of concurrent
| experiments) on the table.
| xpe wrote:
| Program synthesis is a generalization of this. I'm not sure
| that many ML researchers have thought about the connections
| yet.
| jjtheblunt wrote:
| What are you saying is Turing-complete?
| baanist wrote:
| Neural networks are Turing complete, i.e. there is a
| universal neural network that can compute any effectively
| computable function [1]. Incidentally, when this is combined
| with Rice's theorem [2] it means that safety research is
| essentially an unsolvable problem because any non-trivial
| property of a sufficiently complex neural network, e.g. one
| that can simulate a Turing machine, will have properties
| which can not be predicted with finite computation.
|
| 1: https://www.sciencedirect.com/science/article/pii/08939659
| 91...
|
| 2: https://en.wikipedia.org/wiki/Rice%27s_theorem?useskin=vec
| to...
| jjtheblunt wrote:
| super interesting, and i'd not seen either reference.
| thanks very much.
| logicchains wrote:
| The model in the paper isn't a "real" RNN due to making it
| parallelizable, for the same reasons described in
| https://arxiv.org/abs/2404.08819 , and hence is theoretically
| less powerful than a "real" RNN (struggles at some classes of
| problems that RNNs traditionally excel at). On the other hand,
| https://arxiv.org/abs/2405.04517 contains a "real" RNN component,
| which demonstrates a significant improvement on the kind of
| state-tracking problems that transformers struggle with.
| robertsdionne wrote:
| These are real RNNs, they still depend upon the prior hidden
| state, it's just that the gating does not. The basic RNN
| equation can be parallelized with parallel prefix scan
| algorithms.
| bob1029 wrote:
| > Transformers required ~2.5x more training steps to achieve
| comparable performance, overfitting eventually.
|
| > RNNs are particularly suitable for sequence modelling settings
| such as those involving time series, natural language processing,
| and other sequential tasks where context from previous steps
| informs the current prediction.
|
| I would like to draw an analogy to digital signal processing. If
| you think of the recurrent-style architectures as IIR filters and
| feedforward-only architectures as FIR filters, you will likely
| find many parallels.
|
| The most obvious to me being that IIR filters typically require
| far fewer elements to produce the same response as an equivalent
| FIR filter. Granted, the FIR filter is often easier to
| implement/control/measure in practical terms (fixed-point
| arithmetic hardware == ML architectures that can run on GPUs).
|
| I don't think we get to the exponential scary part of AI without
| some fundamentally recurrent architecture. I think things like
| LSTM are kind of an in-between hack in this DSP analogy - You
| could look at it as FIR with dynamic coefficients. Neuromorphic
| approaches seem like the best long term bet to me in terms of
| efficiency.
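|
| To make the IIR-vs-FIR point above concrete, here is a toy numpy
| sketch (my own arbitrary numbers, nothing from the paper): a
| one-coefficient IIR low-pass needs a couple hundred FIR taps before
| the truncation error becomes negligible.
|
|     import numpy as np
|
|     alpha = 0.95                          # single IIR feedback coefficient
|     n = np.arange(200)
|     impulse = (1 - alpha) * alpha ** n    # y[t] = alpha*y[t-1] + (1-alpha)*x[t]
|
|     # an FIR approximation needs one tap per sample of the (truncated)
|     # impulse response: 200 taps here versus 1 recursive coefficient
|     x = np.random.randn(1000)
|     y_fir = np.convolve(x, impulse)[:1000]
|
|     y_iir = np.zeros(1000)
|     for t in range(1000):
|         y_iir[t] = alpha * (y_iir[t - 1] if t else 0.0) + (1 - alpha) * x[t]
|
|     print(np.max(np.abs(y_fir - y_iir)))  # tiny, since 0.95**200 is negligible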
| wslh wrote:
| ELI5: Could you explain what neuromorphic approaches mean, and
| how they contribute to AI/AGI? My first impression as a
| layperson (probably wrong) is that this approach resembles
| ideas from the book "The Society of the Mind", where the system
| isn't just simulating neurons but involves a variety of methods
| and interactions across "agents" or sub-systems.
| bob1029 wrote:
| Neuromorphic mostly just means "like how the brain works". It
| encompasses a variety of software & hardware approaches.
|
| The most compelling and obvious one to me is hardware
| purpose-built to simulate spiking neural networks. In the
| happy case, SNNs are extremely efficient. Basically consuming
| no energy. You could fool yourself into thinking we can just
| do this on the CPU due to the sparsity of activations. I
| think there is even a set of problems this works well for.
| But, in the unhappy cases SNNs are impossible to simulate on
| existing hardware. Neuronal avalanches follow a power-law
| distribution, and meaningfully-large ones would require very
| clever techniques to simulate with any reasonable fidelity.
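|
| For a rough idea of what an SNN simulation looks like, here is a toy
| leaky integrate-and-fire neuron in plain Python (arbitrary numbers,
| not tied to any particular neuromorphic platform): the membrane
| potential decays, accumulates input events, and fires on threshold.
| An event-driven simulator would only do work at the spike times,
| which is where the efficiency claim comes from.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     decay, threshold, weight, steps = 0.9, 1.0, 0.6, 1000
|     inputs = rng.random(steps) < 0.05          # sparse input spike train
|     v, spikes = 0.0, 0
|     for t in range(steps):
|         v = decay * v + (weight if inputs[t] else 0.0)
|         if v >= threshold:                     # fire and reset
|             spikes += 1
|             v = 0.0
|     print(spikes, "output spikes in", steps, "steps")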
|
| > the system isn't just simulating neurons but involves a
| variety of methods and interactions across "agents" or sub-
| systems.
|
| I think the line between "neuron" and "agent" starts to get
| blurry in this arena.
| seanhunter wrote:
| We somehow want a network that is neuromorphic in structure
| but we don't want it to be like the brain and take 20 years
| or more to train?
|
| Secondly how do we get to claim that a particular thing is
| neuromorphic when we have such a rudimentary understanding
| of how a biological brain works or how it generates things
| like a model of the world, understanding of self etc etc.
| planetpluta wrote:
| Something to consider is that it really could take 20+
| years to train like a brain. But once you've trained it,
| you can replicate at ~0 cost, unlike a brain.
| kybernetikos wrote:
| > we don't want it to be like the brain and take 20 years
| or more to train?
|
| Estimates put GPT-4's training at something like 2500
| GPU-years, spread over about 10000 GPUs. 20 years would be a
| big improvement.
| seanhunter wrote:
| 1 GPU year is in no way comparable to 1 chronological
| year of learning for a human brain though.
| kybernetikos wrote:
| Yes, but the underlying point is that in this case you
| can train the AI in parallel, and there's a decent chance
| this or something like it will be true for future AI
| architectures too. What does it matter that the AI needs
| to be trained on 20 years of experiences if all of those
| 20 years can be experienced in 6 months given the right
| hardware?
| wslh wrote:
| My take, for pragmatic reasons rather than how the brain
| actually works, is that an agent-based architecture is
| great because some tasks can be solved more effectively by
| specific algorithms or workflows rather than operating at
| the low level of neural networks (NN).
| mafribe wrote:
| Neuromorphic has been an ongoing failure (for general purpose
| processors or even AI accelerators), ever since Carver Mead
| introduced (and quickly abandoned) them nearly half a
| century ago. Bill Dally (NVidia CTO) concurs: _" I keep
| getting those calls from those people who claim they are
| doing neuromorphic computing and they claim there is
| something magical about it because it's the way that the
| brain works ... but it's truly more like building an airplane
| by putting feathers on it and flapping with the wings!"_
| From: Hardware for Deep Learning, HotChips 2023 keynote.
|
| We have NO idea how the brain produces intelligence, and as
| long as that doesn't change, "neuromorphic" is merely a
| marketing term, like Neurotypical, Neurodivergent,
| Neurodiverse, Neuroethics, Neuroeconomics, Neuromarketing,
| Neurolaw, Neurosecurity, Neurotheology, Neuro-Linguistic
| Programming: the "neuro-" prefix is suggesting a deep
| scientific insight to fool the audience. There is no hope of
| us cracking the question of how the human brain produces
| high-level intelligence in the next decade or so.
|
| Neuromorphic does work for some special purpose applications.
| chasd00 wrote:
| I like the feather analogy. Early on all humans knew about
| flight was from biology (watching birds fly) but trying to
| make a flying machine modeled after a bird would never
| work. We can fly today but plane designs are nothing like
| biological flying machines. In the same way, all we know
| about intelligence comes from biology and trying to invent
| an AGI modeled on biological intelligence may be just as
| impossible as a plane designed around how birds fly.
|
| /way out of my area of expertise here
| quotemstr wrote:
| And it's only now, having built our own different kind of
| flying machine, that we understand the principles of
| avian flight well enough to build our own ornithopters.
| (We don't use ornithopters because they're not practical,
| but we've known how to build them since the 1960s.) We
| would have never gotten here had we just continued to try
| to blindly copy birds.
| fennecfoxy wrote:
| I love this book and have it sitting on my shelf right now!
| Read it when I was a kid and was amazed at the ideas in it;
| nowadays it's clearer to me that the author had only a rough
| grasp of how things like that would actually be built, but
| it's still cool nonetheless.
|
| I would highly recommend it to people who love a good "near
| future" scifi book.
| bwanab wrote:
| I'm sure you know this, but I think "the author" Marvin
| Minsky should be mentioned by name since he was one of the
| foundational theorists in the field of AI in general, but
| particularly in NNs.
| manjunaths wrote:
| Can we even implement IIR filters with good performance and
| scaling on current architectures like GPUs?
| bob1029 wrote:
| I don't think so. FIR filters can be unrolled and
| parallelized over the data. These are definitely possible to
| do on GPU to great effect. But, IIR filters constantly depend
| on the output of the prior time step, so you can't unroll
| anything. These would probably be faster to simulate on the
| CPU.
| x3haloed wrote:
| > I don't think we get to the exponential scary part of AI
| without some fundamentally recurrent architecture
|
| I've been thinking the same for a while, but I'm starting to
| wonder if giant context windows are good enough to get us
| there. I think recurrency is more neuromorphic, and possibly
| important in the longer run, but maybe not required for SI.
|
| I'm also just a layman with just a surface level understanding
| of these things, so I may be completely ignorant and wrong.
| lr1970 wrote:
| Again from signal processing: depending on the position of the
| poles in the z-transformed transfer function, an IIR filter has
| a narrow stability region that is typically carefully designed
| for. Otherwise IIR filters either exponentially decay to zero
| or exponentially grow to infinity. RNN cells like LSTM
| are "decaying filters" with non-linear gates introduced to stop
| decay and to "remember" things.
|
| FIR filters are way simpler to design and can capture memory
| without hacks.
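|
| A tiny toy example of that stability point (arbitrary numbers): the
| same recurrence y[t] = p * y[t-1] either dies out or blows up
| depending on whether the pole magnitude |p| is below or above 1.
|
|     for p in (0.99, 1.01):
|         y = 1.0
|         for _ in range(1000):
|             y = p * y              # impulse response after an initial 1.0
|         print(p, y)                # roughly 4e-5 versus roughly 2e4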
| PunchTornado wrote:
| To me this is further evidence that these LLMs learn only to
| speak English, but there is no reasoning at all in them. If you
| can simplify a lot and still obtain the same results, even
| though we know how complex the brain is, that only reinforces
| the point.
| quantadev wrote:
| Every LLM expert on the planet agrees LLMs are doing
| "reasoning". No one says they have feelings or qualia, but we
| all know there's definitely genuinely artificial reasoning
| happening.
|
| What LLMs have shown both Neuroscience and Computer Science is
| that reasoning is a mechanical process (or can be simulated by
| mechanical processes) and is not purely associated only with
| consciousness.
| roboboffin wrote:
| I'm not sure that's true at all. There are several well known
| researchers that say LLMs are in fact not doing reasoning.
| quantadev wrote:
| Those are all the people that have not yet decoupled
| "reasoning" from "consciousness" in their own way of
| thinking. It's admittedly hyperbolic to say "everyone". I
| love hyperbole on HN. :)
| roboboffin wrote:
| For example, papers like this call into question whether
| or not a LLM can plan:
|
| https://arxiv.org/html/2409.13373v1
|
| This is a basic form of reasoning, to plan out the steps
| needed to execute something.
| quantadev wrote:
| Planning, by definition, takes multiple reasoning steps.
| A single LLM inference is a fundamental single reasoning
| step, but it's a reasoning step nonetheless.
|
| It's like I'm saying a house is made of bricks. You can
| build a house of any shape out of bricks. But once bricks
| have been invented you can build houses. The LLM
| "reasoning" that even existed as early as GPT3.5 was the
| "brick" with which highly intelligent agents can be built
| out of, with no further "breakthroughs" being required.
|
| The basic Transformer Architecture was enough and already
| has the magical ingredient of reasoning. The rest is just
| a matter of prompt engineering.
| roboboffin wrote:
| It's not reasoning, it's retrieval of a pattern, and that
| pattern may contain reasoning.
|
| The prompt engineering is the real reasoning, provided by
| the human.
| quantadev wrote:
| Yeah, these kinds of discussions always devolve purely
| into debates about what's the proper definition of words.
| Especially on HN where everyone has their "Pedantic Knob"
| dialed up to 11.
| roboboffin wrote:
| I understand your point. I apologise if I am coming
| across as pedantic.
|
| My point is computers already follow algorithms, and
| algorithms contain reasoning; but the computers are not
| reasoning themselves. At least, not yet!
| arolihas wrote:
| You're not being pedantic at all. It's a crucial
| distinction that people try to wave away in favor of
| hype. Especially since we are so vulnerable to
| anthropomorphizing.
| quantadev wrote:
| You weren't being pedantic yourself. My point is that
| this discussion is ultimately about the definition of
| words, and that all by itself, makes the discussion
| meaningless.
|
| I think a "granule" of "reasoning" happens at each
| inference, and you think there is no reasoning in a
| single inference. To discuss it further would be a game
| of whose definition of any given word is correct.
| adamnemecek wrote:
| Yes, all machine learning can be interpreted in terms of
| approximating the partition function.
|
| This is obvious when one considers the connections between
| Transformers, RNNs, Hopfield networks and the Ising model, a
| model from statistical mechanics which is solved by calculating
| the partition function.
|
| This interpretation provides us with some very powerful tools
| that are commonplace in math and physics but which are not talked
| about in CS & ML.
|
| I'm working on a startup http://traceoid.ai which takes this
| exact view. Our approach enables faster training and inference,
| interpretability and also scalable energy-based models, the Holy
| Grail of machine learning.
|
| Join the discord https://discord.com/invite/mr9TAhpyBW or follow
| me on twitter https://twitter.com/adamnemecek1
| mkaic wrote:
| I strongly enjoy the simplicity of their "minGRU" architecture.
| It's basically just:
|
|     import torch
|     import torch.nn as nn
|
|     class MinGRU(nn.Module):
|         def __init__(self, token_size, hidden_state_size):
|             super().__init__()
|             self.token_to_proposal = nn.Linear(token_size, hidden_state_size)
|             self.token_to_mix_factors = nn.Linear(token_size, hidden_state_size)
|
|         def forward(self, previous_hidden_state, current_token):
|             proposed_hidden_state = self.token_to_proposal(current_token)
|             mix_factors = torch.sigmoid(self.token_to_mix_factors(current_token))
|             # lerp(a, b, w) = a + w*(b - a), i.e. a gated mix of the
|             # proposal with the previous hidden state
|             return torch.lerp(proposed_hidden_state, previous_hidden_state, mix_factors)
|
| And since the proposed hidden states and mix factors for each
| layer are both only dependent on the current token, you can
| compute all of them in parallel if you know the whole sequence
| ahead of time (like during training), and then combine them in
| linear time using parallel scan.
|
| The fact that this is competitive with transformers and state-
| space models in their small-scale experiments is gratifying to
| the "best PRs are the ones that delete code" side of me. That
| said, we won't know for sure if this is a capital-B Breakthrough
| until someone tries scaling it up to parameter and data counts
| comparable to SOTA models.
|
| One detail I found really interesting is that they seem to do all
| their calculations in log-space, according to the Appendix. They
| say it's for numerical stability, which is curious to me--I'm not
| sure I have a good intuition for why running everything in log-
| space makes the model more stable. Is it because they removed the
| tanh from the output, making it possible for values to explode if
| calculations are done in linear space?
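|
| A hypothetical illustration of the stability issue (my own toy code,
| not the paper's): composing thousands of sigmoid gates
| multiplicatively underflows float32 to zero, while summing their
| logs stays finite, which is presumably why they accumulate in
| log-space.
|
|     import torch
|
|     torch.manual_seed(0)
|     gate_logits = torch.randn(5000)
|     gates = torch.sigmoid(gate_logits)
|
|     linear_prod = torch.cumprod(gates, dim=0)[-1]                 # underflows to 0.0
|     log_prod = torch.nn.functional.logsigmoid(gate_logits).sum()  # stays finite
|     print(linear_prod.item(), log_prod.item())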
|
| EDIT: Another thought--it's kind of fascinating that this sort of
| sequence modeling works at all. It's like if I gave you all the
| pages of a book individually torn out and in a random order, and
| asked you to try to make a vector representation for each page as
| well as instructions for how to mix that vector with the vector
| representing all previous pages -- except you have zero knowledge
| of those previous pages. Then, I take all your page vectors,
| sequentially mix them together in-order, and grade you based on
| how good of a whole-book summary the final vector represents.
| Wild stuff.
|
| FURTHER EDIT: Yet _another_ thought--right now, they're just
| using two dense linear layers to transform the token into the
| proposed hidden state and the lerp mix factors. I'm curious what
| would happen if you made those transforms MLPs instead of
| singular linear layers.
| immibis wrote:
| This architecture, on the surface, seems to preclude the basic
| function of recognizing sequences of tokens. At the very least,
| it seems like it should suffer from something like the pumping
| lemma: if [the ][cat ][is ][black ] results in the output
| getting close to a certain vector, [the ][cat ][is ][black
| ][the ][cat ][is ][black ][the ][cat ][is ][black ] should get
| even closer to that vector and nowhere close to a "why did you
| just repeat the same sentence three times" vector? Without non-
| linear mixing between input token and hidden state, there will
| be a lot of linear similarities between similar token
| sequences...
| mkaic wrote:
| Counterpoint: the hidden state at the beginning of
| ([the][cat][is][black]) x 3 is (probably) initialized to all
| zeros, but after seeing those first 4 tokens, it will _not_
| be all zeros. Thus, going into the second repetition of the
| sentence, the model has a different initial hidden state, and
| should exhibit different behavior. I think this makes it
| possible for the model to learn to recognize repeated
| sequences and avoid your proposed pitfall.
| immibis wrote:
| The new hidden state after the first repetition will just
| be a linear combination between zero and what the non-
| recurring network outputs. After more repetitions, it will
| be closer to what the network outputs.
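|
| A scalar toy run of that argument (made-up per-token numbers, same
| update rule as the lerp above): repeating the sentence contracts the
| state geometrically toward a fixed point, so each repetition does
| change the state, but by less and less.
|
|     proposals = [0.3, -0.5, 0.8, 0.1]      # stand-ins for per-token proposals
|     mixes     = [0.6, 0.7, 0.5, 0.8]       # stand-ins for per-token gates
|
|     def run_sentence(h):
|         for p, z in zip(proposals, mixes):
|             h = (1 - z) * p + z * h        # h = lerp(p, h, z)
|         return h
|
|     h = 0.0
|     for rep in range(1, 6):
|         h = run_sentence(h)
|         print(rep, h)                      # step sizes shrink by prod(mixes) each repeat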
| slashdave wrote:
| Log space is important if the token probabilities span a large
| range of values (powers). There is a reason that maximum
| likelihood fitting is always performed with log likelihoods.
| aDyslecticCrow wrote:
| I don't think it's a capital-B Breakthrough, but recurrent
| networks are everywhere, and a simplification that improves
| training and performance clears the stage to build complexity
| back up again to even greater heights.
| trott wrote:
| My feeling is that the answer is "no", in the sense that these
| RNNs wouldn't be able to universally replace Transformers in
| LLMs, even though they might be good enough in some cases and
| beat them in others.
|
| Here's why.
|
| A user of an LLM _might_ give the model some long text and then
| say "Translate this into German please". A Transformer can look
| back at its whole history. But what is an RNN to do? While the
| length of its context is unlimited, the amount of information the
| model retains about it is bounded by whatever is in its hidden
| state at any given time.
|
| Relevant: https://arxiv.org/abs/2402.01032
| mkaic wrote:
| The counterargument here is that you can just scale the size of
| the hidden state sufficiently such that it can hold compressed
| representations of whatever-length sequence you like.
| Ultimately, what I care about is whether RNNs could compete
| with transformers if FLOPs are held constant--something TFA
| doesn't really investigate.
| psb217 wrote:
| Well, that's what Transformer already does... One problem
| with the scaling you're describing is that there would be a
| massive amount of redundant information stored in hidden
| activations during training the RNN. The hidden state at each
| time step t in the sequence would need to contain all info
| that (i) could be useful for predicting the token at time t
| and (ii) that could be useful for predicting tokens at times
| >t. (i) is obvious and (ii) is since all information about
| the past is transferred to future predictions through the
| current hidden state. In principle, Transformers can avoid
| storing redundant info in multiple hidden states at the cost
| of having to maintain and access (via attention) a larger
| hidden state at test/eval time.
| mkaic wrote:
| > there would be a massive amount of redundant information
| stored in hidden activations
|
| Is there a way to prove this? One potential caveat that
| comes to mind for me is that perhaps the action of lerping
| between the old state and the new could be used by the
| model to perform semantically meaningful transformations on
| the old state. I guess in my mind it just doesn't seem
| obvious that the hidden state is necessarily a collection
| of "redundant information" -- perhaps the information is
| culled/distilled the further along in the sequence you go?
| There will always be _some_ redundancy, sure, but I don't
| think that such redundancy necessarily means we _have_ to
| use superlinear methods like attention.
| psb217 wrote:
| All information about the past which will be available
| for predicting future tokens must be stored in the
| present state. So, if some bits of info about some past
| tokens at times less than t_p will be used for predicting
| some future token at time t_f, those bits must be passed
| through all states at times from t_p to t_f. The bits are
| passed through the recurrence. Once information about
| past tokens is lost from the hidden state it is gone
| forever, so it must be stored and carried across many
| steps up until it finally becomes useful.
|
| The information cost of making the RNN state way bigger
| is high when done naively, but maybe someone can figure
| out a clever way to avoid storing full hidden states in
| memory during training or big improvements in hardware
| could make memory use less of a bottleneck.
| phkahler wrote:
| >> A user of an LLM might give the model some long text and
| then say "Translate this into German please". A Transformer can
| look back at its whole history.
|
| Which isn't necessary if you instead say "translate the
| following to German." Then all it needs is to remember the task
| at hand and a much smaller amount of recent input. Well, and the
| ability to output in parallel with processing input.
| og_kalu wrote:
| It's necessary for arbitrary information processing if you
| can forget and have no way to "unforget".
|
| A model can decide to forget something that turns out to be
| important for some future prediction. A human can go back and
| re-read/listen etc, A transformer is always re-reading but a
| RNN can't and is fucked.
| magicalhippo wrote:
| That's just because we twisted its arm. One could for
| example feed the reversed input after, i.e. abc|cba where |
| is a special token. That would allow it to react to any
| part of the message.
| ebalit wrote:
| I think this might be key, in addition to some landmark
| tokens to quickly backtrack to. The big question is how
| to train such a model.
|
| There is a recent paper from Meta that proposes a way to
| train a model to backtrack its generation to improve
| generation alignment [0].
|
| [0] https://arxiv.org/html/2409.14586v1
| tsimionescu wrote:
| If these networks are to ever be a path to something closer to
| general intelligence, they will anyway need to be able to
| ask for context to be repeated, or to have separate storage
| where they can "choose" to replay it themselves. So this
| problem likely has to be solved another way anyway, both
| for transformers and for RNNs.
| og_kalu wrote:
| For a transformer, context is already always being
| repeated at every token. They can fetch information that
| _became_ useful anytime they want. I don't see what
| problem there is to solve here.
| tsimionescu wrote:
| For a transformer, context is limited, so the same kind
| of problem applies after you exceed some size.
| trott wrote:
| People did something similar to what you are describing 10
| years ago: https://arxiv.org/abs/1409.0473
|
| But it's trained on translations, rather than the whole
| Internet.
| DoctorOetker wrote:
| Also, a lightweight network could do a first pass to identify
| tasks, instructions, constraints etc, and then a second pass
| could use the RNN.
|
| Consider the flood fill algorithm or union-find algorithm,
| which feels magical upon first exposure.
|
| https://en.wikipedia.org/wiki/Hoshen%E2%80%93Kopelman_algori.
| ..
|
| Having 2 passes can enable so much more than a single pass.
|
| Another alternative could be to have a first pass make notes
| in a separate buffer while parsing the input. The bandwidth
| of the note taking and reading can be much much lower than
| that required for fetching the billions of parameters.
| slashdave wrote:
| > the amount of information the model retains about it is
| bounded by whatever is in its hidden state
|
| This is no different than a transformer, which, after all, is
| bound by a finite state, just organized in a different manner.
| trott wrote:
| > This is no different than a transformer, which, after all,
| is bound by a finite state, just organized in a different
| manner.
|
| It's not just a matter of organizing things differently.
| Suppose your network dimension and sequence length are both
| X.
|
| Then your memory usage (per layer) will be O(X^2), while your
| training update cost will be O(X^3). That's for both
| Transformers and RNNs.
|
| However, at the end of the sequence, a Transformer layer can
| look back and see O(X^2) numbers, while an RNN can only see O(X)
| numbers.
| slashdave wrote:
| Simplistic thinking. An RNN hidden parameter space of high
| dimension provides plenty of room for linear projections of
| token histories. I think people just do not realize just
| how huge R^N can be.
| trott wrote:
| > Simplistic thinking. An RNN hidden parameter space of
| high dimension provides plenty of room for linear
| projections of token histories. I think people just do
| not realize just how huge R^N can be.
|
| 16N bits is the hard limit, but more realistically it's
| probably about 2N bits or less of useful information.
|
| You'd need to grow the network dimension in proportion to
| the maximum sequence length just to avoid the information
| theoretical limit.
| f_devd wrote:
| Transformers actually have a quantifiable state size (see
| https://hazyresearch.stanford.edu/static/posts/2024-06-22-a
| c...) although it's anywhere between 200k and 2M floats
| (for 360M and 1.33B respectively iinm). So a sufficiently
| sized RNN could have the same state capacity as a
| transformer.
|
| (this is from the Based paper:
| https://arxiv.org/pdf/2402.18668)
| trott wrote:
| > Transformers actually have an quantifiable state size
|
| Are you griping about my writing O(X^2) above instead of
| precisely 2X^2, like this paper? The latter implies the
| former.
|
| > So a sufficiently sized RNN could have the same state
| capacity as a transformer.
|
| Does this contradict anything I've said? If you increase
| the size of the RNN, while keeping the Transformer fixed,
| you can match their recurrent state sizes (if you don't
| run out of RAM or funding)
| f_devd wrote:
| I was responding to
|
| > a Transformer layer can look back see O(X^2) numbers,
| while an RNN can only see O(X) numbers
|
| The thing is an RNN can look back infinitely if you don't
| exceed the state capacity. For transformers the state is
| defined semi-implicitly by the number of tokens (you can
| change the hidden dims but you cannot extend the look-back;
| ignoring transformer-xl et al.), while for an RNN it's
| defined explicitly by the state size.
|
| The big-O here is irrelevant for the architectures since
| it's all in the configuration & implementation of the
| model; i.e. there is no relevant asymptote to compare.
|
| As an aside this was what was shown in the based paper,
| the fact that you can have a continuity of state (as with
| RNN) while having the same associative recall capability as
| a transformer (the main downfall of recurrent methods at
| that point).
| trott wrote:
| > The big-O here is irrelevant for the architectures
| since it's all in the configuration & implementation of
| the model; i.e. there is no relevant asymptote to
| compare.
|
| ?!
|
| NNs are like any other algorithm in this regard. Heck,
| look at the bottom of page 2 of the Were RNNs All We
| Needed paper. It has big-O notation there and elsewhere.
|
| > I was responding to
|
| >> a Transformer layer can look back see O(X^2) numbers,
| while an RNN can only see O(X) numbers
|
| In the BASED paper, in Eq. 10, sizeof(s) = 2dN. But I
| defined d = N = X above. Ergo, sizeof(s) = 2X^2 = O(X^2).
|
| For minGRU, sizeof(s) = d. Ergo, sizeof(s) = X = O(X).
| f_devd wrote:
| That's just the state calculation which would be O(N) and
| O(1) respectively. The based paper is saying if you made
| Transformers recurrent you would have a state size of 2Nd
| -> O(N), while based has a state size of d*d' -> O(1).
|
| Transformers do have O(N^2) time & memory complexity, and
| Based/RNN/SSM {O(N) time, O(1) mem}, with respect to
| sequence length if that's what you mean. The point is it
| doesn't really give an indication of quality.
|
| We can choose our constant arbitrarily so the big-O
| you've stated only indicates memory/time-complexity not
| 'look-back' ability relevant to any task. If you input
| the entire sequence N times into an RNN, you also have
| perfect recall with O(N^2) but it's not exactly an
| efficient use of our resources.
|
| Ideally our state memory is maximally utilized; this is
| the case for RNNs in the limit (although likely
| oversubscribed) but is not the case for transformers. The
| holy grail is to have an input-dependent state-size,
| however that is quite difficult.
| tgv wrote:
| That problem has plagued RNNs since the 90s: there's an
| information precision problem (how many bits do you need older
| states to carry), a decay problem (the oldest information is
| the weakest) and a mixing problem (it tends to mix/sum
| representations).
| fhdsgbbcaA wrote:
| We really need a [preprint] flag for unreviewed papers.
| lgessler wrote:
| IMHO reviews are almost indistinguishable from noise at the AI
| conferences I'm familiar with these days anyway, so I don't see
| much of a value add.
| fhdsgbbcaA wrote:
| Sad state of affairs, people are incentivized to get more
| papers and citations at all costs, and quality be damned.
|
| An AI Winter is not a great idea, but an AI Autumn may be
| beneficial.
|
| Just have no major AI conferences for '25, perhaps only
| accept really high tier literature reviews.
| limapedro wrote:
| This is such an interesting paper. Sadly they don't have big
| models; I'd like to see a model trained on TinyStories or even C4
| since it should be faster than the transformer variant and see
| how it compares.
| charlescurt123 wrote:
| I find the entire field lacking when it comes to long-horizon
| problems. Our current, widely used solution is to scale, but
| we're nowhere near achieving the horizon scales even small mammal
| brains can handle. Our models can have trillions of parameters,
| yet a mouse brain would still outperform them on long-horizon
| tasks and efficiency. It's something small, simple, and elegant--
| an incredible search algorithm that not only finds near-optimal
| routes but also continuously learns on a fixed computational
| budget.
|
| I'm honestly a bit envious of future engineers who will be
| tackling these kinds of problems with a 100-line Jupyter notebook
| on a laptop years from now. If we discovered the right method or
| algorithm for these long-horizon problems, a 2B-parameter model
| might even outperform current models on everything except short,
| extreme reasoning problems.
|
| The only solution I've ever considered for this is expanding a
| model's dimensionality over time, rather than focusing on perfect
| weights. The higher the dimensionality you can provide to a model,
| the greater its theoretical storage capacity. This could resemble
| a two-layer model--one layer acting as a superposition of
| multiple ideal points, and the other layer knowing how to use
| them.
|
| When you think about the loss landscape, imagine it with many
| minima for a given task. If we could create a method that
| navigates these minima by reconfiguring the model when needed, we
| could theoretically develop a single model with near-infinite
| local minima--and therefore, higher-dimensional memory. This may
| sound wild, but consider the fact that the human brain
| potentially creates and disconnects thousands of new connections
| in a single day. Could it be that these connections steer our
| internal loss landscape between different minima we need
| throughout the day?
| aDyslecticCrow wrote:
| Yes... The field lacks the HOLY GRAIL (long-horizon problems).
| But we don't need a mouse-brain to sort spam emails. The Hail
| Mary 2B+ parameter models and above are still niche uses of
| these algorithms (too heavy to run practically). There is
| plenty of room for clever and small models running on limited
| hardware and datasets to solve useful problems and nothing
| more.
|
| Models that change size as needed have been experimented with,
| but they are either too inefficient or difficult to optimize at
| a limited power budget. However, I agree that they are likely
| what is needed if we want to continue to scale upward in size.
|
| I suspect the real bottleneck is a breakthrough in training
| itself. Backpropagation loss is too simplistic to optimize our
| current models perfectly, let alone future larger ones. But
| there is no guarantee a better alternative exists which may
| create a fixed limit to current ML approaches.
| kgbcia wrote:
| Decision trees is all we needed
| vandahm wrote:
| I made an RNN for a college project because I was interested in
| obsolete historical technology and I thought I needed to seize
| the opportunity while it lasted, because once I was out of
| school, I'd never hear about neural networks ever again.
|
| Mine worked, but it was very simple and dog slow, running on my
| old laptop. Nothing was ever going to run fast on that thing, but
| I remember my RNN being substantially slower than a feed-forward
| network would have been.
|
| I was _so confident_ that this was dead technology -- an academic
| curiosity from the 1980s and 1990s. It was bizarre to see how
| quickly that changed.
| alkonaut wrote:
| I feel old. I wrote my master's thesis on RNNs for learning
| dynamic systems, e.g. for control purposes (quite a novelty at
| the time, around 2000). We wrote the backprop in C++ and ran it
| overnight. Yes, it was slow as hell with the tiny gradients.
| The network architectures were e.g. 5 or 10 neurons in a single
| hidden layer. NNs were a tiny subject that you were lucky to
| find courses in. Then I closed my eyes for two seconds and looked
| at the subject again in 2015. Wow.
| gdiamos wrote:
| RNNs always had better scaling law curves than transformers.
|
| BPTT was their problem
| Smerity wrote:
| Excited to see more people working on RNNs but wish their
| citations were better.
|
| In 2016 my team from Salesforce Research published our work on
| the Quasi-Recurrent Neural Network[1] (QRNN). The QRNN variants
| we describe are near identical (minGRU) or highly similar
| (minLSTM) to the work here.
|
| The QRNN was used, many years ago now, in the first version of
| Baidu's speech recognition system (Deep Voice [6]) and as part of
| Google's handwriting recognition system in Gboard[5] (2019).
|
| Even if there are expressivity trade-offs when using
| parallelizable RNNs they've shown historically they can work well
| and are low resource and incredibly fast. Very few of the
| possibilities regarding distillation, hardware optimization, etc,
| have been explored.
|
| Even if you need "exact" recall, various works have shown that
| even a single layer of attention with a parallelizable RNN can
| yield strong results. Distillation down to such a model is quite
| promising.
|
| Other recent fast RNN variants such as the RWKV, S4, Mamba et al.
| include citations to QRNN (2016) and SRU (2017) for a richer
| history + better context.
|
| The SRU work has also had additions in recent years (SRU++),
| doing well in speech recognition and LM tasks where they found
| similar speed benefits over Transformers.
|
| I note this primarily as the more data points, especially when
| strongly relevant, the better positioned the research is. A
| number of the "new" findings from this paper have been previously
| explored - and do certainly show promise! This makes sure we're
| asking new questions with new insights (with all the benefit of
| additional research from ~8 years ago) versus missing the work
| from those earlier efforts.
|
| [1] QRNN paper: https://arxiv.org/abs/1611.01576
|
| [2] SRU paper: https://arxiv.org/abs/1709.02755
|
| [3]: SRU++ for speech recognition:
| https://arxiv.org/abs/2110.05571
|
| [4]: SRU++ for language modeling:
| https://arxiv.org/abs/2102.12459
|
| [5]: https://research.google/blog/rnn-based-handwriting-
| recogniti...
|
| [6]: https://arxiv.org/abs/1702.07825
| hdivider wrote:
| I still find it remarkable how we need such an extreme amount of
| electrical energy to power large modern AI models.
|
| Compare with one human brain. Far more sophisticated, even beyond
| our knowledge. What does it take to power it for a day? Some
| vegetables and rice. Still fine for a while if you supply pure
| junk food -- it'll still perform.
|
| Clearly we have a long, long way to go in terms of the energy
| efficiency of AI approaches. Our so-called _neural_ nets clearly
| don't resemble the energy efficiency of actual biological
| neurons.
| Arch485 wrote:
| It's even less! A lot of those vegetables and rice go into
| powering your heart, muscles, organs, etc. and only a fraction
| is used for the brain.
|
| Maybe the future of AI is in organic neurons?
| jjmarr wrote:
| Food is extremely dense in energy. 1 food calorie is about 1.1
| Watt-hours. A hamburger is about 490 Wh. An AI model requires
| 0.047 kWh = 47 Wh to generate 1000 text responses.[1] If an LLM
| could convert hamburgers to energy, it could generate over
| 10000 prompt completions on a single hamburger.
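|
| Back-of-the-envelope check of that claim (the burger's calorie count
| is my assumption, chosen to match the ~490 Wh figure):
|
|     kcal_per_burger = 445            # assumed
|     wh_per_kcal = 1.1                # figure quoted above
|     wh_per_1000_responses = 47       # from the linked estimate
|     burger_wh = kcal_per_burger * wh_per_kcal          # ~490 Wh
|     print(1000 * burger_wh / wh_per_1000_responses)    # ~10,400 completions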
|
| Based on my own experience, I would struggle to generate that
| much text without fries and a drink.
|
| [1] https://www.theverge.com/24066646/ai-electricity-energy-
| watt...
| hdivider wrote:
| During that time, your brain would do _far_ more than just
| that text generation though, beyond what we even know
| scientifically.
|
| But yes, food energy could be useful for AI. A little
| dystopian potentially too, if you think about it. Like
| DARPA's EATR robot, able to run on plant biomass (although
| potentially animal biomass too, including human remains):
|
| https://en.wikipedia.org/wiki/Energetically_Autonomous_Tacti.
| ..
| jjmarr wrote:
| My point is that AI is more energy-efficient than a human
| doing the same language-generation task.
| Legend2440 wrote:
| This is more likely to be a hardware issue than an algorithms
| issue. The brain physically is a neural network, as opposed to
| a software simulation of one.
| lettergram wrote:
| In 2016 & 2017 my team at Capital One built several >1B parameter
| models combining LSTMs with a few other tricks.
|
| We were able to build generators that could replicate any dataset
| they were trained on, and would produce unique deviations, but
| match the statistical underpinnings of the original datasets.
|
| https://medium.com/capital-one-tech/why-you-dont-necessarily...
|
| We built several text generators for bots that similarly had very
| good results. The introduction of the transformer improved the
| speed and reduced the training / data requirements, but honestly
| the accuracy changed minimally.
| moi2388 wrote:
| Yes, and it's hardly surprising, since the Chinese room thought
| experiment is completely wrong; that is in fact exactly how you
| learn something.
| theanonymousone wrote:
| I remember that, the way I understood it, Transformers solved two
| major "issues" of RNNs that enabled the later boom: Vanishing
| gradients limiting the context (and model?) size and difficulty
| in parallelisation limiting the size of the training data.
|
| Do we have solutions for these two problems now?
| ebalit wrote:
| Transformers can also fetch at any moment any previous
| information that _becomes useful_.
|
| RNN are constantly updating and overwriting their memory. It
| means they need to be able to predict what is going to be
| useful in order to store it for later.
|
| This is a massive advantage for Transformers in interactive use
| cases like in ChatGPT. You give it context and ask questions in
| multiple turns. Which part of the context was important for a
| given question only becomes known later in the token sequence.
|
| To be more precise, I should say it's an advantage of
| Attention-based models, because there are also hybrid models
| successfully mixing both approaches, like Jamba.
| visarga wrote:
| You could theoretically run the input twice, allowing the
| model to correlate later tokens with previous ones. It would
| fix the problem with not knowing what information to retain.
| A more complicated approach would train the RNN to request
| replaying some earlier data when needed.
|
| A great thing about RNNs is they can easily fork the state
| and generate trees; this would make it possible to backtrack
| and work on combinatorial search problems.
|
| It's also easier to cache demonstrations for free in the
| initial state: a model that has seen lots of data is not using
| more memory than a model starting from scratch.
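|
| A hypothetical sketch of the state-forking point (names and the
| stand-in gated cell are made up, not from the paper): forking a
| search tree is just cloning the hidden-state tensor, and each branch
| then continues without re-encoding the shared prefix.
|
|     import torch
|
|     d = 256
|     w_prop = torch.nn.Linear(d, d)
|     w_mix = torch.nn.Linear(d, d)
|
|     def cell(h, x):                            # minGRU-style gated update
|         z = torch.sigmoid(w_mix(x))
|         return (1 - z) * w_prop(x) + z * h
|
|     h = torch.zeros(1, d)                      # state after some shared prefix
|     tok_a, tok_b = torch.randn(1, d), torch.randn(1, d)
|     branch_a = cell(h.clone(), tok_a)          # two branches from one cheap copy
|     branch_b = cell(h.clone(), tok_b)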
| imjonse wrote:
| Something like this?
|
| https://hazyresearch.stanford.edu/blog/2024-07-01-jrt
| visarga wrote:
| Yes, that's the paper.
| YeGoblynQueenne wrote:
| Vanishing (or exploding) gradients affected all deep
| architectures, not just RNNs. They were solved by LSTMs first
| proposed in 1997. See:
|
| https://www.semanticscholar.org/paper/Long-Short-Term-Memory...
|
| I find it interesting that this knowledge seems to be all but
| forgotten now. Back in the day, ca. 2014, LSTMs were all the
| rage, e.g. see:
|
| https://karpathy.github.io/2015/05/21/rnn-effectiveness/
|
| https://colah.github.io/posts/2015-08-Understanding-LSTMs/
| aDyslecticCrow wrote:
| LSTM and GRU did not quite solve the issue, but they made it
| less bad. Overall, recurrent units are notoriously prone to
| vanishing and exploding gradients.
|
| I don't want to downplay the value of these models. Some
| people seem to be under the perception that transformers
| replaced or made them obsolete, which is far from the truth.
| jszymborski wrote:
| > They were solved by LSTMs first proposed in 1997.
|
| I see this stuff everywhere online and it's often taught this
| way so I don't blame folks for repeating it, but I think it's
| likely promulgated by folks who don't train LSTMs with long
| contexts.
|
| LSTMs do add something like a "skip-connection" (before that
| term was a thing) which helps deal with the catastrophic
| vanishing gradients you get from e.g. Jordan RNNs right from
| the jump.
|
| However (!), while this stops us from seeing vanishing
| gradients after e.g. 10s or 100s of time-steps, when you
| start seeing multiple 1000s of tokens, the wheels start
| falling off. I saw this in my own research, training on amino
| acid sequences of 3,000 length led to a huge amount of
| instability. It was only after tokenizing the amino acid
| sequences (which was uncommon at the time) which got us down
| to ~1500 timesteps on average, did we start seeing stable
| losses at training. Check-out the ablation at [0].
|
| You can think of ResNets by analogy. ResNets didn't "solve"
| vanishing gradients, there's a practical limit of the depth
| of networks, but it did go a long way towards dealing with
| it.
|
| EDIT: I wanted to add, while I was trying to troubleshoot
| this for myself, it was super hard to find evidence online of
| why I was seeing instability. Everything pertaining to
| "vanishing gradients" and LSTMs were blog posts and pre-
| prints which just merrily repeated "LSTMs solve the problem
| of vanishing gradients". That made it hard for me, a junior
| PhD at the time, to suss out the fact that LSTMs do
| demonstrably and reliably suffer from vanishing gradients at
| longer contexts.
|
| [0] https://academic.oup.com/bioinformatics/article/38/16/395
| 8/6...
| jph00 wrote:
| Highway networks add a skip connection, but LSTMs don't.
| Btw you might be interested in truncated backprop thru
| time, which we introduced in our ULMFiT paper.
| jszymborski wrote:
| I was referring to how the context vectors help avoid
| vanishing gradients by behaving very similarly to skip-
| connections, but yes, they aren't skip-connections as-
| such. That's been my understanding, at least.
|
| We haven't tried truncated BPTT, but we certainly should.
|
| Funnily enough, we adopted AWD-LSTMs, Ranger21, and Mish
| in the paper I linked after I heard about them through
| the fast.ai community (we also trialled QRNNs for a bit
| too). fast.ai has been hugely influential in my work.
| twobitshifter wrote:
| Agreed, Ilya Sutskever himself has spent a long time with
| lstm and published papers like this one while working at
| Google. http://proceedings.mlr.press/v37/jozefowicz15.pdf
|
| Recent comments from him have said that any architecture can
| achieve transformer accuracy and recall, but we have devoted
| energy to refining transformers, due to the early successes.
| aDyslecticCrow wrote:
| From my (admittedly loose) reading, this paper
| particularly targets parallelization and fast training, not
| "vanishing gradients." However, by simplifying the recurrent
| units, they managed to improve both!
|
| This is very clever and very interesting. The paper
| continuously calls it a "decade-old architecture," but in
| practice, it's still used massively, thanks to its simplicity
| in adapting to different domains. Placing it as a "competitor"
| to transformers is also not quite fair, as transformers
| and RNNs are not mutually exclusive, and there are many methods
| that merge them.
|
| Improvement in RNNs is an improvement in a lot of other
| surprising places. A very interesting read.
| lccerina wrote:
| "Was all along a scheme by Google to sell more tensor processing
| units that didn't run RNNs well?"
| scotty79 wrote:
| The only strength of transformers is that they can run once for
| each token and they can pass to themselves intermediate state as
| they solve your problems. They have to conceal it in tokens that
| look to humans like a part of the response.
|
| It's obvious why the newest toy from openai can solve problems
| better mostly by just being allowed to "talk to itself" for a
| moment before starting the answer that the human sees.
|
| Given that, modern incarnation of RNN can be vastly cheaper than
| transformers provided that they can be trained.
|
| Convolutional neural networks get more visual understanding by
| "reusing" their capacity across the area of the image. RNN's and
| transformers can have better understanding of a given problem by
| "reusing" their capacity to learn and infer across time (across
| steps of iterative process really).
|
| When it comes to transformer architecture the attention is a red
| herring. It's just a more or less arbitrary way to partition the
| network so it can be parallelized. The only bit of potential
| magic is with "shortcut" links between non adjacent layers that
| help propagate learning back through many layers.
|
| Basically the optimal network is deep and dense (every neuron
| connects to all neurons in all preceding layers) and is run in
| some form of recurrence.
|
| But we don't have enough compute to train that. So we need to
| arbitrarily sever some connections so the whole thing is easier
| to parallelize. It really doesn't matter which, unless we do it
| in some obviously stupid way.
|
| The actual inventive magic of LLMs possibly happens in the token
| and positional encoders.
| tadala wrote:
| Everyone wants to use less compute to fit more in, but
| (obviously?) the solution will be to use more compute and fit
| less. Attention isn't (topologically) attentive enough. All these
| RNN-lite approaches are doomed (beyond saving costs); they're
| going to get cooked by some other arch--even more expensive than
| transformers.
| falcor84 wrote:
| Would you mind expanding upon your thesis? If that compute and
| all those parameters aren't "fitting" the training examples,
| what is it that the model is learning, and how should that be
| analyzed?
| ithkuil wrote:
| I think there are two distinct areas. One is the building of
| the representations, which is achieved by fitting. The other
| area is loosely defined as "computing" which is some kind of
| searching for a path through representation space. All of
| that is wrapped in a translation layer that can turn those
| representations into stuff we humans can understand and
| interact with. All of that is achieved to some extent by
| current transformer architectures, but I guess some believe
| that they are not quite as effective at the
| "computation/search" stage.
| falcor84 wrote:
| But how does it get good at "computing"? The way I see it,
| we either program them to do so manually, or we use ML, at
| which case the model "fits" the computation based on
| training examples or environmental feedback, no? What am I
| missing?
| ithkuil wrote:
| the distinction is fuzzy indeed, especially if anything
| that you "program in manually" has some parameters that
| are learned.
|
| Conceptually we already have parts of the model that are
| not learned: the architecture of the model itself.
| Sysreq2 wrote:
| Guys, I'm gonna stop this before it gets out of hand: All we need
| is love and a shit ton of compute.
|
| Everything else is just details.
___________________________________________________________________
(page generated 2024-10-04 23:01 UTC)