[HN Gopher] From deep to long learning?
       ___________________________________________________________________
        
       From deep to long learning?
        
       Author : headalgorithm
       Score  : 353 points
       Date   : 2023-04-09 12:41 UTC (10 hours ago)
        
 (HTM) web link (hazyresearch.stanford.edu)
 (TXT) w3m dump (hazyresearch.stanford.edu)
        
       | [deleted]
        
       | marshmallowmad wrote:
       | I don't quite understand why context length needs to keep
       | growing. It seems to me like many tasks (e.g. customizing your
       | LLM on your own data) would benefit from the model doing some
       | sort of sped up fine-tuning on any prompt that gets added. That
       | way we don't have to find all these hacks to find the most
       | relevant context and repeat the same info in prompts. Curious if
       | anyone has insight here as this has caused me some confusion
       | lately!
        
         | [deleted]
        
       | raphlinus wrote:
        | The thing that stuck out to me is the assertion that the FFT is
        | poorly supported on modern GPUs. That's surprising to me, as
        | there's cuFFT officially supported by Nvidia, and vkFFT that
        | achieves similar performance portably using compute shaders. I
        | believe these are based on f32 math, so perhaps the potential win
        | is using tensor cores to compute the FFT at lower precision? It
        | seems surprising to me that decomposing into matrix operations is
        | the win here; it seems you'd do better writing a kernel that
        | makes use of the cooperative matrix (aka WMMA, tensor core,
        | simd_matrix) capabilities of the GPU.
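        | 
        | To make "decomposing into matrix operations" concrete, the
        | textbook identity is just the DFT matrix (a minimal NumPy
        | sketch, not the paper's actual kernels; real implementations
        | factor F Cooley-Tukey style into small, tensor-core-friendly
        | matmuls rather than one dense product):
        | 
        |     import numpy as np
        | 
        |     # N-point DFT as a plain matrix multiply: X = F @ x,
        |     # with F[m, n] = exp(-2j * pi * m * n / N)
        |     N = 8
        |     rows, cols = np.meshgrid(np.arange(N), np.arange(N),
        |                              indexing="ij")
        |     F = np.exp(-2j * np.pi * rows * cols / N)
        | 
        |     x = np.random.randn(N)
        |     assert np.allclose(F @ x, np.fft.fft(x))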
        
         | svantana wrote:
         | Looking at the source paper, they are claiming a 2.2x speedup
         | over cuFFT for convolutions, so it's not an earth-shattering
         | gain, but still.
        
       | aqme28 wrote:
       | It seems to me like these long context lengths are going to make
       | a huge difference in the capabilities of things people are able
       | to produce. Agents can use LLMs to subdivide a large corpus into
       | manageable chunks. More context just makes them much much better
       | at that.
       | 
       | Regardless, for most programming tasks, I doubt my equivalent
       | human context length is any better than 32k tokens.
        
         | jerpint wrote:
          | The problem is not short-term context length, but long-term.
          | If you want to do things like long-term goal planning, keeping
          | track of distant past events can be of high value.
        
           | taneq wrote:
           | Maybe that's what our inner monologue is for... to cycle
           | relevant context through regularly so it stays in scope. I
           | mean, that's not actually why we have one but it'd be cute.
        
           | PaulHoule wrote:
            | I'd argue for planning that the answer is to couple the LLM
            | to some other system rather than try to improve the LLM.
            | Combinatorial optimization is a well-understood problem that
            | is NP-complete in theory but usually tractable in practice.
            | For doing math we might as well give the LLM a pocket
            | calculator, why not couple it to a planner, theorem prover
            | and similar tools?
        
             | pixl97 wrote:
             | This is pretty much what people are testing now with Auto-
             | GPT
        
             | sharemywin wrote:
              | That's the interesting thing: if LLMs start to just become
              | interfaces for other systems, you could probably just use
              | them to train smaller systems, in the style of Alpaca and
              | other recent LLMs.
        
             | pmoriarty wrote:
             | _" For doing math we might as well give the LLM a pocket
             | calculator, why not couple it to a planner, theorem prover
             | and similar tools?"_
             | 
             | ChatGPT has already been hooked up to Wolfram Alpha.[1]
             | 
             | For hooking up other things, see HuggingGPT and
             | TaskMatrix.AI.[2][3]
             | 
             | [1] - https://writings.stephenwolfram.com/2023/03/chatgpt-
             | gets-its...
             | 
             | [2] - https://arxiv.org/pdf/2303.17580.pdf
             | 
             | [3] - https://arxiv.org/pdf/2303.16434.pdf
        
             | btbuildem wrote:
             | Precisely. LLMs are the Perl of the future
        
         | PartiallyTyped wrote:
         | Our context is hierarchical, at multiple different resolutions,
         | so I don't think it's comparable.
        
           | aqme28 wrote:
           | I was trying to describe a hierarchy of LLMs as an agent. I
           | don't think that's a uniquely human ability.
        
           | btbuildem wrote:
           | I've had good results using that approach with LLMs for
           | domain-specific problem solving.
        
           | letitgo12345 wrote:
           | Good chance so is OAI's https://arxiv.org/abs/2110.13711
           | (paper has the long context lead at OAI as a co-author)
        
           | XorNot wrote:
            | I'd say actively trying to remember an event and "narrowing
            | it down" mentally is a process that feels suspiciously like
            | a dialogue with an LLM where you keep asking for the answer
            | to be improved.
        
         | NhanH wrote:
          | Human experts rely on the "chunking" effect for expertise,
          | which is mostly not part of the context length (working
          | memory). For a pseudo-analogy, each human is fine-tuned from
          | raw intelligence (by training and education). In that sense a
          | generic LLM probably can't beat us yet, so don't despair!
        
           | onos wrote:
           | Any job focused on fact retrieval is at risk.
        
             | RandomLensman wrote:
             | Need to know that the facts exist in the first place,
             | though.
        
               | orbifold wrote:
                | GPT-4 already has graduate-level math and physics
                | knowledge, with a level of recall that most students can
                | only dream of.
        
               | RandomLensman wrote:
                | Do those actually matter in jobs? The most valuable
                | facts I encounter are far more niche: often in someone's
                | head and not written down, or only stored privately
                | somewhere.
        
             | jimkoen wrote:
             | What kind of job is only focused on fact retrieval?
        
               | istjohn wrote:
                | My biggest blocker in web development is fact retrieval.
               | As someone who only dabbles, I don't struggle with how to
               | logically design my project, but I'm constantly
               | forgetting CSS and JS details like how to accomplish a
               | specific task with CSS flexbox or how to sort an array in
               | JS vs in Python. On old personal projects, I forget the
               | names of my own functions and whether they return a list
               | or a dict. Hell, I'll forget function signatures for
               | functions I just wrote. I forget external library and API
               | details. If I had perfect recall, I would 100x my web dev
               | productivity.
        
               | fnordpiglet wrote:
               | Jeopardy contestant
        
         | kmeisthax wrote:
         | The problem with comparing LLM and human context length is that
         | LLMs don't update their weights based on their input. The 32k
         | of context that they do have is their _only_ memory.
         | 
         | Humans have multiple layers of memory and can recall things and
         | concepts from years in the past - akin to millions of tokens'
         | worth of recall in an LLM. Yes, that memory is extremely lossy,
         | but it's there.
        
         | jacquesm wrote:
         | 'Superhuman' has many dimensions. It can be speed, it can be
         | ability to retain a lot of information, it can be the ability
         | to deal with a large quantity of information at once and many
         | other dimensions besides. For the longest time Chess was
         | considered a domain where computers could be dilettantes but
         | could never dominate. Chess is now in 'superhuman' territory
         | and likely we will never see a reversal because any insight
         | that benefits humans also benefits computers but not the other
         | way around.
         | 
          | The fact that this is such a multi-dimensional problem is
          | frequently overlooked in the debate about AI/AGI etc.; it may
          | not matter all that much if a non-AGI AI is already superhuman
          | on enough dimensions other than the ones that the 'but it
          | isn't AGI' crowd cling to. The consequences are what matter,
          | _not_ the fine print or the implementation details, and those
          | consequences are directly tied to the number of dimensions
          | along which a computer can beat humanity.
         | 
          | To give an example: if a chess program back then was 3000 Elo
          | but so slow that it would lose under competition rules, then
          | humans still dominated chess. Likewise, if it were only 2500
          | Elo but fast enough, it would still lose to the best humans.
          | But for a large fraction of society it would have already
          | moved into 'superhuman' territory. And a couple of
          | technological leaps later, we're _all_ looking at that AI as
          | if it has moved into superhuman regions.
         | 
          | This sort of thing will happen on many fronts, and all of
          | those fronts are moving. If enough of them go past the
          | threshold, then whether it is AGI or not is irrelevant, and
          | for every person that threshold sits at a different point.
          | Maybe a computer will be able to calculate faster and better
          | than you can, maybe it will be able to translate text faster
          | and better than you, maybe it will be able to organize
          | information faster and better than you. Saying at which point
          | it crosses the line into _thinking_ faster and better than you
          | is hard, but we can see that we are getting close to that line
          | without even knowing exactly where it is.
        
           | pmoriarty wrote:
           | _" Chess is now in 'superhuman' territory and likely we will
           | never see a reversal because any insight that benefits humans
           | also benefits computers but not the other way around."_
           | 
           | What is considered human is malleable. It is conceivable that
           | humans will be enhanced in various biological and non-
           | biological ways to a point that they can once again compete
           | with computers.
        
             | Mezzie wrote:
             | We may also just change the rules of chess.
             | 
             | "Chess" is a human created game after all.
        
               | evrimoztamur wrote:
                | Somebody actually tried with
                | https://en.m.wikipedia.org/wiki/Arimaa, but it didn't
                | take long before it was also figured out!
        
               | dmd wrote:
               | Look I found the guy who keeps moving the goalposts!
        
               | anthomtb wrote:
               | I think you're making a joke.
               | 
               | But moving goalposts is an integral part of all
               | professional sports. The 3 point line in basketball and
               | engine size and aspiration in motorsports are obvious
                | examples. I don't see how adjusting chess rules to
                | disfavor AI competitors is any different.
        
               | Mezzie wrote:
               | It's a pretty common way for humans to deal with not
               | being able to do something/something not working.
        
           | andrepd wrote:
           | > For the longest time Chess was considered a domain where
           | computers could be dilettantes but could never dominate
           | 
            | I don't think this was ever true. Chess programs appeared
            | EXTREMELY early on, and everyone recognised that it was a
           | matter of time until hardware was quick enough to evaluate so
           | many positions per second that grandmasters could be defeated
           | by sheer calculation.
        
             | jacquesm wrote:
             | I was playing chess pretty fanatically when Sargon came out
             | and that was _my_ impression as a computer person, but the
              | chess people around me really didn't think that computers
             | would ever beat the top GMs.
        
               | macintux wrote:
               | Most major advances are predicted by someone, and often
                | seem obvious in hindsight, but it seems like we often
                | conflate the two and remember them as having been
                | obvious before they happened.
        
               | jacquesm wrote:
                | The number of people working with computers back then
                | was but a handful, so that gave me a bit of a different
                | perspective, but I think that anybody who was both into
                | computers and into chess back then would have made the
                | same prediction. It still happened faster than I thought
                | it would.
        
           | ChatGTP wrote:
            | I'm curious, what is the point you're trying to convey?
        
           | jstanley wrote:
           | > if a chess program was 3000 Elo before but so slow that it
           | would lose under competition rules then humans still
           | dominated chess.
           | 
           | How are you working out that it has a 3000 Elo if it's not
           | winning games?
        
             | jacquesm wrote:
             | That's the way it is done right now: by playing it against
             | other software implementations and judging them in the
             | exact same way they would judge humans.
             | 
             | Stockfish has an Elo rating over 3500 in spite of no human
             | being even close to that.
        
           | skybrian wrote:
           | One dimension that I think is pretty important is compute
           | cost due to its effect on how they're used. The chatbots are
           | expensive to run, which means they're implemented as request-
            | response APIs that cost money, which means that loops are
           | expensive and they normally don't do any idle-time thinking.
           | 
           | When you play a turn-based game with a bot, that means you
           | don't need to worry about its reaction time. It's paused most
           | of the time, waiting on you. A sorcerer's apprentice scenario
           | isn't going to happen when you're single-stepping.
           | 
           | Moving to routine use of bots that run continuously with fast
           | reaction times will be much more dangerous.
        
             | jacquesm wrote:
              | Yes, that's a very good point. Effectively it is 'ping
              | pong' right now; when you get to always-on + push, things
              | will change quite a bit. Model efficiency is a very active
              | field.
        
           | m3kw9 wrote:
            | It's like a CPU, we've all seen that movie before: it's
            | superhuman at calculating numbers, but it's one-dimensional
            | nonetheless, like chess programs.
        
       | mark_l_watson wrote:
       | Interesting about Butterfly architecture for hardware FFT
       | support. In the 1980s, DARPA provided two types of exotic
       | hardware to the company I worked for: the first Connection
       | Machine, and the Butterfly machine. I wrote Star Lisp code for
       | the CM, but never touched the Butterfly machine.
       | 
       | Off topic, but I am curious what hardware Apple will release in
       | the future for more direct AI support. Their Core ML libraries
       | working with Apple Silicon have been very effective so far. The
        | next step would likely be a built-in foundation LLM, extending
        | what they have supported with BERT models, etc.
        
         | sroussey wrote:
         | I think it will be quite a few years before an LLM is built
         | into their silicon.
         | 
          | But I do see a Neural Engine 2.0 coming that will better
          | handle these things in the nearer term.
        
           | Sugimot0 wrote:
            | IIRC, creating new ML-optimized chips/chiplets was part of
            | the appeal of RISC-V, right? Is RISC-V relevant yet, or are
            | there any promising RISC-V chips on the way? I know there's
            | a lot of hype around them, so I'm curious how much is just
            | noise and what the real sentiment is from the
            | experts/industry.
        
             | [deleted]
        
         | cs702 wrote:
         | _> The next step would likely be a built in foundation LLM
         | model, extending what they have supported with BERT models,
         | etc._
         | 
         | I'm thinking deeper. It wouldn't surprise me if _self-
         | attention_ itself becomes a _primitive building block_ of
         | future co-processors, e.g., with instructions and memory
         | layouts engineered to make ultra-low-precision self-attention
          | as compute- and memory-efficient as possible. I'm expecting
         | LLMs with hundreds of billions and eventually trillions of
         | parameters will be able to run locally on my laptop and mobile
         | phone, in the not-too-distant future.[a]
         | 
         | [a] If this sounds far-fetched, consider that you can _already_
         | run LLMs with tens of billions of parameters on mobile phones:
         | https://justine.lol/mmap/
        
           | fpgaminer wrote:
           | > I'm expecting LLMs with hundreds of billions and eventually
           | trillions of parameters will be able to run locally on my
           | laptop and mobile phone, in the not-too-distant future
           | 
           | Perhaps. There's been a lot of focus on training-compute
           | optimal models in the industry. Rightfully so, as proofs of
           | concept. That's what led to this perceived parameter count
           | race in published models.
           | 
           | But remember the other side of the scaling laws. For
           | inference, which is what we want to do on our phones, it's
           | better to be inference-compute optimal. That means smaller
           | models trained for longer.
           | 
            | As far as we know today, there are no limits to the scaling
            | laws. A 1B-parameter model _can_ beat a 1T-parameter model
            | if trained for long enough. Of course it's exponential, so
            | you'd have to pour incalculable training resources into such
            | an extreme example. But I find these extreme examples
            | elucidating.
           | 
           | My pet theory these days is that we'll discover some way of
           | "simulating" multiple parameters from one stored parameter.
            | We know that training-compute-optimal models are extremely
            | over-parameterized. So it isn't the raw capacity of the
            | model that's important. It seems like, during training, the
            | extra degrees of freedom are what allow larger models to be
            | more sample-efficient. If we can find a cheap way of having
            | one parameter simulate multiple degrees of freedom, it will
            | likely give us the advantages of larger models during
            | training without the inference costs later.
           | 
           | I don't disagree that we're likely to see more and more
           | parameter capacity from our devices. I'm just pointing out
           | that the parameter count race is a bit of an illusion. OpenAI
           | discovered the scaling laws and needed a proof of concept. If
           | they could show AI reaching X threshold first, they could
           | capture the market. The fastest way to do that is to be
           | training-compute optimal. So they had to scale to 175B
           | parameters or more. Now that it's proven, and that there's a
            | market for such an AI, their and others' focus can be on
           | inference-optimal models which are smaller but just as smart.
        
             | cs702 wrote:
              | _> I don't disagree that we're likely to see more and more
             | parameter capacity from our devices. I'm just pointing out
             | that the parameter count race is a bit of an illusion.
             | OpenAI discovered the scaling laws and needed a proof of
             | concept. If they could show AI reaching X threshold first,
             | they could capture the market. The fastest way to do that
             | is to be training-compute optimal. So they had to scale to
             | 175B parameters or more. Now that it's proven, and that
              | there's a market for such an AI, their and others' focus
             | can be on inference-optimal models which are smaller but
             | just as smart._
             | 
             | Good point. That could very well be what they're thinking
             | about, in addition to potential improvements in training
             | data and RLHF methods.
             | 
             | Also, I agree it would be great if anyone figures out how
             | to do something akin to "making a smaller model act as if
             | it were gigantic during training" OR "pruning a gigantic
             | model's 'dead paths' as it learns during training," to get
             | the benefits of scale in training without its costs at
             | inference.
        
       | skepticATX wrote:
       | Can someone with a better understanding than I have comment about
       | the relationship between these results and this paper:
       | https://arxiv.org/abs/2109.09115, which seems to demonstrate that
       | a longer context length has diminishing returns?
        
         | rsfern wrote:
         | I'm not deeply familiar with all these papers, but two things
         | stand out to me
         | 
          | The model architectures are different, and in the very latest
          | paper they scale these non-transformer models to a sequence
          | length of 64k, whereas the paper you linked only considers up
          | to 8k.
        
       | AvAn12 wrote:
       | With longer (50k+) context lengths, is this just becoming a new
       | form of search?
        
         | intalentive wrote:
         | Yes, now we just need it to provide citations.
        
         | jmole wrote:
         | I think the K,Q,V representation was what fundamentally gave
         | rise to LLMs (from "Attention is all you need"), and I'm
         | certain that it wouldn't have happened without the researchers
         | having a background in search @ Google.
         | 
         | Or in other words, it was always a new form of search.
        
           | bckr wrote:
           | [Astronaut looking at earth with the logo of superhuman AI
           | superimposed]
           | 
           | Wait, it's all just search?
           | 
           | [astronaut with gun]
           | 
           | Always has been
        
       | logophobia wrote:
        | I've successfully applied the S4 operator to long-sequence video
        | classification. It's massively more efficient than a similarly
        | scaled transformer, but it doesn't train as well. Still, even
        | with S4 I got some impressive results; looking forward to more.
        
       | Buttons840 wrote:
        | If I want to do sequence modelling -- let's say, predict the 9th
        | element of the following sequence: [1, 2, 3, 4, 5, 6, 7, 8, 9]
        | -- that is, I know the 8 most recent tokens as my "context", and
        | I want to predict the next one, 9 in this case --
        | 
        | Can someone explain to me why a transformer or RNN is better at
        | this than a simple linear layer with an equivalent number of
        | parameters? A linear layer can receive the context [1, 2, 3, 4,
        | 5, 6, 7, 8], properly one-hot encoded / embedded, etc., and
        | predict the next token. Can a linear layer do just as well as a
        | transformer? This setup allows linear layers to predict
        | sequences with an arbitrary context size, so why so much hype
        | about transformers and RNNs and other sequence-focused
        | architectures?
       | 
       | Perhaps the difference is that given the same number of
       | parameters, the transformer uses those parameters to perform easy
       | computations whereas the linear layer just does one gigantic
       | matrix multiplication which isn't very efficient?
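        | 
        | Concretely, the linear baseline I have in mind is something
        | like this (a rough PyTorch sketch; the sizes are made up):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     vocab, ctx, d = 1000, 8, 64   # made-up sizes
        | 
        |     # Embed each of the ctx context tokens, flatten, and map
        |     # straight to vocab logits. Note the big matrix is
        |     # (ctx * d) x vocab, so parameters grow with context size.
        |     linear_lm = nn.Sequential(
        |         nn.Embedding(vocab, d),
        |         nn.Flatten(),              # (B, ctx, d) -> (B, ctx*d)
        |         nn.Linear(ctx * d, vocab), # logits for the next token
        |     )
        | 
        |     tokens = torch.randint(0, vocab, (1, ctx))
        |     logits = linear_lm(tokens)     # shape (1, vocab)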
        
         | QuadmasterXLII wrote:
         | The big difference is that the transformer is (approximately)
         | permutation equivariant, which makes a massive difference in
         | generalization and training speed.
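          | 
          | (It's exactly equivariant if you drop the position encodings
          | and any masking -- a minimal single-head NumPy sketch of that
          | case:)
          | 
          |     import numpy as np
          | 
          |     def attn(x, wq, wk, wv):
          |         # x: (seq, d); no positions, no mask
          |         q, k, v = x @ wq, x @ wk, x @ wv
          |         s = q @ k.T / np.sqrt(x.shape[1])
          |         a = np.exp(s - s.max(-1, keepdims=True))
          |         a /= a.sum(-1, keepdims=True)
          |         return a @ v
          | 
          |     rng = np.random.default_rng(0)
          |     x = rng.normal(size=(8, 16))
          |     wq, wk, wv = rng.normal(size=(3, 16, 16))
          |     perm = rng.permutation(8)
          | 
          |     # permuting the inputs permutes the outputs identically
          |     assert np.allclose(attn(x[perm], wq, wk, wv),
          |                        attn(x, wq, wk, wv)[perm])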
        
           | Buttons840 wrote:
           | I see, so [1, 2, 3, 4, 5, 6, 7, 8] is more similar to [5, 3,
           | 1, 7, 8, 2, 4, 6] than with a linear layer? That's what you
           | mean by permutation equivariant?
           | 
           | I understand each context input is embedded with its
           | position, but I suppose the transformer can learn to ignore
           | the position and just look at the context as an unordered
           | set?
        
           | eachro wrote:
           | What makes it approximately permutation equivariant (vs
           | entirely)? As I understand things, if the order is jumbled,
           | the attention matrix does get its rows and cols permuted in
           | the way you'd expect so I'd have thought they'd be entirely
           | permutation equivariant.
        
             | YetAnotherNick wrote:
             | Inputs have position encoding in them.
        
       | 1024core wrote:
       | While the race to incorporate longer and longer context (2K PaLM
       | -> 32K now) is interesting, I don't think that'll scale. It'll
        | just add too much noise to the history: how do you establish a
        | causal relationship between what you're holding in your hand
        | versus the million other things (context) that you've seen in
        | the past? You'll end up with spurious correlations.
       | 
       | What I think (and this is just me talking out of my ass) will be
       | required is some form of associative long-term memory. Basically,
       | give the model a way to store some embeddings in some form of
       | memory, and then retrieve them based on context: so it doesn't
       | matter if you encountered that item 2 tokens ago, or 2B.
       | 
       | At least this is what my current intuition tells me.
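        | 
        | Something like this, roughly (a toy sketch of "store embeddings,
        | retrieve by similarity"; all names and the interface are made
        | up):
        | 
        |     import numpy as np
        | 
        |     class AssociativeMemory:
        |         def __init__(self, d):
        |             self.keys = np.empty((0, d))
        |             self.values = []
        | 
        |         def write(self, key, value):
        |             k = key / np.linalg.norm(key)
        |             self.keys = np.vstack([self.keys, k])
        |             self.values.append(value)
        | 
        |         def read(self, query, top_k=3):
        |             q = query / np.linalg.norm(query)
        |             sims = self.keys @ q      # cosine similarity
        |             best = np.argsort(-sims)[:top_k]
        |             # retrieved regardless of whether the item was
        |             # written 2 steps ago or 2B steps ago
        |             return [self.values[i] for i in best]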
        
         | lucidrains wrote:
          | That line of research is still going:
          | https://github.com/lucidrains/block-recurrent-transformer-py...
          | I think it is worth continuing research on both fronts.
        
           | 1024core wrote:
           | Of course, I'm not saying "don't do research". I'm just
           | saying that I don't think this context-length war will lead
           | us to long-term sustainable gains.
        
             | [deleted]
        
         | nathias wrote:
          | This, plus evaluation based on context that lets it mutate
          | past content, and we are set.
        
         | skybrian wrote:
         | On the other hand, it seems like training on large amounts of
         | text for next-token prediction would tend to reduce reliance on
         | spurious correlations? I don't think this intuitive sort of
         | speculation can predict what it will do.
        
       | edulix wrote:
        | Instead of long learning or long contexts, at some point
        | artificial neural networks will have to transition to
        | continuous/online learning - learning while the network is being
        | used. That way, these limitations are broken the same way they
        | are in our minds.
       | 
       | Similar to what Numenta HTM networks do, but scalable and
       | performant for real use cases.
       | 
        | BTW, perhaps human-like consciousness emerges as a "self-
        | attention-like" mechanism between context and learning. Just
        | saying.
        
         | qumpis wrote:
         | Learn how? I think having infinite context is perfect - no need
         | to learn on my data online and risk exposing it to others.
        
         | thomasahle wrote:
         | Alternatively we need the model to have a long term memory, and
         | be able to load stuff to/from that while reading.
        
       | [deleted]
        
       | jmole wrote:
       | Oddly enough, I was reading their paper just last night:
       | https://arxiv.org/pdf/2302.10866.pdf
       | 
       | I think we're going to see a lot more in the
       | wavelet/convolution/fft space when thinking about how to increase
       | context length.
       | 
        | I think there's also a lot of room for innovation in the
        | positional encoding and how it's represented in transformer
        | models; it seems like people have been trying lots of things and
        | going with what works, but most of it is like: "look, a new
        | orthonormal basis!"
        | 
        | Hyena sort of seems like the first step in moving to positional
        | _embeddings_ (or joint positional/attentional embeddings).
       | 
       | Very cool work.
        
         | cs702 wrote:
         | I agree this sort of approach looks promising. Maybe using FFTs
         | recurrently to approximate convolutions with input-length
         | filters is the way forward. It's a clever idea. I'm making my
         | way through the paper. Don't fully understand it yet.
         | 
         | The main issue I've seen with other wannabe-sub-quadratic-
         | replacements for self-attention is that they all rely on some
         | kind of low-rank/sparse approximation that in practice renders
         | LLMs incapable of modeling enough pairwise relationships
         | between tokens to achieve state-of-the-art performance.
         | 
         | I'm curious to see if this kind of approach solves the issue.
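          | 
          | For reference, the basic trick is just the convolution
          | theorem: a filter as long as the input applied in O(n log n)
          | via FFT instead of O(n^2) directly (a minimal NumPy sketch,
          | not the paper's implementation):
          | 
          |     import numpy as np
          | 
          |     n = 2 ** 12
          |     u = np.random.randn(n)   # input sequence
          |     k = np.random.randn(n)   # filter as long as the input
          | 
          |     # pad to 2n to avoid circular wrap-around, multiply in
          |     # frequency space, transform back, keep the causal part
          |     y = np.fft.irfft(np.fft.rfft(u, 2 * n) *
          |                      np.fft.rfft(k, 2 * n), 2 * n)[:n]
          | 
          |     # same thing computed directly in O(n^2)
          |     assert np.allclose(y, np.convolve(u, k)[:n])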
        
       | imustachyou wrote:
       | S4 and its class of state-space models are an impressive
       | mathematical and signal-processing innovation, and I thought it
       | was awesome how they destroyed previous baselines for long-range
       | tasks.
       | 
       | Have there been any state-space models adapted for arbitrary text
       | generation?
       | 
       | Language models like ChatGPT are trained to predict new words
       | based on the previous ones and are excellent for generation, a
       | harder task than translation or classification. I'm doubtful
        | about the adaptability of text models that deal with fixed-size
        | inputs/outputs and don't have an architecture that is as natural
        | a fit for generating indefinitely long sequences.
        
         | sdenton4 wrote:
          | Go read about S4, from these authors. It's about having a
          | learnable state-space model which can be efficiently
          | implemented as either an RNN or a (very long) convolution,
          | according to the needs of training or inference.
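          | 
          | The duality in a nutshell (a toy sketch of a linear
          | state-space layer; real S4 adds careful parameterization and
          | computes the long convolution with an FFT):
          | 
          |     import numpy as np
          | 
          |     d, seq = 4, 16
          |     rng = np.random.default_rng(0)
          |     A = rng.normal(size=(d, d)) * 0.1   # toy SSM params
          |     B = rng.normal(size=(d,))
          |     C = rng.normal(size=(d,))
          |     u = rng.normal(size=(seq,))
          | 
          |     # recurrent view: O(1) state per step, good for inference
          |     x, y_rnn = np.zeros(d), []
          |     for t in range(seq):
          |         x = A @ x + B * u[t]
          |         y_rnn.append(C @ x)
          | 
          |     # convolutional view: kernel K = [CB, CAB, CA^2B, ...],
          |     # parallelizable over the sequence, good for training
          |     K = np.array([C @ np.linalg.matrix_power(A, i) @ B
          |                   for i in range(seq)])
          |     y_conv = np.convolve(u, K)[:seq]
          | 
          |     assert np.allclose(y_rnn, y_conv)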
        
           | Buttons840 wrote:
           | Do these scale as well as transformers? My understanding is
           | that classic RNNs don't scale well, and that is one reason
           | why transformers became popular.
           | 
           | As a pleb who doesn't even own a data center, I've been
           | hoping that a superior machine learning architecture will be
           | discovered that doesn't scale well. We would be fortunate if
           | our personal computers end up being half as good as
           | Microsoft's or Amazon's best models; fortunate if the best
           | architecture gains little from an additional 10,000 GPUs.
           | This would help spread the benefits of AI evenly among anyone
           | with a phone or computer -- a utopia compared to the other
           | possibility, that everyone can learn how to build AI, but
           | only those with a few hundred million to throw at a data
           | center can actually control the means of production -- err, I
           | mean, the means of intelligence.
           | 
           | Philosophically, this wouldn't be unlike people. Humans are
           | still the greatest intelligence we're aware of, and humans
           | don't scale. I'm hoping computer intelligence ends up not
           | scaling well either.
        
             | sdenton4 wrote:
             | That's the point of having multiple realizations of the
             | same underlying model.
             | 
             | The (depthwise) convolutional realization is extremely
             | efficient for training, and the RNN is extremely efficient
              | for inference. The scaling in both cases is much better
              | than that of attention layers - as they discuss in the
              | article.
        
       | 3327 wrote:
       | [dead]
        
       | pmontra wrote:
        | Is this the way we work? We are told a fact only a few times and
        | we remember it for life, no 32k or 32M context.
        | 
        | I think that they are following the easy path, much like the
        | gigahertz race in CPUs, and will hit a wall. Maybe the wall will
        | be so far away that it will give us an AGI, but maybe it will
        | give us superhuman machines only in well-defined contexts. We'll
        | have to squeeze our instructions into a prompt too small for
        | some tasks and get a bot behaving like the main character of the
        | movie Memento (he remembered only the last few minutes and very
        | old memories).
        
         | [deleted]
        
       | Ozzie_osman wrote:
       | There will be those of us that understand how all these models
       | work, and there will be those of us that simply use them.
        
         | istjohn wrote:
         | That's true of just about any technology. Most carpenters would
         | fail to explain why their hammer drives a nail into wood
         | instead of bouncing off, why it has a handle that extends past
         | the hand's grip, why the head of the hammer neither shatters
         | nor mushrooms over time, to say nothing of their nail guns and
         | circular saws.
        
       | Herval_freire wrote:
        | Someone in a previous comment on LLM research said that,
        | according to what he knew, we were at a local maximum and no
        | further improvement was likely.
        | 
        | I disagreed with him, and this article is evidence in favor of
        | my point. If research like this continues to move forward, LLMs
        | will improve at a rapid rate.
        | 
        | Different threads attract different groups of people with
        | different areas of expertise, so I will reiterate the topic here
        | as I'm interested: what are most people's thoughts on this
        | "local maximum" claim? Have we actually hit a dead end,
        | especially given the proliferation of effort towards producing
        | research like the work shown here?
        
         | ChatGTP wrote:
          | I mean, I just read that article and it doesn't seem like a
          | lot will change. Sure, it can summarize a whole book, or read
          | a larger chunk of code to do things with, but I didn't really
          | see it talk about taking things to "the next level", so to
          | speak.
          | 
          | The researchers are also excited:
         | 
         |  _We're especially motivated by applications that could benefit
         | from longer-sequence models - high-resolution imaging, new
         | modalities of data, language models that can read entire books.
         | Imagine giving a language model an entire book and having it
         | summarize the plot, or conditioning a code generation model on
         | all the code you've ever written. The possibilities are wild -
         | and we're excited._
        
           | Herval_freire wrote:
            | But this research came out mere weeks after the release of
            | GPT-4. That is in itself rapid. If small incremental changes
            | like this continue on a sort of monthly basis, the trendline
            | points towards something that's not a dead end. That's my
            | view of it.
            | 
            | As with most technology, there isn't necessarily a constant
            | influx of inflection points and paradigm shifts. Improvement
            | will likely creep up on us incrementally. Suddenly one day
            | it's clearly more intelligent than a human, and we can't
            | point to when it happened.
        
             | [deleted]
        
             | jamilton wrote:
             | GPT-4's release doesn't seem like the relevant time marker,
             | since nothing in the article builds on it or depends on it.
             | The paper for H3 was submitted in December 2022.
             | 
              | The pace of the last few years definitely seems rapid; I
              | just don't want there to be a false impression.
        
         | intalentive wrote:
         | Once you exhaust the dataset of all written language, the next
         | step is multi-modal -- images, audio, video. What will next-
         | token prediction give us on such a dataset? Better versions of
         | what we have now -- style transfer, summarization, captioning,
         | translation, prompt-based generation, etc., but with
         | synesthesia.
         | 
         | There is still plenty of improvement ahead but I don't think
         | anything genuinely surprising will come from the current regime
         | of feedforward models. What is missing is action -- an
         | interactive feedback loop between agent and environment.
         | Progress in RL and robotics has been very slow by comparison
         | and unless we see a breakthrough there, I would guess the GPT
         | phase plateaus in the next 5-10 years.
        
           | skybrian wrote:
           | I expect progress will be much less predictable. Some kinds
           | of action look pretty easy; it depends on the domain.
           | 
           | For example, I expect skill at writing some kinds of code to
           | improve dramatically because running tests in a sandbox looks
           | easy. It's already being researched. [1] Extending that to
           | device drivers might be a bit harder. Fuzzing is already
           | mostly automated and smarter fuzzing could get pretty scary.
           | 
           | [1] https://nanothoughts.substack.com/p/reflecting-on-
           | reflexion
        
             | Salgat wrote:
              | Basically, if the knowledge exists online in a way that
              | can be pieced together in a straightforward manner, GPT
              | will figure it out, but for information that requires
              | experimentation and creating new information to derive
              | results, it won't be of much use. For example, GPT can't
              | iteratively try different programming techniques to speed
              | up a block of parallelizable code; it'll simply give you
              | the best guess it can find off Google.
        
               | skybrian wrote:
               | It won't iterate on its own, but you can do it. You can
               | ask it for a list of things to try, and they will be
               | different alternatives. You can also tell it the result
               | of an experiment and it will often figure out what to
               | fix.
               | 
               | If you follow the link I shared, some researchers
               | automated asking GPT4 to write tests, running the tests
               | in a sandbox, and feeding the results back in.
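                | 
                | Roughly this shape of loop (a toy sketch; `ask` and
                | `run_tests` are placeholders for a real model call
                | and a sandboxed test runner):
                | 
                |     def refine(ask, run_tests, task, rounds=3):
                |         prompt = f"Write code for: {task}"
                |         code = ""
                |         for _ in range(rounds):
                |             code = ask(prompt)
                |             ok, log = run_tests(code)
                |             if ok:
                |                 break
                |             # feed the failure back in and retry
                |             prompt += ("\nYour code:\n" + code +
                |                        "\nTests failed:\n" + log +
                |                        "\nPlease fix it.")
                |         return code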
        
           | UncleEntity wrote:
           | > What will next-token prediction give us on such a dataset?
           | 
           | Haven't we pretty much figured out these things are doing
           | more than just predicting the next token at this point?
           | 
            | There's probably a lot to be done with a "prediction
            | machine"; birds aren't all that smart but can catch bugs in
            | midair.
        
             | skybrian wrote:
             | People keep underestimating what next-token prediction can
              | do, but they're not wrong that it's how LLMs work.
             | 
             | It's actually a good question: what will next-token
             | prediction be able to do on new datasets? The error is
             | thinking you can answer it, even in broad terms.
        
       | cs702 wrote:
        | This looks really interesting! If these guys succeed in bringing
        | self-attention's computational cost down from O(n^2) to O(n log
        | n), that would be a huge win. The quadratic cost makes it very
        | difficult to increase sequence length on current hardware. I'm
        | going to take a closer look.
       | 
       | There are other interesting ongoing efforts to increase sequence
       | length. One that has worked for me is this dynamic routing
       | algorithm, related to self-attention, that can handle sequences
       | with 1M+ tokens in a single GPU:
       | https://github.com/glassroom/heinsen_routing . Right now, you can
       | take 1,000 sequences of hidden states computed by a pretrained
       | transformer, each sequence with, say, 1024 tokens, concatenate
       | them into a single ultra-long sequence with 1,024,000 hidden
       | states, slap 1,024,000 position encodings on top, and feed the
       | whole thing to that routing algorithm to predict the next token
       | (or whatever other training objective you want to optimize for).
       | It works. Search the README for "Very Long Sequences".
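        | 
        | The preprocessing step looks roughly like this (a sketch; sizes
        | are made up, sinusoidal encodings are just one choice, and the
        | routing module itself comes from the linked repo and isn't
        | reproduced here):
        | 
        |     import torch
        | 
        |     n_seqs, seq_len, d = 1000, 1024, 32    # made-up sizes
        |     hidden = torch.randn(n_seqs, seq_len, d)
        |     x = hidden.reshape(1, n_seqs * seq_len, d)
        | 
        |     # standard sinusoidal encodings over the full 1,024,000
        |     # positions, added on top of the concatenated states
        |     pos = torch.arange(n_seqs * seq_len).unsqueeze(1)
        |     i = torch.arange(0, d, 2)
        |     angles = pos / (10000.0 ** (i / d))
        |     pe = torch.zeros(1, n_seqs * seq_len, d)
        |     pe[..., 0::2] = torch.sin(angles)
        |     pe[..., 1::2] = torch.cos(angles)
        | 
        |     x = x + pe   # ready for the routing algorithm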
       | 
        | If anyone here has other suggestions for working with long
        | sequences (hundreds of thousands to millions of tokens), _I'd
        | love to learn about them_.
        
         | [deleted]
        
         | og_kalu wrote:
          | There are already linear attention advances. GPT-4-32k is
          | almost certainly using some form of FlashAttention.
          | 
          | Attention isn't really O(n^2) anymore.
        
           | cs702 wrote:
           | My understanding is that FlashAttention's memory use is
           | linear, or close to linear in practice, but computation is
            | still O(n^2). I'm unaware of anyone being able to apply
           | FlashAttention on, say, a million tokens, because it must
           | execute ~1/2 x 1,000,000^2 x n_head dot-products, each in a
           | subspace with d_head dimensions. That's not exactly
           | computationally cheap!
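            | 
            | Rough arithmetic (scores only, with made-up head sizes):
            | 
            |     # FLOPs for the QK^T scores of exact attention
            |     n, n_head, d_head = 1_000_000, 16, 64
            |     dots = 0.5 * n * n * n_head   # ~n^2/2 causal dots
            |     flops = dots * 2 * d_head     # mul + add per dim
            |     print(f"{flops:.1e}")         # ~1.0e+15 per layer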
        
             | og_kalu wrote:
             | No you're right. I mistook you. Compute isn't linear yet.
        
           | lucidrains wrote:
            | It is only linear in terms of memory, not compute. Flash
            | attention is a big advance, but not enough for 1 million
            | tokens.
        
         | [deleted]
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-04-09 23:00 UTC)