[HN Gopher] Scaling Transformers to 1B Tokens
___________________________________________________________________
Scaling Transformers to 1B Tokens
Author : mottiden
Score : 177 points
Date : 2023-07-06 12:28 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| kytazo wrote:
| Is assuming the sequence length is directly correlated to the
| context window a meaningful thought?
|
| Does this imply similar increases in context in practice?
| gamegoblin wrote:
| The benefit of "traditional" O(N^2) transformer attention is you
| correlate every token to every other token. So, in the limit,
| your network won't "miss" much.
|
| When you abandon O(N^2) attention, you are forced to start adding
| heuristics to choose what to correlate. Any time you see one of
| those giant context window LLMs, you need to be asking what
| heuristics they added, what is getting correlated, and what is
| _not_ getting correlated.
|
| This paper chooses an exponential heuristic where tokens further
| in the past get exponentially less attention. This heuristic is
| fine for certain tasks like responding in a chat room, where the
| most recent tokens are the most important, but bad for tasks
| where tokens are roughly equally important throughout the text,
| such as a dense academic paper or a reference manual.
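|
| To make that concrete, here's a toy single-head sketch (mine,
| not the paper's actual mechanism) of a decay-by-distance bias:
| a penalty proportional to token distance is subtracted from the
| attention logits, which becomes an exponential decay on the
| attention weights after the softmax. The decay rate alpha is an
| illustrative, hand-chosen knob.
|
|     import torch
|
|     def decayed_attention(q, k, v, alpha=0.01):
|         # q, k, v: (seq_len, dim) for a single head.
|         seq_len, dim = q.shape
|         scores = (q @ k.T) / dim ** 0.5
|         pos = torch.arange(seq_len)
|         dist = (pos[:, None] - pos[None, :]).abs()
|         # Subtracting alpha*|i-j| from the logits multiplies the
|         # softmax weights by exp(-alpha*|i-j|): exponential decay.
|         scores = scores - alpha * dist
|         causal = torch.triu(torch.ones(seq_len, seq_len,
|                                        dtype=torch.bool), diagonal=1)
|         scores = scores.masked_fill(causal, float("-inf"))
|         return torch.softmax(scores, dim=-1) @ v
|
| The point is that alpha is hard-coded: nothing in the data
| decides how quickly far-away tokens stop mattering.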
|
| The bitter lesson [1] is going to eventually come for all of
| these. Eventually we'll figure out how to machine-learn the
| heuristic rather than hard code it. Recurrent neural networks
| (RNNs) do this implicitly, but we don't yet know how to
| effectively train RNNs on ultra-deep sequences.
|
| Another possibility is learning a heuristic for non-recurrent
| LLMs via reinforcement learning, such as in [2], which is
| basically a reinforcement learned "auto-researcher" that was
| trained in a style reminiscent of AlphaGo.
|
| [1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
|
| [2] https://arxiv.org/pdf/2109.00527.pdf
| CuriouslyC wrote:
| It seems like building a context tree with a convex branch
| cross-attention estimator, then using branch and bound to prune
| the tree while descending (computing exact cross attention
| whenever the estimate is above a threshold), would work pretty
| well, assuming the cross-attention matrix really is very sparse
| and the difficulty is just accurately guessing the non-sparse
| elements.
| zzzzzzzza wrote:
| this sounds to me like a dollar cost averaging strategy -
| only buy in when the current price falls below an n-day
| moving average.
|
| I doubt there is any risk-adjusted alpha to the strategy - in
| practice it's my (newbie) understanding that the only thing
| that differentiates such strategies in the broader scheme of
| things is tax efficiency.
|
| however I am also not a ML expert
| phillipcarter wrote:
| This comment makes so much sense relative to what I've seen
| with Claude's 1M context window. It reliably fails at a task
| when the prompt just stuffs a big blob of data into the middle
| as context. But when I use embeddings to select only a small,
| relevant subset of that data, it always passes the task.
| gamegoblin wrote:
| Yes, Claude 1M is using all sorts of approximation tricks to
| get that 1M context window. IMO this is actually quite
| deceptive marketing.
| l1n wrote:
| Claude's context is 100K not 1M [1]. If you're somehow
| shoving in a million tokens that could explain the issue
| you're having!
|
| [1] https://www.anthropic.com/index/100k-context-windows
| gamegoblin wrote:
| Misremembered; the main thrust of the comment still stands.
| The 100K context window isn't "real": it would be absurdly
| expensive to do it for real. They are using a lot of
| approximation tricks to get there.
| dpflan wrote:
| Yes, that's the point now for competing for-profit AI research
| companies: whatever metric is technical and sounds important is
| going to be used in marketing and in valuation determinations.
| It will be explored for research, I'm sure, and then its
| product viability will be determined. It's nice competition,
| but I agree that it can be deceptive.
| im3w1l wrote:
| These models seem to be able to cope with absolutely massive
| training sets, whereas the prompt input has to be quite small
| in comparison.
|
| I wonder if one could leverage this state of affairs by
| shifting the prompt from input to training data: take a generic
| model and run a little bit of fine-tuning on the prompt.
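|
| A minimal sketch of what I mean, assuming a causal LM where
| model(ids) returns next-token logits (that interface, the step
| count, and the learning rate are just placeholders):
|
|     import torch
|     import torch.nn.functional as F
|
|     def absorb_prompt(model, prompt_ids, steps=20, lr=1e-5):
|         # Instead of stuffing the long document into the context
|         # window, take a few gradient steps on it so the weights
|         # "memorize" it before answering questions about it.
|         opt = torch.optim.AdamW(model.parameters(), lr=lr)
|         for _ in range(steps):
|             logits = model(prompt_ids[:-1])      # (T-1, vocab)
|             loss = F.cross_entropy(logits, prompt_ids[1:])
|             loss.backward()
|             opt.step()
|             opt.zero_grad()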
| hotstickyballs wrote:
| Recurrent networks are exponential too: they blow up and decay
| exponentially. So this is not necessarily worse than RNNs.
| novok wrote:
| One clever trick I've seen: before text goes out of the
| context window, it gets summarized by the LLM, and that
| smaller summary is put back into the context window and
| continuously updated. It also reminds me of how human memory
| works.
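|
| The bookkeeping is roughly this (a sketch of the idea, not any
| particular product; summarize() and count_tokens() stand in for
| a call back into the LLM and a tokenizer, respectively):
|
|     def rolling_context(chunks, summarize, count_tokens,
|                         max_ctx=4096):
|         summary, window = "", []
|         for chunk in chunks:
|             window.append(chunk)
|             # When the live window plus the running summary
|             # overflow the budget, fold the oldest chunk into
|             # the summary instead of dropping it.
|             while (count_tokens(summary) +
|                    sum(count_tokens(c) for c in window)) > max_ctx:
|                 summary = summarize(summary + "\n" + window.pop(0))
|         return summary, window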
| mochomocha wrote:
| While I agree with the beginning of your post, you lost me
| here:
|
| > The bitter lesson [1] is going to eventually come for all of
| these. Eventually we'll figure out how to machine-learn the
| heuristic rather than hard code it.
|
| Inefficiently re-learning over and over patterns that can be
| more explicitly encoded as smart inductive biases for better
| sample efficiency is what ML research is.
|
| The "bitter lesson" doesn't mean "throw the towel and your
| brain and just buy more GPUs". It means that the inductive
| biases / modeling strategies that win will always be the ones
| that are more hardware-friendly.
| gamegoblin wrote:
| I agree with you that learning certain things is wasteful.
|
| For instance, one could imagine an RNN that learned to do some
| approximation of tree search for playing games like Chess and
| Go. But we have _very_ good reason to think that tree search
| is basically exactly what you want, so even systems like
| AlphaGo have the tree search implemented outside the neural
| net, while still using a learned system to heuristically guide
| the tree search.
|
| The reference to the bitter lesson here is that feature
| engineering has, thus far, typically lost out to more general
| end-to-end methods in the long run.
|
| This paper tries to do feature engineering by hand-coding an
| exponentially decaying mechanism, where tokens further in the
| past are assumed to be less important.
|
| My comment is that this type of hand-engineering will lose
| out to methods that are more end-to-end learned. These
| methods do not necessarily need to be hugely computationally
| intensive ("buy more GPUs").
|
| That said, I could see it being the case that in the short
| term, we do just buy more GPUs, learn a general end-to-end
| algorithm, but eventually figure out how to re-implement that
| end-to-end learned algorithm in code significantly more
| efficiently.
| og_kalu wrote:
| By and large, we don't really know what inductive biases we
| ought to be shoving in to models. Sometimes we think we do,
| but we're wrong more often than not. So methods with the
| least inductive biases work better.
| littlestymaar wrote:
| Not disagreeing with your comment in general, but this
| particular sentence annoys me a bit:
|
| > where tokens are roughly equally important throughout the
| text, such as a dense academic paper or a reference manual.
|
| Even in these, not all tokens are equal. Most of a text is
| actually pretty low-information, with key packs of tokens that
| contain most of the information you're going to need throughout
| the entire text (that's why we use highlighters when learning).
| That's also why O(n^2) attention is pretty wasteful. At the
| same time, you need to be able to pick the _proper_ tokens, and
| I agree with you that picking them through a simple heuristic
| is probably not going to be enough.
| gamegoblin wrote:
| Better phrasing would have been "the important tokens are
| roughly evenly distributed throughout the text", that was the
| intended reading.
| bee_rider wrote:
| Are you thinking more like a research paper or more like a
| textbook?
|
| For a textbook at least, it often seems to be the case that
| you need to have fully ingested the big picture ideas of
| one chapter to move on to some later ones, but this seems
| to me at least more like updating your model, rather than
| sampling context from the whole book (I mean it is an
| analogy of course, so neither matches perfectly).
| AndrewKemendo wrote:
| Having studied Sutton for a long time now, what I take away
| from the bitter lesson is that the only pathway to generally
| capable agents is to have the same scale of computational
| capacity in an embodied system as humans or other intelligent
| systems have.
|
| Sutton's point is that this is effectively a product of
| physics, and we keep trying to outsmart physics - but you just
| can't outsmart physics.
|
| So, while the method probably is important in terms of
| efficiency or functionality within the current state of
| technological systems, the method is less important than the
| scale, and we're not even close to the scale necessary yet.
| sdenton4 wrote:
| I don't think it's obvious that we don't have sufficient
| computational scale already...
|
| The human brain has ~86 billion neurons, but they only fire at
| something like 2Hz, so you get on the order of 170 billion
| firings per second. GPT-3 has 175 billion parameters, and can
| apply those parameters much faster than the brain can fire
| neurons.
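|
| Very rough arithmetic (the tokens-per-second figure is just a
| guess, and "parameter applications" vs "synaptic events" is an
| apples-to-oranges comparison at best):
|
|     neurons, rate_hz = 86e9, 2
|     brain_events_per_s = neurons * rate_hz          # ~1.7e11
|
|     params, tokens_per_s = 175e9, 30                # GPT-3 scale
|     model_ops_per_s = 2 * params * tokens_per_s     # ~1e13 mult-adds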
|
| Lots of folks like to point out that there's more complexity
| in neurons than model weights, which is fine, but it's not
| clear what the impact of that additional complexity actually
| is. Extra slow channels (eg, hormonal signals) also aren't an
| impossibility.
|
| So /maybe/ the models need to scale more, or maybe we need to
| figure out some better combination of training tasks to get
| the next big leap. There's massive progress being made with
| multi-modal inputs (which helps the model create a more
| coherent world-model, by relating text to images, audio, or
| video). Data selection - picking good data instead of just
| throwing in everything - is also showing lots of promise.
|
| I tend to think there's some need for more 'interactive' or
| 'social' component to training, eg active/online learning
| with robotics - and figuring out how to get models to 'play'.
| Unstructured play is an essential mechanism in smart animals
| associated with hormonal signals and rewards - it's
| important, and we don't really know how to harness it yet.
|
| But overall, I don't think we're yet at a local maximum.
| There's too much cool stuff going on and the iron is still
| quite hot.
| AndrewKemendo wrote:
| "Embodied" being one of the key things that you're ignoring
|
| The brain != a human agent
|
| You need sensors and effectors and a highly variable motor
| system
|
| You can't be "generally intelligent" if you do not have
| boundaries on your computing system which are mobile and
| have independent actions.
|
| In order to perform as well as, if not better than, a human,
| you need to perform at that level in all possible environments,
| and those include the top of the world, the bottom of the
| ocean, every factory line, flying airplanes, etc...
| sdenton4 wrote:
| How could you do anything intelligent without a strong
| beak?
|
| https://falseknees.tumblr.com/post/654380023602708480
|
| The interactivity I mentioned is the bit that I think is
| actually important from embodiment - the ability to take
| an action in the world, see the result, and adjust
| expectations. What you've called 'independent actions.'
|
| But there's certainly no proof that a general
| intelligence needs to be bounded and mobile - a pedantic
| thought-experiment-counterexample would be an 'uploaded'
| human mind: the people of San Junipero don't stop being
| generally intelligent once they are in a distributed
| simulation...
|
| More generally, we don't actually know the boundaries on
| how general intelligence could arise and what shape it
| could take, because we don't really understand
| intelligence at all.
| euclaise wrote:
| > The bitter lesson [1] is going to eventually come for all of
| these. Eventually we'll figure out how to machine-learn the
| heuristic rather than hard code it. Recurrent neural networks
| (RNNs) do this implicitly, but we don't yet know how to
| effectively train RNNs on ultra-deep sequences.
|
| Linear RNNs and RWKV are examples of RNNs on deep sequences:
|
| https://arxiv.org/abs/2303.06349
|
| https://arxiv.org/abs/2305.13048
| gamegoblin wrote:
| I think the jury is still out on whether these will actually
| scale to ultra-long language understanding sequences. RWKV, for
| example, is still _trained_ like GPT, but is architected so it
| can be run as an RNN at _inference_ time. This is awesome, but
| it is unclear whether the training regime will limit the
| effective use of long-ranging recurrent context.
| euclaise wrote:
| Training in GPT mode vs RNN mode gives you numerically
| identical results with RWKV; they're just two ways of computing
| the same thing. It's trained in GPT mode because it's cheaper
| to train that way -- you can parallelize over the sequence
| length. In practice it isn't going to be any different from
| training with back-propagation through time for the same
| sequence length.
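|
| A toy way to see why the two modes agree (this is just a plain
| linear decay recurrence, not RWKV's actual formulation with its
| normalization and per-channel decays):
|
|     import torch
|
|     def recurrent(k, v, w):
|         # RNN mode: carry a fixed-size state across steps.
|         s, outs = torch.zeros_like(v[0]), []
|         for t in range(len(k)):
|             s = w * s + k[t] * v[t]
|             outs.append(s.clone())
|         return torch.stack(outs)
|
|     def parallel(k, v, w):
|         # GPT mode: every output computed directly from the
|         # whole prefix, parallelizable over the sequence.
|         t = torch.arange(len(k))
|         exponent = (t[:, None] - t[None, :]).clamp(min=0).float()
|         decay = torch.tril(w ** exponent)   # w^(t-i) for i <= t
|         return decay @ (k[:, None] * v)
|
|     k, v = torch.rand(8), torch.rand(8, 4)
|     print(torch.allclose(recurrent(k, v, 0.9), parallel(k, v, 0.9)))
|
| Same numbers either way; the parallel form is just cheaper to
| train because the whole sequence becomes one batched matmul.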
| sdenton4 wrote:
| The work out of that group, starting with S4 layers, is
| 10000% the stuff to be paying attention to.
|
| https://srush.github.io/annotated-s4/
|
| HiPPO was brilliant - instead of working with the raw sequence,
| you work with its weighted Laplace transform, and instead of
| actually computing the transform you find the rule for updating
| it when new data is added. Furthermore, we can 'band limit' the
| Laplace transform (similar to PCA), keeping only the 'most
| important' components while still preserving most of the
| information in the sequence - a common and quite effective
| compression technique.
|
| Any 'fast' transformer is going to be working with some kind
| of sampling or aggregation or compression of the long
| sequence. Sampling is ultimately going to be too noisy, and
| standard aggregations are going to be too coarse. So the
| thing to bet on is better compression techniques, which is
| what the S4/RWKV group are ultimately working on.
| inciampati wrote:
| Can you point to anything public on your last point about
| compression? What is being compressed?
| sdenton4 wrote:
| The sequence of model activations is being compressed. s4
| treats each activation channel as an independent
| sequence, and applies a learned version of the Laplace
| transform, and drops less-significant components.
|
| This is similar to the basic compression you get with PCA or
| Fourier transforms. These transforms are fully invertible until
| you drop the less-significant components. Dropping them lets
| you reconstruct a somewhat degraded version of the input, and
| the transform makes it easy to pick the right components to
| drop.
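|
| Schematically, the "drop less-significant components" idea
| looks like this (using a plain FFT as the invertible transform;
| S4 uses a learned, online-updated basis instead):
|
|     import numpy as np
|
|     def compress(x, keep):
|         # Transform, zero all but the `keep` largest-magnitude
|         # coefficients, then invert: a degraded but cheap-to-
|         # store reconstruction of the sequence.
|         X = np.fft.rfft(x)
|         small = np.argsort(np.abs(X))[:-keep]
|         X[small] = 0
|         return np.fft.irfft(X, n=len(x))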
| DennisP wrote:
| It doesn't sound to me like it's quite "tokens further in the
| past get exponentially less attention." What they say is
| "attention allocation decreases exponentially as the distance
| between tokens grows." Instead of being quadratic because every
| pair of tokens gets the same attention, the tokens farther
| apart _from each other_ get exponentially less. It doesn't
| matter how far they are from the final token.
|
| This seems to me more like a general computational approach
| than a hand-coded heuristic. David Shapiro claims it's similar
| to how the brain works, and has a neat analogy for it here:
| https://www.youtube.com/watch?v=R0wBMDoFkP0
| refulgentis wrote:
| This is intriguing but I don't quite follow - really naive,
| but:
|
| isn't the final token at some position N?
|
| And given a context size limit Y, when we generate the next
| token, right now I get attention from N - Y to N?
|
| And this proposes that I get attention from 0 to N, but the
| attention decreases exponentially as we approach token 0?
| itissid wrote:
| I would like to take a parallel view on the bitter lesson and
| how it's playing out. There are exceptions. It's not only
| computation but also a mix of:
|
| 1. decades of theoretical breakthroughs coming together;
|
| 2. collective human creativity and perseverance.
|
| People like Yann LeCun and Geoff Hinton have been working since
| the 90's, and several milestones were hit, but it only caught
| fire / went on steroids once the application (and the
| associated funding) was found, thanks to creativity in the tech
| sector. Yet if the computation had somehow been available
| earlier, I am not sure it would have happened so quickly.
|
| Another example: not all methods under the AI umbrella depend
| on crazy amounts of computation and data. Take the field of
| autoregressive models in the social and life sciences - for
| example Stan, which broadly does hierarchical Bayesian
| inference using Monte Carlo methods.
|
| It took some hard theoretical advances to move the needle on
| Monte Carlo simulation methods - things like detecting
| convergence and making non-conjugate priors work for posterior
| sampling. The new methods are better by leaps and bounds than
| the conventional methods in the field, and the computation
| available back in 2013 would be enough to run the modern models
| in most cases.
| sashank_1509 wrote:
| Both your points are not really valid. There have been
| decades of theoretical breakthroughs in computational
| linguistics too (Have there been any in Deep Learning?).
| There has also been a large amount of human creativity and
| perseverance in computational linguistics, arguably more than
| the amount I have seen in Deep Learning. Yet, not one useful
| algorithm has come from linguistics. In fact the old adage on
| speech processing can be applied to Natural Language
| Processing: "Every time I fire a linguist my performance
| improves by a few percent".
|
| The bitter lesson is bitter and important to keep in mind
| exactly because human creativity and perseverance do not
| matter in front of it. Consistently, the only methods that
| work are those that scale with computation, everything else
| does not matter. I would take an even more extreme view: if
| computation hadn't followed Moore's law, we wouldn't have
| invented alternate methods that avoid massive computation; we
| would simply have failed at even the most basic tasks of
| intelligence and be stuck in the 1960s. A scary thought, but a
| true one, I reckon. If computation had kept following Moore's
| law but a few stalwarts like Yann LeCun etc. didn't exist, we
| would likely have found alternative architectures that scale
| and work - maybe not as good as ConvNets, but transformers
| aren't as good as ConvNets either, they just need to scale.
| Majromax wrote:
| I'm not sure that the Bitter Lesson is the end of the
| story. The Bitter Corollary seems to be that scaling
| computation also requires scaling data.
|
| Sometimes that's easy; self-play in Go, for example, can
| generate essentially infinite data.
|
| On the other hand, sometimes data isn't infinite. It can _seem_
| infinite, as in the aforementioned NLP work, where a
| computation-heavy ML system can process more data than a human
| can read in their lifetime. However, our LLMs are already
| within an order of magnitude of reading every bit of human
| writing ever, and we're scaling our way to that data limit.
|
| "Clever" human algorithms are all a way of doing more with
| less. People are still more data-efficient learners than
| large ML systems, and I'm less sure that we'll be able to
| compute our way to that kind of efficiency.
| sashank_1509 wrote:
| I think Geoffrey Hinton addresses this point well in his recent
| podcast with Pieter Abbeel. He says, and I paraphrase: current
| deep learning methods are great at learning from large amounts
| of data with a relatively small amount of compute. The human
| brain, on the other hand, with around 150 trillion synapses /
| parameters, has the opposite problem: parameters / compute are
| cheap but data is expensive. It needs to learn a large amount
| from very little data, and a lot of regularization (things like
| dropout) will likely be required to do this without
| overfitting. I think we will have a real shot at AGI once
| 100-trillion-parameter models become feasible, which might
| happen within this decade.
| antonevstigneev wrote:
| [dead]
| Imnimo wrote:
| Without any experiment showing that language modeling performance
| actually continues to improve past 32k tokens using this scheme,
| how are we supposed to tell whether this is actually viable?
| spuz wrote:
| What does the "number of tokens" characteristic of an LLM mean
| exactly? How does 1B compare with GPT-3.5 or GPT-4?
| rising-sky wrote:
| Context length / window. Think of them as the "number of
| words" that the model can effectively process. One token is
| roughly equal to 4 characters or 0.75 words of English text.
| The number of tokens is the total that can fit into the
| context window, which again is the space for "input" (i.e.
| prompts) and output (responses / completions) that the model
| can handle.
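|
| If you want to see the ratio for yourself, OpenAI's tiktoken
| library will count tokens for you (the exact numbers depend on
| the text and the encoding):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     text = "Scaling transformers to one billion tokens."
|     ids = enc.encode(text)
|     print(len(text), len(ids), len(text) / len(ids))
|     # characters, tokens, and the chars-per-token ratio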
| esafak wrote:
| It means the maximum length of your query. The longer the
| context window the more complex questions you can pose. For
| example, being able to paste the text of a whole book and
| asking for a summary.
| PartiallyTyped wrote:
| A sequence of characters is encoded into tokens, tokens are
| grouped characters, each token is mapped to a vector
| representation. When you give text to an LLM, the text is
| encoded into tokens, and each token corresponds to an index.
| Each index corresponds to one vector. The model produces
| vectors, and then finds the most similar vector and selects the
| corresponding index as the next token.
|
| This is a spectrum, you can write a model that works on the bit
| level, so 2 vectors, or byte level, 256, or pairs of bytes,
| 2^16 and so on and so forth.
|
| These days, we use statistical approaches to build the tokens,
| and a token can be 1, 2 or 3 or N characters long.
|
| So when you give a sequence of characters to the model, it
| turns that to a sequence of tokens and loads a vector for each
| one, and when doing computations, it needs to consider all
| tokens together. This is called the context window.
|
| In this case, scaling the number of tokens means scaling the
| context window to a large number.
|
| GPT3.5 can do 2Ki tokens iirc, OpenAI's GPT4 can do 4Ki iirc,
| Claude from anthropic can do 1Mi iirc.
|
| The context window is kinda analogous to your working memory,
| the higher the better, unless there are approximations that
| trade off quality for length, which is what is happening here.
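|
| Schematically (a toy character-level version, leaving out the
| transformer itself):
|
|     import torch
|
|     vocab = list("abcdefgh ")                 # tiny toy vocabulary
|     stoi = {ch: i for i, ch in enumerate(vocab)}
|     emb = torch.randn(len(vocab), 16)         # one vector per token
|
|     def encode(text):
|         return [stoi[ch] for ch in text]      # chars -> token indices
|
|     ids = encode("bad cafe")
|     vectors = emb[ids]        # what the model actually operates on
|
|     def next_token(hidden):
|         # The model emits a vector; pick the vocab entry whose
|         # embedding matches it best (largest dot product).
|         return int((emb @ hidden).argmax())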
| treprinum wrote:
| Original GPT3.5 can do 4k tokens and there is a recent
| version with 16k tokens (gpt-3.5-turbo-16k)
| PartiallyTyped wrote:
| Ahhh thanks for the correction! And iirc GPT-4 has a 20k
| version too.
| cs702 wrote:
| Well, this looks promising. The key idea is to collect a
| different set of tokens, with different levels of sparsity for
| each head, apply regular (dense) self-attention over all heads,
| weighted by pairwise distance, and spread and add the output
| residuals to their corresponding location in the original
| sequence. It seems to work _really well_, judging by the
| perplexity scores shown in the paper -- though we don't yet know
| if those perplexity scores will translate into good performance
| on real-world tasks.
|
| I'm going to take a closer look.
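|
| My rough reading of the sparsification, as a sketch of the
| index selection only (not the authors' code; the segment and
| dilation sizes are illustrative):
|
|     import torch
|
|     def dilated_indices(seq_len, segment, dilation, offset=0):
|         # Within each segment of `segment` tokens, keep every
|         # `dilation`-th token, shifted by `offset`. Dense
|         # attention is then run over just these positions and
|         # the outputs are scattered back to their original
|         # locations. Different heads use different offsets and
|         # dilations, so together they cover the sequence.
|         idx = torch.arange(seq_len)
|         keep = (idx % segment) % dilation == offset % dilation
|         return idx[keep]
|
|     print(dilated_indices(16, segment=8, dilation=4, offset=1))
|     # tensor([ 1,  5,  9, 13])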
| climatologist wrote:
| Does anyone know if 1B tokens is enough to solve sudoku puzzles?
| WanderPanda wrote:
| How stupendous to not put the first figure on a log scale...
| oofbey wrote:
| Cute trick but the opposite of helpful. Is the goal of your
| paper to brag or educate?
| jeron wrote:
| this paper leans towards the former
| jumpCastle wrote:
| Title with a 10-digit number, a meaningless first-page figure,
| and no experiments related to the main claim. Did a rogue
| author post it without permission again?
| jeron wrote:
| As noted, they only did experiments up to 32k length which is
| silly considering the title
| jumpCastle wrote:
| Silly is a charitable interpretation.
| euclaise wrote:
| Important note: They only did experiments up to 32k length
| londons_explore wrote:
| They use perplexity on GitHub data to demonstrate the
| effectiveness of their model.
|
| I suspect GitHub data has a lot of copy-pasted code, i.e. a
| good chunk of what you are asking the model to do is to go
| back X million tokens and copy a chunk verbatim.
|
| Sure, the model might also be looking back at some code X million
| tokens ago and using that to improve its guess of the next token
| (oh look, the API definition of the API I am using is back here,
| that'll help me get this right!).
|
| But the perplexity number alone doesn't differentiate those cases
| - and considering how much code copying/templating happens in
| software, I suspect that affects the perplexity a lot more than
| smartly using stuff from the context window.
|
| I wonder if these models work well on other kinds of data?
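|
| For reference, perplexity is just exp(mean token NLL), which is
| why easy verbatim continuations pull it down so much:
|
|     import torch
|     import torch.nn.functional as F
|
|     def perplexity(logits, targets):
|         # logits: (seq_len, vocab_size), targets: (seq_len,) ids
|         nll = F.cross_entropy(logits, targets)   # mean NLL/token
|         return torch.exp(nll)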
| bratao wrote:
| I need to carefully read the article, but sparse attention is an
| interesting technique that has been used previously (as in
| BigBird) but has often proved to perform (way) worse than full
| attention. The sliding component that performs full attention is
| indeed useful (much like the Blockwise Parallel Transformer), but
| the sparse patterns are elements that don't intuitively resonate
| with me.
|
| The model might select random words in the context. There's
| definitely a case where this could be unfortunate if it ends up
| selecting irrelevant words.
|
| The graph on the first page, in my opinion, seems like a needless
| flex
| londons_explore wrote:
| > The graph on the first page, in my opinion, seems like a
| needless flex
|
| Indeed - they used half of the cover page of their paper to
| show a chart which illustrates... nothing...
| [deleted]
| daemonk wrote:
| This is relevant also:
| https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
___________________________________________________________________
(page generated 2023-07-06 23:00 UTC)