[HN Gopher] Scaling Transformers to 1B Tokens
       ___________________________________________________________________
        
       Scaling Transformers to 1B Tokens
        
       Author : mottiden
       Score  : 177 points
       Date   : 2023-07-06 12:28 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | kytazo wrote:
        | Is it meaningful to assume that sequence length is directly
        | correlated with the context window?
       | 
       | Does this imply similar increases in context in practice?
        
       | gamegoblin wrote:
        | The benefit of "traditional" O(N^2) transformer attention is that
        | you correlate every token with every other token. So, in the
        | limit, your network won't "miss" much.
       | 
       | When you abandon O(N^2) attention, you are forced to start adding
       | heuristics to choose what to correlate. Any time you see one of
       | those giant context window LLMs, you need to be asking what
       | heuristics they added, what is getting correlated, and what is
       | _not_ getting correlated.
       | 
       | This paper chooses an exponential heuristic where tokens further
       | in the past get exponentially less attention. This heuristic is
       | fine for certain tasks like responding in a chat room, where the
       | most recent tokens are the most important, but bad for tasks
       | where tokens are roughly equally important throughout the text,
       | such as a dense academic paper or a reference manual.
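        | 
        | (To make the tradeoff concrete, here's a toy numpy sketch, not
        | from the paper: full attention scores every pair of tokens,
        | while a heuristic mask zeroes most pairs out, and whatever the
        | mask drops simply never gets correlated. The window size below
        | is an arbitrary illustrative choice.)
        | 
        |     import numpy as np
        | 
        |     def attention_weights(x, mask=None):
        |         # Full self-attention builds an N x N score matrix,
        |         # i.e. O(N^2) work and memory.
        |         scores = x @ x.T / np.sqrt(x.shape[-1])
        |         if mask is not None:
        |             # Pairs dropped by the heuristic never interact.
        |             scores = np.where(mask, scores, -np.inf)
        |         w = np.exp(scores - scores.max(-1, keepdims=True))
        |         return w / w.sum(-1, keepdims=True)
        | 
        |     n = 16
        |     x = np.random.randn(n, 8)
        |     i, j = np.meshgrid(np.arange(n), np.arange(n),
        |                        indexing="ij")
        | 
        |     full = attention_weights(x)   # every token sees every token
        |     local = attention_weights(x, mask=np.abs(i - j) <= 2)
        |     print((full > 0).sum(), (local > 0).sum())  # 256 vs. fewer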
       | 
       | The bitter lesson [1] is going to eventually come for all of
       | these. Eventually we'll figure out how to machine-learn the
       | heuristic rather than hard code it. Recurrent neural networks
       | (RNNs) do this implicitly, but we don't yet know how to
       | effectively train RNNs on ultra-deep sequences.
       | 
       | Another possibility is learning a heuristic for non-recurrent
       | LLMs via reinforcement learning, such as in [2], which is
       | basically a reinforcement learned "auto-researcher" that was
       | trained in a style reminiscent of AlphaGo.
       | 
       | [1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
       | 
       | [2] https://arxiv.org/pdf/2109.00527.pdf
        
         | CuriouslyC wrote:
          | It seems like building a context tree with a convex branch
          | cross-attention estimator, then using branch and bound to prune
          | the tree while descending to get exact cross-attention when
          | it's above a threshold, would work pretty well, assuming the
          | cross-attention matrix actually is very sparse and the trouble
          | is just accurately guessing the non-sparse elements.
        
           | zzzzzzzza wrote:
            | This sounds to me like a dollar-cost-averaging strategy -
            | only buy in when the current price falls below an n-day
            | moving average.
            | 
            | I doubt there is any risk-adjusted alpha to the strategy - in
            | practice it's my (newbie) understanding that the only thing
            | that differentiates such strategies in the broader scheme of
            | things is tax efficiency.
            | 
            | However, I am also not an ML expert.
        
         | phillipcarter wrote:
          | This comment makes so much sense relative to what I've seen
          | with Claude's 1M context window. It reliably fails at a task
          | when I just stuff a big blob of data into the middle of the
          | prompt as context. But when I use embeddings to select only
          | a small relevant subset of that data, it always passes the
          | task.
        
           | gamegoblin wrote:
           | Yes, Claude 1M is using all sorts of approximation tricks to
           | get that 1M context window. IMO this is actually quite
           | deceptive marketing.
        
             | l1n wrote:
             | Claude's context is 100K not 1M [1]. If you're somehow
             | shoving in a million tokens that could explain the issue
             | you're having!
             | 
             | [1] https://www.anthropic.com/index/100k-context-windows
        
               | gamegoblin wrote:
                | Misremembered; the main thrust of the comment still
                | stands: the 100K context window isn't "real", since it
                | would be absurdly expensive to do for real. They are
                | using a lot of approximation tricks to get there.
        
             | dpflan wrote:
              | Yes, that's the point now for competing for-profit AI
              | research companies: whatever metric is technical and sounds
              | important is going to be used in marketing and valuation
              | determinations. It will be explored for research, I'm sure,
              | and then its product viability determined. It's nice
              | competition, but I agree that it can be deceptive.
        
         | im3w1l wrote:
          | These models seem to be able to cope with absolutely massive
          | training sets, whereas the prompt input has to be quite small
          | in comparison.
          | 
          | I wonder if one could leverage this state of affairs by
          | shifting the prompt from input to training data - like taking
          | a generic model and running a little bit of fine-tuning on the
          | prompt.
        
         | hotstickyballs wrote:
          | Recurrent networks are exponential too: they blow up and decay
          | exponentially. So this is not necessarily worse than RNNs.
        
         | novok wrote:
          | One clever trick I've seen: before text goes out of the
          | context window, it gets summarized by the LLM, and that
          | smaller summary is put into the context window and
          | continuously updated. It reminds me of how human memory works
          | too.
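          | 
          | (A rough sketch of that loop, not any particular library's
          | API; `summarize` and `generate` stand in for whatever LLM
          | calls you'd actually use, and the ~4 chars/token estimate is
          | just a rule of thumb.)
          | 
          |     def rolling_chat(messages, summarize, generate,
          |                      max_context_tokens=4000):
          |         # Keep recent messages verbatim; fold older ones
          |         # into a continuously updated summary.
          |         def est_tokens(msgs):
          |             # Crude estimate: ~4 characters per token.
          |             return sum(len(m) for m in msgs) // 4
          | 
          |         summary, window = "", []
          |         for msg in messages:
          |             window.append(msg)
          |             while est_tokens(window) > max_context_tokens:
          |                 oldest = window.pop(0)
          |                 # Fold the evicted text into the summary.
          |                 summary = summarize(summary + "\n" + oldest)
          | 
          |         prompt = ("Summary of earlier conversation:\n"
          |                   + summary + "\n\nRecent messages:\n"
          |                   + "\n".join(window))
          |         return generate(prompt)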
        
         | mochomocha wrote:
         | While I agree with the beginning of your post, you lost me
         | here:
         | 
         | > The bitter lesson [1] is going to eventually come for all of
         | these. Eventually we'll figure out how to machine-learn the
         | heuristic rather than hard code it.
         | 
         | Inefficiently re-learning over and over patterns that can be
         | more explicitly encoded as smart inductive biases for better
         | sample efficiency is what ML research is.
         | 
         | The "bitter lesson" doesn't mean "throw the towel and your
         | brain and just buy more GPUs". It means that the inductive
         | biases / modeling strategies that win will always be the ones
         | that are more hardware-friendly.
        
           | gamegoblin wrote:
           | I agree with you that learning certain things is wasteful.
           | 
            | For instance, one could imagine an RNN that learned to do
            | some approximation of tree search for playing games like
            | Chess and Go. But we have _very_ good reason to think that
            | tree search
           | is basically exactly what you want, so even systems like
           | AlphaGo have the tree search implemented outside the neural
           | net, but still using a learned system to heuristically guide
           | the tree search.
           | 
           | The reference to the bitter lesson here is that feature
           | engineering has, thus far, typically lost out to more general
           | end-to-end methods in the long run.
           | 
           | This paper tries to do feature engineering by hand-coding an
           | exponentially decaying mechanism, where tokens further in the
           | past are assumed to be less important.
           | 
           | My comment is that this type of hand-engineering will lose
           | out to methods that are more end-to-end learned. These
           | methods do not necessarily need to be hugely computationally
           | intensive ("buy more GPUs").
           | 
           | That said, I could see it being the case that in the short
           | term, we do just buy more GPUs, learn a general end-to-end
           | algorithm, but eventually figure out how to re-implement that
           | end-to-end learned algorithm in code significantly more
           | efficiently.
        
           | og_kalu wrote:
           | By and large, we don't really know what inductive biases we
            | ought to be shoving into models. Sometimes we think we do,
           | but we're wrong more often than not. So methods with the
           | least inductive biases work better.
        
         | littlestymaar wrote:
         | Not disagreeing with your comment in general, but this
         | particular sentence annoys me a bit:
         | 
         | > where tokens are roughly equally important throughout the
         | text, such as a dense academic paper or a reference manual.
         | 
          | Even in these, not all tokens are equal: most of a text is
          | actually pretty low-information, with key packs of tokens that
          | contain most of the information you're going to need
          | throughout the entire text (that's why we use highlighters
          | when learning). That's why O(n^2) attention is pretty
          | wasteful. At the same time, you need to be able to pick the
          | _proper_ tokens, and I agree with you that picking them
          | through a simple heuristic is probably not going to be enough.
        
           | gamegoblin wrote:
           | Better phrasing would have been "the important tokens are
           | roughly evenly distributed throughout the text", that was the
           | intended reading.
        
             | bee_rider wrote:
             | Are you thinking more like a research paper or more like a
             | textbook?
             | 
             | For a textbook at least, it often seems to be the case that
             | you need to have fully ingested the big picture ideas of
             | one chapter to move on to some later ones, but this seems
             | to me at least more like updating your model, rather than
             | sampling context from the whole book (I mean it is an
             | analogy of course, so neither matches perfectly).
        
         | AndrewKemendo wrote:
          | Having studied Sutton for a long time now, what I take away
          | from the bitter lesson is that the only pathway to generally
          | capable agents is to have the same scale of computational
          | capacity in an embodied system as humans or other intelligent
          | systems have.
          | 
          | Sutton's point is that it's effectively a product of physics,
          | and we keep trying to outsmart physics - but you just can't
          | outsmart physics.
         | 
         | So, while the method probably is important in terms of
         | efficiency or functionality within the current state of
         | technological systems, the method is less important than the
         | scale, and we're not even close to the scale necessary yet.
        
           | sdenton4 wrote:
           | I don't think it's obvious that we don't have sufficient
           | computational scale already...
           | 
            | The human brain has ~86 billion neurons, but they only fire
            | at like 2Hz, so you get roughly 172 billion firings per
            | second. GPT-3 has 175 billion parameters, and can apply
            | those parameters much faster than the brain can fire
            | neurons.
           | 
           | Lots of folks like to point out that there's more complexity
           | in neurons than model weights, which is fine, but it's not
           | clear what the impact of that additional complexity actually
           | is. Extra slow channels (eg, hormonal signals) also aren't an
           | impossibility.
           | 
           | So /maybe/ the models need to scale more, or maybe we need to
           | figure out some better combination of training tasks to get
           | the next big leap. There's massive progress being made with
           | multi-modal inputs (which helps the model create a more
           | coherent world-model, by relating text to images, audio, or
            | video). Data selection - picking good data instead of just
            | throwing in everything - is also showing lots of promise.
           | 
           | I tend to think there's some need for more 'interactive' or
           | 'social' component to training, eg active/online learning
           | with robotics - and figuring out how to get models to 'play'.
           | Unstructured play is an essential mechanism in smart animals
           | associated with hormonal signals and rewards - it's
           | important, and we don't really know how to harness it yet.
           | 
           | But overall, I don't think we're yet at a local maximum.
           | There's too much cool stuff going on and the iron is still
           | quite hot.
        
             | AndrewKemendo wrote:
             | "Embodied" being one of the key things that you're ignoring
             | 
              | The brain =/= a human agent
             | 
             | You need sensors and effectors and a highly variable motor
             | system
             | 
             | You can't be "generally intelligent" if you do not have
             | boundaries on your computing system which are mobile and
             | have independent actions.
             | 
              | In order to perform as well as, if not better than, a
              | human, you need to perform as well as, if not better
              | than, a human in all possible environments, and those
              | include the top of the world, the bottom of the ocean,
              | every factory line, flying airplanes, etc...
        
               | sdenton4 wrote:
               | How could you do anything intelligent without a strong
               | beak?
               | 
               | https://falseknees.tumblr.com/post/654380023602708480
               | 
               | The interactivity I mentioned is the bit that I think is
               | actually important from embodiment - the ability to take
               | an action in the world, see the result, and adjust
               | expectations. What you've called 'independent actions.'
               | 
               | But there's certainly no proof that a general
               | intelligence needs to be bounded and mobile - a pedantic
               | thought-experiment-counterexample would be an 'uploaded'
               | human mind: the people of San Junipero don't stop being
               | generally intelligent once they are in a distributed
               | simulation...
               | 
               | More generally, we don't actually know the boundaries on
               | how general intelligence could arise and what shape it
               | could take, because we don't really understand
               | intelligence at all.
        
         | euclaise wrote:
         | > The bitter lesson [1] is going to eventually come for all of
         | these. Eventually we'll figure out how to machine-learn the
         | heuristic rather than hard code it. Recurrent neural networks
         | (RNNs) do this implicitly, but we don't yet know how to
         | effectively train RNNs on ultra-deep sequences.
         | 
         | Linear RNNs and RWKV are examples of RNNs on deep sequences:
         | 
         | https://arxiv.org/abs/2303.06349
         | 
         | https://arxiv.org/abs/2305.13048
        
           | gamegoblin wrote:
            | I think the jury is still out on whether these will actually
            | scale to ultra-long language understanding sequences. RWKV,
            | for example, is still _trained_ like GPT, but is architected
            | so it can be run as an RNN at _inference_ time. This is
            | awesome, but it is unclear whether the training regime will
            | limit the effective use of long-ranging recurrent context.
        
             | euclaise wrote:
              | Training as GPT vs RNN will give you numerically identical
              | results with RWKV; it's just two ways of computing the
              | same thing. It's trained in GPT-mode because it's cheaper
              | to train that way -- you can parallelize over the sequence
              | length. In practice it isn't going to be any different
              | from training with back-propagation through time for the
              | same sequence length.
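              | 
              | (The same-numbers property is easiest to see with a plain
              | linear recurrence - a toy stand-in, not RWKV's actual
              | formula: you can compute it step by step like an RNN, or
              | all positions at once in parallel, and get identical
              | results.)
              | 
              |     import numpy as np
              | 
              |     x = np.random.randn(10)  # per-token inputs
              |     decay = 0.9              # toy fixed decay
              | 
              |     # RNN-style: one step at a time, O(1) state.
              |     state, recurrent = 0.0, []
              |     for xt in x:
              |         state = decay * state + xt
              |         recurrent.append(state)
              | 
              |     # GPT-style: all positions at once; this
              |     # parallelizes over the sequence length.
              |     t = np.arange(len(x))
              |     w = np.tril(decay ** (t[:, None] - t[None, :]))
              |     parallel = w @ x
              | 
              |     assert np.allclose(recurrent, parallel)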
        
           | sdenton4 wrote:
           | The work out of that group, starting with S4 layers, is
           | 10000% the stuff to be paying attention to.
           | 
           | https://srush.github.io/annotated-s4/
           | 
            | HiPPO was brilliant - instead of working with the raw
            | sequence, you work with its weighted Laplace transform, and
            | instead of actually computing the Laplace transform you find
            | the rule to update it when new data is added. Furthermore,
            | we can 'band limit' the Laplace transform (similar to PCA)
            | to keep only the 'most important' information while still
            | preserving most of it - this is a common and quite effective
            | compression technique.
           | 
           | Any 'fast' transformer is going to be working with some kind
           | of sampling or aggregation or compression of the long
           | sequence. Sampling is ultimately going to be too noisy, and
           | standard aggregations are going to be too coarse. So the
           | thing to bet on is better compression techniques, which is
           | what the S4/RWKV group are ultimately working on.
        
             | inciampati wrote:
             | Can you point to anything public on your last point about
             | compression? What is being compressed?
        
               | sdenton4 wrote:
                | The sequence of model activations is being compressed.
                | S4 treats each activation channel as an independent
                | sequence, applies a learned version of the Laplace
                | transform, and drops less-significant components.
                | 
                | This is similar to the basic compression you get with
                | PCA or Fourier transforms. These transforms are fully
                | invertible until you drop the less-significant
                | components. Dropping less-significant components lets
                | you reconstruct some degraded version of the input, and
                | the transform makes it easy to pick the right components
                | to drop.
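                | 
                | (A minimal illustration of that "transform, drop small
                | components, invert" idea with a plain FFT - the S4
                | machinery is learned and different, this is just the
                | compression intuition.)
                | 
                |     import numpy as np
                | 
                |     # A smooth-ish 1-D "activation channel".
                |     x = np.cumsum(np.random.randn(256))
                | 
                |     coeffs = np.fft.rfft(x)  # invertible transform
                |     keep = 16                # components to keep
                |     drop = np.argsort(np.abs(coeffs))[:-keep]
                |     coeffs[drop] = 0         # drop the small ones
                | 
                |     # Degraded but close reconstruction of the input.
                |     x_hat = np.fft.irfft(coeffs, n=len(x))
                |     err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
                |     print(err)               # small relative error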
        
         | DennisP wrote:
         | It doesn't sound to me like it's quite "tokens further in the
         | past get exponentially less attention." What they say is
         | "attention allocation decreases exponentially as the distance
         | between tokens grows." Instead of being quadratic because every
         | pair of tokens gets the same attention, the tokens farther
          | apart _from each other_ get exponentially less. It doesn't
          | matter how far they are from the final token.
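          | 
          | (A tiny sketch of that reading - the decay rate `alpha` here
          | is just for illustration: the decay factor depends on the
          | pairwise distance |i - j|, so every token's attention falls
          | off around its own position, not relative to the final
          | token.)
          | 
          |     import numpy as np
          | 
          |     n, alpha = 8, 0.5
          |     i, j = np.meshgrid(np.arange(n), np.arange(n),
          |                        indexing="ij")
          |     decay = np.exp(-alpha * np.abs(i - j))
          | 
          |     print(decay[3])  # token 3 favors its own neighborhood
          |     print(decay[7])  # the final token does the same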
         | 
         | This seems to me more like a general computational approach
         | than a hand-coded heuristic. David Shapiro claims it's similar
         | to how the brain works, and has a neat analogy for it here:
         | https://www.youtube.com/watch?v=R0wBMDoFkP0
        
           | refulgentis wrote:
           | This is intriguing but I don't quite follow - really naive,
           | but:
           | 
            | isn't the final token at some position N?
           | 
           | And given context size limit Y, when we generate the next
           | token, right now I get attention from N - Y to N?
           | 
           | And this supposes I get attention from 0 to N, but the
           | attention decreases exponentially as we approach token 0?
        
         | itissid wrote:
          | I would like to take a parallel view of the Bitter Lesson and
          | how it's playing out. There are exceptions. It's not only
          | computation but also a mix of:
          | 
          | 1. decades of theoretical breakthroughs coming together, and
          | 2. collective human creativity and perseverance.
          | 
          | Yann LeCun, Geoff Hinton, etc. have been working since the
          | 90's, and several milestones were hit, but it only caught fire
          | / went on steroids once the application (and the associated
          | funding) was found, thanks to creativity in the tech sector.
          | But if the computation had somehow been available earlier, I
          | am not sure it would have happened so quickly.
          | 
          | Another example: not all methods under the AI umbrella depend
          | on crazy amounts of computation and data. Take autoregressive
          | models in the social/life sciences. For example, look at Stan,
          | which broadly does hierarchical Bayesian inference using Monte
          | Carlo based methods in social science.
          | 
          | It took some hard theoretical advances to move the needle on
          | Monte Carlo simulation methods, like detecting convergence and
          | the ability to make non-conjugate priors work for posterior
          | sampling. The new methods are better by leaps and bounds than
          | the conventional methods in the field, and the computation
          | available in 2013 would be enough to run the modern models in
          | most cases.
        
           | sashank_1509 wrote:
           | Both your points are not really valid. There have been
           | decades of theoretical breakthroughs in computational
           | linguistics too (Have there been any in Deep Learning?).
           | There has also been a large amount of human creativity and
           | perseverance in computational linguistics, arguably more than
           | the amount I have seen in Deep Learning. Yet, not one useful
           | algorithm has come from linguistics. In fact the old adage on
           | speech processing can be applied to Natural Language
           | Processing: "Every time I fire a linguist my performance
           | improves by a few percent".
           | 
            | The bitter lesson is bitter and important to keep in mind
            | exactly because human creativity and perseverance do not
            | matter in front of it. Consistently, the only methods that
            | work are those that scale with computation; everything else
            | does not matter. I would take a more extreme view: if
            | computation hadn't followed Moore's law, we wouldn't have
            | invented alternative methods that don't require massive
            | computation; we would simply have failed to do even the most
            | basic tasks of intelligence and be stuck in the 1960s. A
            | scary thought, but a true one, I reckon. If computation kept
            | following Moore's law but a few stalwarts like Yann LeCun
            | etc. didn't exist, we would likely have found alternative
            | architectures that scale and work - maybe not as good as
            | ConvNets, but transformers aren't as good as ConvNets
            | either; they just need to scale.
        
             | Majromax wrote:
             | I'm not sure that the Bitter Lesson is the end of the
             | story. The Bitter Corollary seems to be that scaling
             | computation also requires scaling data.
             | 
             | Sometimes that's easy; self-play in Go, for example, can
             | generate essentially infinite data.
             | 
              | On the other hand, sometimes data isn't infinite. It can
              | _seem_ infinite, as in the aforementioned NLP work, where
              | computation-heavy ML systems can process more data than a
              | human can read in their lifetime. However, our LLMs are
              | already within an order of magnitude of reading every bit
              | of human writing ever, and we're scaling our way to that
              | data limit.
             | 
             | "Clever" human algorithms are all a way of doing more with
             | less. People are still more data-efficient learners than
             | large ML systems, and I'm less sure that we'll be able to
             | compute our way to that kind of efficiency.
        
               | sashank_1509 wrote:
               | I think Geoffrey Hinton addresses this point well in his
                | recent podcast with Pieter Abbeel. He says, and I
                | paraphrase: current deep learning methods are great at
                | learning from large amounts of data with a relatively
                | small amount of compute. The human brain, on the other
                | hand, with around 150 trillion synapses/parameters, has
                | the opposite problem: parameters/compute are cheap but
                | data is expensive. It needs to learn a large amount from
                | very little data, and it is likely that a large amount
                | of regularization (things like dropout) will be required
                | to do this without over-fitting. I think we will have a
                | real shot at AGI once 100-trillion-parameter models
                | become feasible, which might happen within this decade.
        
         | antonevstigneev wrote:
         | [dead]
        
       | Imnimo wrote:
       | Without any experiment showing that language modeling performance
       | actually continues to improve past 32k tokens using this scheme,
       | how are we supposed to tell whether this is actually viable?
        
       | spuz wrote:
       | What does the "number of tokens" characteristic of an LLM mean
       | exactly? How does 1B compare with GPT-3.5 or GPT-4?
        
         | rising-sky wrote:
          | Context length / window. Think of it as the "number of words"
          | that the model can effectively process. 1 token is roughly
          | equal to 4 characters or 0.75 words for English text. The
          | number of tokens is the total number that can fit into the
          | context window, which again is the space for "input" (i.e.
          | prompts) and output (response/completions) that the model can
          | handle.
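          | 
          | (A back-of-the-envelope converter using those rough ratios;
          | real tokenizers will differ.)
          | 
          |     def rough_token_estimate(text):
          |         # Rule of thumb: ~4 chars or ~0.75 words per token.
          |         by_chars = len(text) / 4
          |         by_words = len(text.split()) / 0.75
          |         return round((by_chars + by_words) / 2)
          | 
          |     print(rough_token_estimate("Scaling to a billion tokens."))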
        
         | esafak wrote:
          | It means the maximum length of your query. The longer the
          | context window, the more complex the questions you can pose -
          | for example, pasting the text of a whole book and asking for a
          | summary.
        
         | PartiallyTyped wrote:
          | A sequence of characters is encoded into tokens; tokens are
          | groups of characters, and each token is mapped to a vector
          | representation. When you give text to an LLM, the text is
          | encoded into tokens, and each token corresponds to an index.
          | Each index corresponds to one vector. The model produces
          | vectors, then finds the most similar vector and selects the
          | corresponding index as the next token.
          | 
          | This is a spectrum: you can write a model that works at the
          | bit level (2 vectors), the byte level (256), pairs of bytes
          | (2^16), and so on and so forth.
          | 
          | These days, we use statistical approaches to build the tokens,
          | and a token can be 1, 2, 3, or N characters long.
          | 
          | So when you give a sequence of characters to the model, it
          | turns them into a sequence of tokens and loads a vector for
          | each one, and when doing computations, it needs to consider
          | all tokens together. This is called the context window.
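          | 
          | (A minimal character-level version of that pipeline - real
          | tokenizers are learned statistically as described above; this
          | toy just shows the text -> index -> vector -> index -> text
          | round trip.)
          | 
          |     import numpy as np
          | 
          |     vocab = sorted(set("abcdefghijklmnopqrstuvwxyz ,."))
          |     to_idx = {ch: i for i, ch in enumerate(vocab)}
          |     # One vector per token in the vocabulary.
          |     emb = np.random.randn(len(vocab), 8)
          | 
          |     def encode(text):
          |         return [to_idx[c] for c in text.lower() if c in to_idx]
          | 
          |     def decode(indices):
          |         return "".join(vocab[i] for i in indices)
          | 
          |     tokens = encode("hello, world")  # text -> token indices
          |     vectors = emb[tokens]            # indices -> vectors
          | 
          |     # A model emits one vector per step; picking the closest
          |     # embedding gives back an index, which decodes to text.
          |     d = ((emb[None, :, :] - vectors[:, None, :]) ** 2).sum(-1)
          |     print(decode(np.argmin(d, axis=1)))  # "hello, world"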
         | 
         | In this case, scaling the number of tokens means scaling the
         | context window to a large number.
         | 
         | GPT3.5 can do 2Ki tokens iirc, OpenAI's GPT4 can do 4Ki iirc,
         | Claude from anthropic can do 1Mi iirc.
         | 
         | The context window is kinda analogous to your working memory,
         | the higher the better, unless there are approximations that
         | trade off quality for length, which is what is happening here.
        
           | treprinum wrote:
           | Original GPT3.5 can do 4k tokens and there is a recent
           | version with 16k tokens (gpt-3.5-turbo-16k)
        
             | PartiallyTyped wrote:
             | Ahhh thanks for the correction! And iirc GPT-4 has a 20k
             | version too.
        
       | cs702 wrote:
       | Well, this looks promising. The key idea is to collect a
       | different set of tokens, with different levels of sparsity for
       | each head, apply regular (dense) self-attention over all heads,
       | weighted by pairwise distance, and spread and add the output
        | residuals to their corresponding locations in the original
        | sequence. It seems to work _really well_, judging by the
       | perplexity scores shown in the paper -- though we don't yet know
       | if those perplexity scores will translate into good performance
       | on real-world tasks.
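        | 
        | (My rough reading of that recipe as a sketch, not the paper's
        | code: each "head" sees a strided subset of positions, runs
        | ordinary dense attention on that short subset, and the outputs
        | are added back at their original positions. The strides and the
        | toy dense_attention here are illustrative only.)
        | 
        |     import numpy as np
        | 
        |     def dense_attention(x):
        |         # Ordinary softmax self-attention over an
        |         # already-selected short subset of tokens.
        |         s = x @ x.T / np.sqrt(x.shape[-1])
        |         w = np.exp(s - s.max(-1, keepdims=True))
        |         return (w / w.sum(-1, keepdims=True)) @ x
        | 
        |     def sparse_heads(x, strides=(1, 2, 4, 8)):
        |         # Each head gets a different level of sparsity
        |         # (stride), attends densely over its subset, and its
        |         # output is scattered back to the original locations.
        |         out = np.zeros_like(x)
        |         for s in strides:
        |             idx = np.arange(0, len(x), s)
        |             out[idx] += dense_attention(x[idx])
        |         return out
        | 
        |     x = np.random.randn(32, 16)
        |     print(sparse_heads(x).shape)  # (32, 16)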
       | 
       | I'm going to take a closer look.
        
       | climatologist wrote:
       | Does anyone know if 1B tokens is enough to solve sudoku puzzles?
        
       | WanderPanda wrote:
       | How stupendous to not put the first figure on a log scale...
        
         | oofbey wrote:
         | Cute trick but the opposite of helpful. Is the goal of your
         | paper to brag or educate?
        
           | jeron wrote:
           | this paper leans towards the former
        
       | jumpCastle wrote:
        | Title with a 10-digit number, a meaningless first-page figure,
        | and no experiments related to the main claim. Did a rogue author
        | post it without permission again?
        
         | jeron wrote:
          | As noted, they only did experiments up to 32k length, which is
          | silly considering the title.
        
           | jumpCastle wrote:
           | Silly is a charitable interpretation.
        
       | euclaise wrote:
       | Important note: They only did experiments up to 32k length
        
       | londons_explore wrote:
        | They use perplexity on GitHub data to demonstrate the
        | effectiveness of their model.
        | 
        | I suspect GitHub data has a lot of copy-pasted code. I.e. a good
        | chunk of what you are asking the model to do is to go back X
        | million tokens and copy a chunk verbatim.
       | 
       | Sure, the model might also be looking back at some code X million
       | tokens ago and using that to improve its guess of the next token
       | (oh look, the API definition of the API I am using is back here,
       | that'll help me get this right!).
       | 
       | But the perplexity number alone doesn't differentiate those cases
       | - and considering how much code copying/templating happens in
       | software, I suspect that affects the perplexity a lot more than
       | smartly using stuff from the context window.
       | 
       | I wonder if these models work well on other kinds of data?
        
       | bratao wrote:
        | I need to carefully read the article, but sparse attention is an
        | interesting technique that has been used previously (as in
        | BigBird) and has often proved to perform (way) worse than full
        | attention. The sliding component that performs full attention is
       | indeed useful (much like the Blockwise Parallel Transformer), but
       | the sparse patterns are elements that don't intuitively resonate
       | with me.
       | 
       | The model might select random words in the context. There's
       | definitely a case where this could be unfortunate if it ends up
       | selecting irrelevant words.
       | 
       | The graph on the first page, in my opinion, seems like a needless
       | flex
        
         | londons_explore wrote:
         | > The graph on the first page, in my opinion, seems like a
         | needless flex
         | 
         | Indeed - they used half of the cover page of their paper to
         | show a chart which illustrates... nothing...
        
       | [deleted]
        
       | daemonk wrote:
       | This is relevant also:
       | https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
        
       ___________________________________________________________________
       (page generated 2023-07-06 23:00 UTC)