[HN Gopher] LongRoPE: Extending LLM Context Window Beyond 2M Tokens
       ___________________________________________________________________
        
       LongRoPE: Extending LLM Context Window Beyond 2M Tokens
        
       Author : nojito
       Score  : 113 points
       Date   : 2024-02-22 10:44 UTC (12 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | Delumine wrote:
        | When we consider the world as a whole, the mere existence of
        | computer components doesn't mean we've reached the pinnacle of
        | technological advancement. I firmly believe our collective
        | intelligence is deeply embedded in our language, whether
        | conventional or programming-based. As language models advance
        | daily, we're improving the efficacy of what we can currently
        | regard as nascent artificial "intelligence".
        | 
        | This is precisely why programs fed with innovative models and
        | backed by substantial compute can develop reasoning akin to Q*.
        | Once these models start to independently drive these advancements
        | and self-improve, we'll see an unprecedented surge in AI
        | development, surpassing our current capabilities.
        
         | visarga wrote:
          | I agree with the part where you said our collective
          | intelligence is embedded in language. Intelligence is a social
          | process; none of us are that great alone. We like to forget
          | that and assume it's all in our heads or in the models.
        
       | dmezzetti wrote:
       | It's still up for debate as to whether long context windows are
       | worth it. It's also not a cheap way to solve a problem.
       | 
       | Clearly hardware vendors love LLMs. But it's just a highly
       | inefficient approach for a lot of problems.
        
         | CharlesW wrote:
          | > _It's still up for debate as to whether long context windows
         | are worth it._
         | 
         | As someone who's run into problems related to short context
         | windows quite often, can you explain what "worth it" means?
         | Also, when you say "not cheap", what alternative do you have in
         | mind?
        
           | esafak wrote:
           | Continuous fine tuning so you don't have to pass everything
           | in the context with every query would definitely yield a
           | better UX.
        
             | littlestymaar wrote:
              | Afaik the few-shot learning abilities of LLMs are much
              | better than what you can achieve with fine tuning when you
              | only have tiny samples.
              | 
              | Of course that's true for real long context, but it's not
              | clear whether it's going to work with the sparse-context
              | hacks intended to keep memory usage low.
              | 
              | Having the ability to keep the same coding session going
              | forever, without the LLM forgetting things all the time and
              | making the same mistakes over and over, would be a game
              | changer.
        
             | nl wrote:
             | I'd argue the fine-tuned UX is often worse.
             | 
             | People like RAG-based solutions because you can include
              | references in your answer very easily (e.g. Perplexity, or
             | see the DAnswer "internal search" product launched today on
             | HN). That is extremely hard to make work reliably from a
             | fine-tuned model.
        
               | esafak wrote:
               | I mean UX from the perspective of the person doing the
               | search, not the engineer. I don't dispute that fine
               | tuning is harder to implement.
        
           | dmezzetti wrote:
            | Retrieval Augmented Generation (RAG) with a relevant context.
            | Long contexts continue to suffer from the "lost in the
            | middle" problem.
        
             | nl wrote:
              | Gemini doesn't have the "lost in the middle" syndrome.
        
         | pixl97 wrote:
         | A cheap way to solve a problem has always been throwing more
         | compute power at it.
         | 
          | Figuring out new algorithms is monumentally more difficult than
          | adding more hardware, and even when more efficient algorithms
          | are found, we throw more hardware at them and get a million
          | times more done.
        
           | szundi wrote:
            | The billions of dollars already spent on hardware kind of
            | make it rational to try.
        
         | tkellogg wrote:
          | It depends on what problem you're solving. If it's a high-
          | frequency request, like a chat response, it's far too
          | inefficient. Most web APIs would consider it bad practice to
          | read 2MB of data on every request, even worse when you consider
          | all the LLM computation on top. Instead, use RAG and pull
          | targeted info out of some sort of low-latency database.
          | 
          | However, caching might be a sweet spot for these multi-modal,
          | large-context LLMs. Take a bunch of documents and perform
          | reasoning tasks to distill the knowledge down into something
          | like a knowledge graph, to be used in RAG.
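          | 
          | Roughly what "pull targeted info out of a low-latency database"
          | looks like, as a minimal sketch (embed() here is a hypothetical
          | stand-in for whatever embedding model and vector store you'd
          | actually use):
          | 
          |     # Retrieve only the top-k relevant chunks instead of
          |     # shipping megabytes of context on every request.
          |     import numpy as np
          | 
          |     def embed(text: str) -> np.ndarray:
          |         # placeholder pseudo-embedding so the sketch runs;
          |         # swap in a real embedding model in practice
          |         rng = np.random.default_rng(abs(hash(text)) % 2**32)
          |         v = rng.standard_normal(128)
          |         return v / np.linalg.norm(v)
          | 
          |     docs = ["chunk about billing", "chunk about auth",
          |             "chunk about password resets"]
          |     doc_vecs = np.stack([embed(d) for d in docs])
          | 
          |     def retrieve(query: str, k: int = 2) -> list[str]:
          |         scores = doc_vecs @ embed(query)  # cosine similarity
          |         return [docs[i] for i in np.argsort(scores)[::-1][:k]]
          | 
          |     question = "How do I reset a password?"
          |     context = "\n".join(retrieve(question))
          |     prompt = f"Answer from this context only:\n{context}\n\nQ: {question}"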
        
         | wongarsu wrote:
         | Having them available is still hugely beneficial, even if they
         | end up too expensive to use in most production use-cases.
         | 
         | They would still be valuable for prototyping, where fast
         | iteration makes it possible to learn more about the problem you
         | are solving and whether it is even worth solving. They are also
         | valuable for iteration-and-distillation approaches, where you
         | can use the data generated from an expensive model to train a
         | cheaper model.
        
         | Fripplebubby wrote:
          | The cool thing about this paper is that it's actually a
          | (relatively) cheap way to solve this particular problem (an LLM
          | with a 2048k window): they take pre-trained models like LLaMA2
          | and Mistral and extend them to 2048k windows with a novel
          | technique, rather than training a model from scratch at a 2048k
          | context length, which would be prohibitively expensive for mere
          | mortals.
        
         | ogogmad wrote:
          | It seems like a very long input to an LLM can act as additional
          | training data. A possible future is that LLMs will be used to
          | bootstrap AIs that are "trained" by feeding them a giant prompt
          | at the start. It might end the era of "supervised" learning,
          | "reinforcement" learning, numerical methods and gradient
          | descent; those techniques are awkward and procrustean. Imagine
          | what you could do if you could optimize a neural network just
          | by talking to it in English.
          | 
          | So it looks like a VERY good idea. Who gets the gist of what
          | I'm saying?
        
       | jonathan-adly wrote:
        | I didn't really trust Google's 1M+ context. But I trust this
        | paper, and it suggests 1M+ context will be available everywhere.
        | 
        | I know people complain about hardware and compute resources. But
        | this is like complaining about Python resource usage in the early
        | 90's. Development complexity and developer resources are far more
        | expensive than chips in the long run. I am personally re-
        | organizing my AI organization to move away from complex RAG
        | setups and get comfortable with long-context workflows.
        | 
        | Just to be clear, I also think inference-optimized chips are the
        | next frontier; Nvidia GPUs were designed and built in a different
        | age than what's going on now.
        
         | foobiekr wrote:
          | This is 100% not true. Development resources would need to be
          | enormous relative to chips and electricity for your statement
          | to hold.
          | 
          | It's really not even true for your Python example.
        
         | sebzim4500 wrote:
          | Quite a few people have access to Gemini's 1M context length,
          | and it does seem to work very well.
          | 
          | E.g. it can produce accurate scene-by-scene descriptions of
          | hour-long movies, answer questions about very large codebases,
          | etc.
        
           | devinprater wrote:
           | This is a dream come true for me. Full movie descriptions!
            | Or, uh, more likely for me, full video game playthrough
           | descriptions!
        
             | TrueDuality wrote:
              | Legitimately curious: what is lacking from the playthrough
              | and movie descriptions available now? For games, the only
              | things I feel might not be represented well in text are
              | descriptions of the observer/player reactions and maybe
              | details about their specific choices through a game...
              | kind of like turning a choose-your-own-adventure book into
              | a normal book by tracing one path through it?
              | 
              | I can see Q&A on a movie being useful, but I can't think of
              | how descriptions of the overall movie itself are lacking...
              | I'd definitely be worried about spoilers.
        
               | isaacfung wrote:
                | It may provide a way for an AI to learn a completely new
                | topic (a new programming language, stock investing, a
                | sport, a game) in a zero-shot way and immediately apply
                | it, just by feeding a YouTube playlist or Udemy course
                | into the model.
        
       | sp332 wrote:
       | Does this mean we can skip "tokenization" and feed raw bytes into
       | our models now?
        
         | kristjansson wrote:
          | That's been on the table, e.g. [0][1].
          | 
          | But since this work depends on a strong pre-trained model to
          | extend from, I think it's open whether training a byte-level
          | model from scratch with similar tricks would result in the same
          | performance (and whether any organization in the world has the
          | GPUs and chutzpah to do pre-training at these long context
          | lengths...)
          | 
          | [0]: https://arxiv.org/abs/2305.07185
          | [1]: https://arxiv.org/abs/2401.13660
        
           | sp332 wrote:
           | Ok so to get a little weirder, could you use a normal
           | tokenizer for pretraining and then go back to bytes for fine
           | tuning along with or before the length extension?
        
             | Fripplebubby wrote:
             | Do the bytes represent natural language text, or something
             | else? If they do represent text, then it is not so weird to
             | me.
        
             | kristjansson wrote:
              | Tokens (be they subword tokens or bytes) are basically the
              | atoms of input to a LM, so you'd have to figure out how to
              | express the relationship between tokens and bytes to the
              | LM. I guess you could do an incremental thing, e.g.
              | 
              | - make the vocab ~40k subword tokens + 256 tokens (one for
              | each possible byte),
              | 
              | - start training with 100% subword tokens,
              | 
              | - then, after some portion of the training budget has
              | elapsed, randomly replace some tokens with the
              | corresponding byte-tokens,
              | 
              | - then ramp the fraction of replacements up so you're
              | training on 100% byte tokens for the last xx% of training,
              | without ever exceeding an 8k (or whatever) sequence length
              | (see the sketch below),
              | 
              | - then apply the trick from TFA to get xk -> 8xk/64xk bytes
              | of context?
              | 
              | but I'd guess the interesting part of a byte-transformer is
              | multimodality, and we'd need more than a few tricks to get
              | from ^^ to there.
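              | 
              | To make the ramp concrete, something like this (names and
              | numbers are hypothetical, just illustrating the replacement
              | schedule, not a real training loop):
              | 
              |     # ids 0..39999: subword tokens; 40000..40255: byte tokens
              |     import random
              | 
              |     BYTE_OFFSET = 40_000
              | 
              |     def byte_ids(token_text: str) -> list[int]:
              |         # a subword token expands into one byte-token per
              |         # UTF-8 byte of its surface string
              |         return [BYTE_OFFSET + b
              |                 for b in token_text.encode("utf-8")]
              | 
              |     def mix(token_ids, token_texts, step, total, start=0.5):
              |         # replacement fraction ramps 0 -> 1 over the second
              |         # half of training
              |         p = max(0.0, (step / total - start) / (1 - start))
              |         out = []
              |         for tid, text in zip(token_ids, token_texts):
              |             if random.random() < p:
              |                 out.extend(byte_ids(text))  # spell as bytes
              |             else:
              |                 out.append(tid)  # keep the subword id
              |         return out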
        
         | lettergram wrote:
          | It's been possible to skip tokenization for a long time; my
          | team and I did it here -
          | https://github.com/capitalone/DataProfiler
          | 
          | For what it's worth, we were actually working with LSTMs with
          | nearly a billion params back in the 2016-2017 era. Transformers
          | made it far more efficient to train and run models, but
          | ultimately LSTMs are able to achieve similar results, though
          | they're slower and require more training data.
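          | 
          | For concreteness, skipping tokenization at its simplest just
          | means treating each UTF-8 byte as a token (a toy sketch, not
          | the DataProfiler pipeline):
          | 
          |     # Byte-level "tokenization": the vocab is just 256 ids,
          |     # no BPE merges or vocabulary training needed.
          |     text = "LongRoPE extends the context window."
          |     byte_ids = list(text.encode("utf-8"))  # [76, 111, 110, ...]
          |     assert bytes(byte_ids).decode("utf-8") == text
          |     # Trade-off: sequences come out ~4-5x longer than with a
          |     # subword tokenizer, which is why long-context tricks
          |     # matter so much for byte-level models.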
        
       | maytc wrote:
        | Stupid question, but I thought transformers have O(n^2) memory
       | usage. With 2M tokens, won't I need dozens and dozens of GPUs
       | just to run the base LLaMA2 models?
        
         | makerdiety wrote:
          | Maybe processor-level hacks are used? Like, the equivalent of
          | finding the eigenvalues of a matrix.
          | 
          | I'm not as familiar with CPUs as I am with mathematical
          | concepts, and I don't know what the processor bit-hacking
          | tricks are called. But that's maybe the general idea behind
          | data compression for LLMs/transformer models on CPUs, I think.
          | 
          | After all, notice how the improvements only come in multiples
          | of two: 128k tokens and 2048k tokens. There's an
          | implementation-dependent CPU optimization hack going on in
          | there somewhere.
        
         | kristjansson wrote:
         | FlashAttention(2)[0] reduces context-length space complexity to
         | linear. Compute is still O(n^2) in length though, AFAIK, so
         | we'd expect these long sequence lengths to take some time to
         | compute.
         | 
         | I'm a bit out of my depth, but I think ultra-long exact-
         | attention work like this also probably has to answer some
         | questions about where to put the KV-cache before it can be used
         | in practice?
         | 
         | [0]: https://arxiv.org/abs/2205.14135
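          | 
          | Back-of-envelope, assuming a LLaMA-2-7B-ish shape (the numbers
          | are rough and the shape is my assumption, not from the paper):
          | 
          |     # Attention compute scales ~n^2 even when memory is linear.
          |     n_long, n_short = 2_048_000, 4_096     # sequence lengths
          |     layers, heads, head_dim = 32, 32, 128  # assumed 7B shape
          |     d_model = heads * head_dim
          | 
          |     def attn_flops(n):
          |         # QK^T plus attn @ V: ~4 * n^2 * d_model FLOPs per layer
          |         return 4 * n**2 * d_model * layers
          | 
          |     print(f"{attn_flops(n_long) / attn_flops(n_short):,.0f}x")
          |     # -> 250,000x the attention compute at 2M vs 4k tokens
          | 
          |     # KV cache is linear in n: K and V, fp16 (2 bytes each)
          |     kv_bytes = 2 * n_long * layers * d_model * 2
          |     print(f"{kv_bytes / 1e9:.0f} GB")  # ~1 TB of KV cache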
        
       | Imnimo wrote:
       | It's interesting that according to the results tables 5 and 6,
       | adding more context even with LongRoPE makes predictions _worse_
        | on Books3, and only gives improvement on Proof-Pile. What's
       | special about the Proof-Pile dataset? Is there some reason we
       | should expect it to have a lot of dependencies at ranges greater
       | than 100k tokens? Should I be surprised or suspicious that
       | performance is flat going from 65k to 131k, but then has a big
       | jump to 262k?
        
         | Fripplebubby wrote:
          | One thing you may have overlooked: table 5, the proof-pile
          | table, only goes up to a 262k evaluation window. That is,
          | although the model has an extended context window of 2048k
          | under the proposed method, they are not feeding in that many
          | tokens, only 262k, so about 13% of the total possible window.
          | 
          | Why? I think this is because books3 contains, you know, books,
          | including some really long books, while proof-pile contains
          | math papers and math material, which isn't as long.
          | 
          | So overall I think what you're seeing is a general trend of
          | increasing perplexity on windows between 256k and 2048k, which
          | is probably not so surprising, at least when you consider the
          | context of the paper: taking a model pre-trained with a much
          | shorter context window and extending that window with a novel
          | technique. It's hard to adapt a model trained to do one thing
          | into doing another, and that's what they're doing, so in that
          | context it tracks that the longer the context window, the worse
          | the performance.
        
       ___________________________________________________________________
       (page generated 2024-02-22 23:01 UTC)