[HN Gopher] LongRoPE: Extending LLM Context Window Beyond 2M Tokens
___________________________________________________________________
LongRoPE: Extending LLM Context Window Beyond 2M Tokens
Author : nojito
Score : 113 points
Date : 2024-02-22 10:44 UTC (12 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| Delumine wrote:
| When we consider the world in its entirety, the mere existence of
| computer components doesn't signify that we've reached the
| pinnacle of technological advancement. I firmly believe that our
| collective intelligence is deeply embedded in our language,
| whether it's conventional or programming-based. As we witness
| daily advancements in our language models, we're enhancing the
| efficacy of what we can currently regard as nascent artificial
| "intelligence".
|
| This is precisely why programs that are nourished with innovative
| models, backed by substantial computational power, are capable of
| developing reasoning akin to Q*. Once these models start to
| independently foster these advancements and self-improve, we'll
| witness an unprecedented surge in AI development, surpassing our
| current capabilities.
| visarga wrote:
| I agree with the part where you said our collective
| intelligence is embedded in language. Intelligence is a social
| process; none of us is all that great alone. We like to forget
| that and assume it's all in our heads or in the models.
| dmezzetti wrote:
| It's still up for debate as to whether long context windows are
| worth it. It's also not a cheap way to solve a problem.
|
| Clearly hardware vendors love LLMs. But it's just a highly
| inefficient approach for a lot of problems.
| CharlesW wrote:
| > _It's still up for debate as to whether long context windows
| are worth it._
|
| As someone who's run into problems related to short context
| windows quite often, can you explain what "worth it" means?
| Also, when you say "not cheap", what alternative do you have in
| mind?
| esafak wrote:
| Continuous fine tuning so you don't have to pass everything
| in the context with every query would definitely yield a
| better UX.
| littlestymaar wrote:
| Afaik the few-shot learning abilities of LLMs are much
| better than what you can achieve with fine tuning when you
| only have tiny samples.
|
| Of course that's true for real long context, but it's not
| clear if it's going to work with the sparse-context hacks
| intended to keep memory usage low.
|
| Having the ability to keep the same coding session going
| forever, without the LLM forgetting all the time and making
| the same mistakes over and over, would be a game changer.
| nl wrote:
| I'd argue the fine-tuned UX is often worse.
|
| People like RAG-based solutions because you can include
| references in your answer very easily (eg, Perplexity, or
| see the DAnswer "internal search" product launched today on
| HN). That is extremely hard to make work reliably from a
| fine-tuned model.
| esafak wrote:
| I mean UX from the perspective of the person doing the
| search, not the engineer. I don't dispute that fine
| tuning is harder to implement.
| dmezzetti wrote:
| Retrieval Augmented Generation (RAG) with a relevant
| context. Long contexts continue to suffer from the
| lost-in-the-middle syndrome.
| nl wrote:
| Gemini doesn't have the "lost in the middle syndrome"
| pixl97 wrote:
| A cheap way to solve a problem has always been throwing more
| compute power at it.
|
| Figuring out new algorithms is monumentally more difficult
| than adding more hardware, and even when more efficient
| algorithms are found, we throw more hardware at them and get
| a million times more done.
| szundi wrote:
| Billions of dollars spent on hardware kind of make this
| rational to try.
| tkellogg wrote:
| It depends on what problem you're solving. If it's a
| high-frequency request, like a chat response, it's far too
| inefficient. Most web APIs would consider it bad practice to
| read 2MB of data on every request, and it's even worse when you
| consider all the LLM computation. Instead, use RAG and pull
| targeted info out of some sort of low-latency database.
|
| However, caching might be a sweet spot for these multi-modal
| and large context LLMs. Take a bunch of documents and perform
| reasoning tasks to distill the knowledge down into something
| like a knowledge graph, to be used in RAG.
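|
| A rough sketch of the retrieval step I mean (names are
| illustrative, and embed() is a stand-in for whatever embedding
| model you actually use):
|
|   import hashlib
|   import numpy as np
|
|   # Toy in-memory index; in practice this is a vector database.
|   docs = ["refund policy ...", "shipping times ...",
|           "api rate limits ..."]
|
|   def embed(text):
|       # Stand-in embedding: swap in a real embedding model here.
|       h = hashlib.md5(text.encode()).hexdigest()
|       rng = np.random.default_rng(int(h, 16) % (2**32))
|       v = rng.standard_normal(384)
|       return v / np.linalg.norm(v)
|
|   doc_vecs = np.stack([embed(d) for d in docs])
|
|   def retrieve(query, k=2):
|       scores = doc_vecs @ embed(query)   # cosine similarity
|       return [docs[i] for i in np.argsort(-scores)[:k]]
|
|   question = "how long do refunds take?"
|   context = "\n".join(retrieve(question))
|   prompt = f"Context:\n{context}\n\nQuestion: {question}"
|   # The prompt now carries a few KB of targeted text instead of
|   # 2MB of raw documents on every request.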
| wongarsu wrote:
| Having them available is still hugely beneficial, even if they
| end up too expensive to use in most production use-cases.
|
| They would still be valuable for prototyping, where fast
| iteration makes it possible to learn more about the problem you
| are solving and whether it is even worth solving. They are also
| valuable for iteration-and-distillation approaches, where you
| can use the data generated from an expensive model to train a
| cheaper model.
| Fripplebubby wrote:
| The cool thing about this paper is that it's actually a
| (relatively) cheap way to solve this particular problem (an
| LLM with a 2048k window), because they take pre-trained models
| like LLaMA2 and Mistral and extend them to 2048k windows using
| a novel technique, rather than training a model from scratch at
| 2048k tokens, which would be prohibitively expensive for mere
| mortals.
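|
| For intuition, the knob being turned is the RoPE rotation
| angle. A minimal sketch of position interpolation (the scale
| factors below are made up - the paper searches for a separate,
| non-uniform rescale factor per dimension and then fine-tunes):
|
|   import numpy as np
|
|   def rope_angles(positions, dim=128, base=10000.0,
|                   scales=None):
|       # Standard RoPE: one rotation frequency per pair of dims.
|       inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
|       if scales is not None:
|           # Dividing the frequencies maps positions beyond the
|           # original training range back into angles the model
|           # has already seen.
|           inv_freq = inv_freq / scales
|       return np.outer(positions, inv_freq)  # (seq_len, dim/2)
|
|   # A few sample positions out to the 2048k target length.
|   positions = np.array([0, 4096, 65_536, 1_048_576, 2_097_151])
|   # Naive uniform 512x interpolation (4k -> 2048k); LongRoPE
|   # instead finds a different rescale factor for each dimension.
|   scales = np.full(64, 512.0)
|   angles = rope_angles(positions, scales=scales)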
| ogogmad wrote:
| It seems like a very long input to an LLM can act as additional
| training data. A possible future is that LLMs will be used to
| bootstrap AIs that will be "trained" by feeding them a giant
| prompt at the start. Might end the era of "supervised"
| learning, "reinforcement" learning, numerical methods and
| gradient descent -- those techniques are awkward and
| procrustean. Imagine what you could do if you could optimize a
| neural network just by talking to it in English?
|
| So it looks like a VERY good idea. Who gets the gist of what
| I'm saying?
| jonathan-adly wrote:
| I didn't really trust Google's 1m+ context. But I trust this
| paper, and with it 1m+ context will be available everywhere.
|
| I know people complain about hardware and compute resources.
| But this is like complaining about Python resource usage in the
| early 90's. Development complexity & resources are far more
| expensive than chips in the long run. I am personally re-
| organizing my AI organization to move away from complex RAG
| setups and get comfortable with long-context workflows.
|
| Just to be clear - I also think that inference-optimized chips
| are the next frontier - Nvidia GPUs were designed & built in a
| different age than what's going on now.
| foobiekr wrote:
| This is 100% not true at all. Development resources would need
| to dwarf the cost of chips and electricity for your statement
| to be true.
|
| It's really not even true for your Python example.
| sebzim4500 wrote:
| Quite a few people have access to Gemini's 1M context length
| and it does seem to work very well.
|
| E.g., it can produce accurate scene-by-scene descriptions of
| hour-long movies, answer questions about very large codebases,
| etc.
| devinprater wrote:
| This is a dream come true for me. Full movie descriptions!
| Or, uh, more likely for me, full video game playthrough
| descriptions!
| TrueDuality wrote:
| Legitimately curious: what is lacking from the playthrough
| and movie descriptions available now? For games, the only
| things I feel might not be represented well in text are
| descriptions of the observer/player reactions and maybe
| details about their specific choices through a game...
| kind of like turning a choose-your-own-adventure book into
| a normal book by tracing one path through it?
|
| I can see Q&A on a movie being useful, but I can't think of
| how descriptions of the overall movie itself are lacking...
| I'd definitely be worried about spoilers.
| isaacfung wrote:
| It may provide a way for an AI to learn a completely new
| topic (a new programming language, stock investment, a
| sport, a game) in a zero-shot way and immediately apply
| it, just by feeding a YouTube playlist or Udemy course
| into the model.
| sp332 wrote:
| Does this mean we can skip "tokenization" and feed raw bytes into
| our models now?
| kristjansson wrote:
| That's been on the table e.g. [0][1].
|
| But since this work depends on a strong pre-trained model to
| extend from, I think it's an open question whether training a
| byte-level model from scratch with similar tricks would result
| in the same performance (and whether any organization in the
| world has the GPUs and chutzpah to do pre-training at these
| long context lengths...)
|
| [0]: https://arxiv.org/abs/2305.07185
| [1]: https://arxiv.org/abs/2401.13660
| sp332 wrote:
| Ok so to get a little weirder, could you use a normal
| tokenizer for pretraining and then go back to bytes for fine
| tuning along with or before the length extension?
| Fripplebubby wrote:
| Do the bytes represent natural language text, or something
| else? If they do represent text, then it is not so weird to
| me.
| kristjansson wrote:
| Tokens (be they subword tokens or bytes) are basically the
| atoms of input to an LM. So you'd have to figure out how to
| express the relationship between tokens and bytes to the LM. I
| guess you could do an incremental thing, e.g.
|
| - make vocab ~40k tokens + 256 tokens (for each possible
| byte),
|
| - start training with 100% tokens,
|
| - then after some portion of train budget has elapsed,
| randomly replace some tokens with corresponding byte-
| tokens,
|
| - then ramp the fraction of replacements up so you're
| training on 100% byte tokens for the last xx% of training,
| without ever exceeding an 8k (or whatever) sequence length
|
| - then apply the trick from TFA to get xk -> 8xk/64xk bytes
| of context?
|
| but I'd guess the interesting part of a byte-transformer is
| multimodality, and we'd need more than a few tricks to get
| from ^^ to there.
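|
| A rough sketch of that ramp (names are hypothetical; it
| assumes the tokenizer can hand you each token's text so you
| can recover its UTF-8 bytes):
|
|   import random
|
|   BYTE_OFFSET = 40_000  # byte tokens sit after the ~40k vocab
|
|   def byte_ids(token_text):
|       # Map a token's UTF-8 bytes onto the 256 byte-token ids.
|       return [BYTE_OFFSET + b
|               for b in token_text.encode("utf-8")]
|
|   def replace_prob(progress, start=0.5):
|       # 0 until `start` of the train budget, then ramp to 1.
|       if progress < start:
|           return 0.0
|       return min(1.0, (progress - start) / (1.0 - start))
|
|   def mix_sequence(ids, texts, progress, max_len=8192):
|       p = replace_prob(progress)
|       out = []
|       for tid, text in zip(ids, texts):
|           out.extend(byte_ids(text) if random.random() < p
|                      else [tid])
|       return out[:max_len]  # never exceed the 8k train length
|
|   # e.g. 80% through training, most tokens become byte runs:
|   seq = mix_sequence([17, 902], ["Hello", " world"],
|                      progress=0.8)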
| lettergram wrote:
| It's been possible to skip tokenization for a long time; my
| team and I did it here -
| https://github.com/capitalone/DataProfiler
|
| For what it's worth, we were actually working with LSTMs with
| nearly a billion params back in the 2016-2017 era. Transformers
| made it far more effective to train and execute, but ultimately
| LSTMs are able to achieve similar results, though they're
| slower & require more training data.
| maytc wrote:
| Stupid question but I thought transformers have an O(n^2) memory
| usage. With 2M tokens, won't I need dozens and dozens of GPUs
| just to run the base LLaMA2 models?
| makerdiety wrote:
| Maybe computer processor hacks are used? Like, the equivalent
| of finding the eigenvalues of a matrix.
|
| I'm not as familiar with CPUs as I am with mathematical
| concepts, and I don't know what the processor bit-hacking
| tricks are called. But that's maybe the general idea behind
| data compression for LLMs/transformer models on CPUs, I think.
|
| After all, notice how the data compression improvements are
| only multiples of two: 128k tokens and 2048k tokens. There's
| an implementation-dependent CPU optimization hack going on in
| there somewhere.
| kristjansson wrote:
| FlashAttention(2)[0] reduces context-length space complexity to
| linear. Compute is still O(n^2) in length though, AFAIK, so
| we'd expect these long sequence lengths to take some time to
| compute.
|
| I'm a bit out of my depth, but I think ultra-long exact-
| attention work like this also probably has to answer some
| questions about where to put the KV-cache before it can be used
| in practice?
|
| [0]: https://arxiv.org/abs/2205.14135
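|
| Back-of-the-envelope for the KV-cache question, assuming a
| LLaMA2-7B-shaped model (32 layers, 32 heads, head dim 128,
| fp16, no GQA or quantization):
|
|   layers, heads, head_dim = 32, 32, 128
|   bytes_per_value = 2                    # fp16
|   seq_len = 2048 * 1024                  # "2048k" tokens
|
|   # One K and one V vector per layer, head, and position.
|   kv = 2 * layers * heads * head_dim * bytes_per_value * seq_len
|   print(kv / 2**40)        # ~1.0 TiB of KV-cache
|   print(kv / 2**30 / 80)   # ~13 x 80GB GPUs just to hold it
|
| So even with linear-memory attention kernels, simply holding
| the cache for a 2M-token context is a multi-GPU (or offloading)
| problem.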
| Imnimo wrote:
| It's interesting that, according to the results in Tables 5 and
| 6, adding more context even with LongRoPE makes predictions
| _worse_ on Books3, and only gives an improvement on Proof-Pile.
| What's special about the Proof-Pile dataset? Is there some
| reason we should expect it to have a lot of dependencies at
| ranges greater than 100k tokens? Should I be surprised or
| suspicious that performance is flat going from 65k to 131k, but
| then has a big jump to 262k?
| Fripplebubby wrote:
| One thing you may have overlooked - table 5, the proof-pile
| table, only goes up to a 262k evaluation window (meaning -
| although the model has an extended context window of 2048k
| according to the method proposed, they are not feeding in that
| many tokens, only 262k tokens - so, about 13% of the total
| possible window).
|
| Why? I think this is because Books3 contains, you know, books -
| including some really long books - while Proof-Pile contains
| math papers and math stuff, which aren't as long.
|
| So overall I think what you're seeing is a general trend of
| increasing perplexity on windows above 256k, between
| 256k-2048k, which is probably not so surprising - or at least,
| not so surprising when you consider the context of the paper,
| which is taking a model pre-trained with a much shorter context
| window and extending the context window using a novel
| technique. It's hard to adapt a model trained to do one thing
| into doing another thing, and that's what they're doing, so in
| that context, it tracks that the longer the context window, the
| worse the performance.
___________________________________________________________________
(page generated 2024-02-22 23:01 UTC)