[HN Gopher] Run llama3 locally with 1M token context
       ___________________________________________________________________
        
       Run llama3 locally with 1M token context
        
       Author : mritchie712
       Score  : 90 points
       Date   : 2024-04-30 20:15 UTC (2 hours ago)
        
 (HTM) web link (ollama.com)
 (TXT) w3m dump (ollama.com)
        
       | superkuh wrote:
       | > Note: using a 256k context window requires at least 64GB of
       | memory. Using a 1M+ context window requires significantly more
       | (100s of GBs).
       | 
        | The RAM requirements at 200k seem to be using an assumed 16
        | bits? With a 4-bit quantization it's more like ~12GB of
        | kv-cache for a 200k sequence length. Unless I'm missing
        | something?
        
         | datadrivenangel wrote:
          | How much does that quantization affect performance, though?
        
         | sebzim4500 wrote:
         | Are people quantizing the activations? I was under the
         | impression that people generally quantize the parameters but
         | leave the activations unchanged.
        
           | brrrrrm wrote:
           | You can quantize KV caches
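         
        A toy illustration of what quantizing a KV cache means (not how
        any particular runtime implements it): symmetric 4-bit
        quantization with one scale per cached vector, round-tripped to
        show the reconstruction error.
         
            import numpy as np
            
            def quantize_kv_4bit(kv):
                # One float scale per row, integer codes in [-8, 7].
                # Real runtimes pack two 4-bit codes per byte; int8 is
                # used here only for clarity.
                scale = np.abs(kv).max(axis=-1, keepdims=True) / 7.0 + 1e-8
                codes = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)
                return codes, scale
            
            def dequantize_kv_4bit(codes, scale):
                return codes.astype(np.float32) * scale
            
            kv = np.random.randn(16, 128).astype(np.float32)  # toy K or V slab
            codes, scale = quantize_kv_4bit(kv)
            err = np.abs(kv - dequantize_kv_4bit(codes, scale)).max()
            print(f"max reconstruction error: {err:.3f}")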
        
       | consumer451 wrote:
        | Pardon my ignorance; I'm a dilettante and was just wondering
        | about this earlier today.
        | 
        | How much more active intelligence will a far larger context
        | length unlock in LLMs?
       | 
       | Would virtually unlimited context move us much closer to an LLM
       | where training is continuous? Is that a bit of a holy grail? I
       | assume not, but would love to know why.
        
         | mritchie712 wrote:
          | Context length doesn't unlock intelligence, just more
          | information, e.g. adding info that wasn't part of training
          | (or wasn't heavily weighted in training).
        
           | consumer451 wrote:
            | Right, I should have been clearer about what I meant by
            | "active intelligence." But as one use of this... would a
            | huge context, say 1 to 10 billion tokens, used as a
            | system prompt with "just" 32k left for the user, allow a
            | model to be updated every day, in between actual training
            | runs? (This is in the future, where training takes only a
            | month, or much less.)
           | 
           | I guess part of what I really don't understand is how context
           | tokens compare to training weights, as far as value to the
           | final response. Would a giant context window muddle the value
           | of weights?
           | 
            | (Maybe what I am missing is the human feedback on the
           | training weights? If the giant system prompt I am imagining
           | is garbage, then that would be bad.)
        
           | farts_mckensy wrote:
           | I mean, working memory is an aspect of intelligence.
        
         | a_wild_dandan wrote:
         | > How much unlocking of a more active intelligence from LLMs,
         | will far larger context length provide?
         | 
         | Remarkably, foundation models can learn new tasks via a few
         | examples (called few-shot learning). LLM answers also
         | significantly improve when given relevant supplemental
         | information. Boosting context length: grows its "working
         | memory"; provides richer knowledge to inform its reasoning; and
         | expands its capacity for new tasks, given germane examples.
         | 
         | > Would virtually unlimited context move us much closer to an
         | LLM where training is continuous?
         | 
         | No. You can already continually train a context-limited LLM.
         | Virtually unlimited context window schemes also exist. Training
          | is a separate concept from context length. In pre-training, we
         | work backward from a model's incorrect answer, tweaking its
         | parameters to more likely say the correct thing next time.
          | Fine-tuning is the same, but focused on specific tasks
         | important to the user. After training, when _running_ the model
         | (called inference), you can change the context length to suit
         | your needs and tradeoffs.
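         
        Concretely, with a local Ollama server the context window is
        just an inference-time option. A minimal sketch, assuming the
        model tag from the linked page is "llama3-gradient" and that a
        256k window is wanted (substitute whatever the page actually
        lists; larger num_ctx values need far more memory):
         
            import json
            import urllib.request
            
            # Ask a locally running Ollama server for a completion with an
            # enlarged context window. Assumes the model is already pulled;
            # the tag "llama3-gradient" is an assumption about the linked page.
            payload = {
                "model": "llama3-gradient",
                "prompt": "Summarize the pasted document in three bullet points.",
                "stream": False,
                "options": {"num_ctx": 256000},  # memory vs. context trade-off
            }
            req = urllib.request.Request(
                "http://localhost:11434/api/generate",
                data=json.dumps(payload).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                print(json.loads(resp.read())["response"])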
        
           | consumer451 wrote:
           | Thank you. If you don't mind...
           | 
           | > tweaking its parameters to more likely say the correct
           | thing next time.
           | 
           | Is this entirely, or just partially done via human feedback
            | on models like GPT-4 and Llama 3, for example?
        
       | Workaccount2 wrote:
       | I've been using Gemini 1.5 with 1M tokens and it's a totally
       | different game. You can essentially "train" the model on whatever
        | hyper-specific stuff you have on hand.
       | 
        | Don't know how to work with my test system? No problem, here
        | is an 800-page reference manual. Now you know.
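         
        A rough sketch of the budgeting behind that workflow: checking
        whether a manual plausibly fits in a 1M-token window before
        pasting it into the prompt. The ~4 characters per token ratio
        is a rule of thumb, not a tokenizer count, and the file name is
        hypothetical.
         
            # Estimate whether a large reference manual fits in a 1M-token
            # context window. ~4 chars/token is a rough rule of thumb.
            def rough_token_count(text: str) -> int:
                return len(text) // 4
            
            with open("reference_manual.txt") as f:  # hypothetical manual dump
                manual = f.read()
            
            budget = 1_000_000
            used = rough_token_count(manual)
            print(f"~{used:,} of {budget:,} tokens used; "
                  f"{budget - used:,} left for the conversation")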
        
       | haolez wrote:
       | Can we expect that the 1M version has the same level of
       | intelligence as the vanilla version? Across the whole 1M tokens
        | without degradation? What are the trade-offs?
        
       | mehulashah wrote:
        | Our results on long contexts are mixed. True -- one can put a
       | lot of stuff in there and the thing groks a bunch of it. Being
       | able to synthesize specific answers from "long tail" facts in the
       | context can be difficult. YMMV.
        
       | benreesman wrote:
       | It's just so clear the math is wrong on these things.
       | 
       | You've got an apparent contradiction: SGD (AdamW at 1e-6 give or
       | take) works. So we've got extremely abundant local maxima up to
        | epsilon_0, but it always lands in the same place, so there are
       | abundant "well-studied" minima, likewise symmetrical up to
       | epsilon_1, both of which are roughly: "start debugging if above
       | we can tell".
       | 
       | The maxima have meaningful curvature tensors at or adjacent to
       | them: AdamW works.
       | 
       | But joker in the deck: control vectors work. So you're in a
       | quasi-Euclidean region.
       | 
       | In fact all the useful regions are about exactly the same, the
       | weights are actually complex valued, everyone knows this part....
       | 
       | The conserved quantity up to let's call it phi is compression
       | ratio.
       | 
        | Maybe in a year or two, when Altman is in jail and Mme. Su
        | gives George cards that work, we'll crunch numbers more
        | interesting than how much a googol FMA units cost, and Emmy
        | Noether will get some damned credit for knowing this a
        | century ago.
        
       ___________________________________________________________________
       (page generated 2024-04-30 23:01 UTC)