[HN Gopher] Run llama3 locally with 1M token context
___________________________________________________________________
Run llama3 locally with 1M token context
Author : mritchie712
Score : 90 points
Date : 2024-04-30 20:15 UTC (2 hours ago)
(HTM) web link (ollama.com)
(TXT) w3m dump (ollama.com)
| superkuh wrote:
| > Note: using a 256k context window requires at least 64GB of
| memory. Using a 1M+ context window requires significantly more
| (100s of GBs).
|
| The RAM requirements at 200k seem to assume 16-bit precision?
| With 4-bit quantization it's more like ~12GB of KV cache for a
| 200k sequence length. Unless I'm missing something?
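|
| For reference, KV-cache size scales linearly in sequence length
| and bytes per element. A minimal sketch of the arithmetic,
| assuming Llama 3 8B's config (32 layers, 8 KV heads via GQA,
| head dim 128 -- assumptions; actual numbers vary by model and
| runtime):
|
|     # kv cache = 2 (K and V) * layers * kv_heads * head_dim
|     #            * seq_len * bytes_per_element
|     def kv_cache_gb(seq_len, bytes_per_elem,
|                     layers=32, kv_heads=8, head_dim=128):
|         elems = 2 * layers * kv_heads * head_dim * seq_len
|         return elems * bytes_per_elem / 1e9
|
|     print(kv_cache_gb(200_000, 2.0))  # fp16:  ~26 GB
|     print(kv_cache_gb(200_000, 0.5))  # 4-bit: ~6.6 GB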
| datadrivenangel wrote:
| How much does this do to performance though?
| sebzim4500 wrote:
| Are people quantizing the activations? I was under the
| impression that people generally quantize the parameters but
| leave the activations unchanged.
| brrrrrm wrote:
| You can quantize KV caches
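|
| E.g. a minimal absmax int8 sketch of the idea (illustrative
| only, not what any particular runtime actually does):
|
|     import numpy as np
|
|     def quantize_q8(kv):
|         # per-row absmax scaling into int8
|         scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
|         return np.round(kv / scale).astype(np.int8), scale
|
|     def dequantize_q8(q, scale):
|         # approximate reconstruction; the error is the tradeoff
|         return q.astype(np.float32) * scale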
| consumer451 wrote:
| Pardon my ignorance, but I am a dilettante. I was just wondering
| about this earlier today.
|
| How much unlocking of a more active intelligence from LLMs will
| far larger context length provide?
|
| Would virtually unlimited context move us much closer to an LLM
| where training is continuous? Is that a bit of a holy grail? I
| assume not, but would love to know why.
| mritchie712 wrote:
| context length doesn't unlock intelligence, just more
| information. e.g. adding info that wasn't part of training (or
| wasn't heavily weighted in training).
| consumer451 wrote:
| Right, I should have used clearer words than "active
| intelligence." But as one use of this... would unlimited, say
| 1 to 10 billion tokens of context, used as a system prompt,
| with "just" 32k left for the user, allow a model to be
| updated every day, in between actual training? (This is in
| the future, where training only takes a month, or much less.)
|
| I guess part of what I really don't understand is how context
| tokens compare to training weights, as far as value to the
| final response. Would a giant context window muddle the value
| of weights?
|
| (Maybe what I am missing is the human-feedback on the
| training weights? If the giant system prompt I am imagining
| is garbage, then that would be bad.)
| farts_mckensy wrote:
| I mean, working memory is an aspect of intelligence.
| a_wild_dandan wrote:
| > How much unlocking of a more active intelligence from LLMs
| will far larger context length provide?
|
| Remarkably, foundation models can learn new tasks via a few
| examples (called few-shot learning). LLM answers also
| significantly improve when given relevant supplemental
| information. Boosting context length: grows its "working
| memory"; provides richer knowledge to inform its reasoning; and
| expands its capacity for new tasks, given germane examples.
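|
| For instance, few-shot prompting a local model might look like
| this minimal sketch (the examples are made up; ollama's default
| local endpoint is assumed):
|
|     import requests
|
|     prompt = """Translate to French.
|     sea -> mer
|     sky -> ciel
|     bread ->"""
|
|     # ollama serves a local HTTP API on port 11434 by default
|     r = requests.post(
|         "http://localhost:11434/api/generate",
|         json={"model": "llama3", "prompt": prompt,
|               "stream": False},
|     )
|     print(r.json()["response"])  # ideally: " pain"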
|
| > Would virtually unlimited context move us much closer to an
| LLM where training is continuous?
|
| No. You can already continually train a context-limited LLM.
| Virtually unlimited context window schemes also exist. Training
| is a separate concept from context length. In pre-training, we
| work backward from a model's incorrect answer, tweaking its
| parameters so it's more likely to say the correct thing next time.
| Fine-tuning is the same, but focusing on specific tasks
| important to the user. After training, when _running_ the model
| (called inference), you can change the context length to suit
| your needs and tradeoffs.
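|
| A toy sketch of one such parameter-tweaking step (a made-up
| stand-in model, not a real LLM):
|
|     import torch
|     import torch.nn.functional as F
|
|     model = torch.nn.Linear(64, 1000)       # stand-in "LLM"
|     opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
|
|     hidden = torch.randn(8, 64)             # fake activations
|     target = torch.randint(0, 1000, (8,))   # correct next tokens
|
|     loss = F.cross_entropy(model(hidden), target)
|     loss.backward()   # work backward from the wrong answer...
|     opt.step()        # ...and tweak the parameters
|     opt.zero_grad()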
| consumer451 wrote:
| Thank you. If you don't mind...
|
| > tweaking its parameters to more likely say the correct
| thing next time.
|
| Is this entirely, or just partially done via human feedback
| on models like GPT-4 and LLama-3, for example?
| Workaccount2 wrote:
| I've been using Gemini 1.5 with 1M tokens and it's a totally
| different game. You can essentially "train" the model on whatever
| hyper-specific stuff you have on hand.
|
| Don't know how to work with my test system? No problem, here is
| an 800-page reference manual. Now you know.
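|
| (The local-llama3 equivalent might look something like this
| sketch; the file name, question, and settings are placeholders,
| using ollama's Python client:)
|
|     import ollama  # pip install ollama
|
|     manual = open("reference_manual.txt").read()
|     reply = ollama.generate(
|         model="llama3-gradient",      # the 1M-context variant
|         prompt=manual + "\n\nHow do I reset the test system?",
|         options={"num_ctx": 256000},  # widen the context window
|     )
|     print(reply["response"])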
| haolez wrote:
| Can we expect that the 1M version has the same level of
| intelligence as the vanilla version? Across the whole 1M tokens
| without degradation? What are the tradeoffs?
| mehulashah wrote:
| Our results on long contexts are mixed. True -- one can put a
| lot of stuff in there and the thing groks a bunch of it. Being
| able to synthesize specific answers from "long tail" facts in the
| context can be difficult. YMMV.
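|
| (The standard check is a "needle in a haystack" test: bury a
| fact at varying depths in filler and ask for it back. A minimal
| sketch, with made-up values:)
|
|     import ollama
|
|     filler = "The sky is blue. " * 10_000
|     needle = "The magic number is 7481. "
|     prompt = (filler + needle + filler
|               + "\nWhat is the magic number?")
|     out = ollama.generate(
|         model="llama3-gradient",
|         prompt=prompt,
|         options={"num_ctx": 256000},
|     )
|     print(out["response"])  # hopefully mentions 7481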
| benreesman wrote:
| It's just so clear the math is wrong on these things.
|
| You've got an apparent contradiction: SGD (AdamW at 1e-6 give or
| take) works. So we've got extremely abundant local maxima up to
| epsilon_0, but it always lands in the same place, so there are
| abundant "well-studied" minima, likewise symmetrical up to
| epsilon_1, both of which are roughly: "start debugging if above
| we can tell".
|
| The maxima have meaningful curvature tensors at or adjacent to
| them: AdamW works.
|
| But joker in the deck: control vectors work. So you're in a
| quasi-Euclidean region.
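|
| (Control vectors being the steering trick of adding a fixed
| direction to the hidden state at inference time -- which only
| makes sense if the local geometry is roughly linear. A sketch,
| shapes made up:)
|
|     import numpy as np
|
|     hidden = np.random.randn(4096)   # residual-stream state
|     steer = np.random.randn(4096)    # learned control vector
|     hidden = hidden + 2.0 * steer    # nudge along one direction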
|
| In fact all the useful regions are about exactly the same, the
| weights are actually complex valued, everyone knows this part....
|
| The conserved quantity up to let's call it phi is compression
| ratio.
|
| Maybe in a year or two, when Altman is in jail and Mme. Su gives
| George cards that work, we'll crunch numbers more interesting
| than how much a googol FMA units cost, and Emmy Noether gets
| some damned credit for knowing this a century ago.
___________________________________________________________________
(page generated 2024-04-30 23:01 UTC)