[HN Gopher] Google Gemini: Context Caching
___________________________________________________________________
Google Gemini: Context Caching
Author : tosh
Score : 125 points
Date : 2024-05-15 07:56 UTC (1 day ago)
(HTM) web link (ai.google.dev)
(TXT) w3m dump (ai.google.dev)
| mritchie712 wrote:
| I really want this (in the OpenAI API). Does Google not care about
| the reputation it's getting with this shit:
|
| > We'll be launching context caching soon, along with technical
| documentation and SDK support.
|
| It's not live, and God knows when it will be.
| nextworddev wrote:
| They have adopted the strategy of mainly releasing to and
| iterating with large enterprise customers, because they (rightly)
| realized that GA'ing to every developer makes no monetary sense
| unless they are trying to learn something from that launch or need
| to do so for competitive reasons.
| mritchie712 wrote:
| fair point
| refulgentis wrote:
| Presumably, this would be a competitive reason, no? It would
| further the cost savings they GA'd with Gemini Flash, and
| it's a differentiator from every other provider.
| aiauthoritydev wrote:
| This actually is a smarter strategy IMO.
|
| Throwing your hat in the ring forces your competitors to think
| about it and makes them work too. It also gives you a first-mover
| advantage by making people think you did it first.
|
| A lot of AI tools make more sense for enterprise customers than
| for people like us. People like us are good only for hype, not for
| making money.
| flipbrad wrote:
| Might help show prior art to defeat software patents too.
| hidelooktropic wrote:
| Isn't this what the Assistants API is meant for? Honestly not
| sure; I haven't used it before, but their documentation seems to
| suggest you can set up the assistant to already have this context,
| then just send API commands to it without said context.
| swyx wrote:
| But they charge you for the full context every time, at least
| for now. The latency would suggest that no caching is happening.
| sebzim4500 wrote:
| Makes sense.
|
| It isn't in the list of suggested use cases, but I wonder if this
| can be used to speed up tree-of-thoughts or similar prompt/search
| techniques.
|
| It could also speed up restricted generation, e.g. when you force
| the model to output valid JSON.
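|
| A rough sketch of that branching idea, using Hugging Face
| transformers as a stand-in (the model name, prompts, and helper
| below are placeholders for illustration, not Gemini's API):
|
|   import copy
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   model_name = "gpt2"  # placeholder; any causal LM works
|   tok = AutoTokenizer.from_pretrained(model_name)
|   model = AutoModelForCausalLM.from_pretrained(model_name).eval()
|
|   # Run the long shared prefix exactly once and keep its KV state.
|   prefix = tok("A long shared context ...", return_tensors="pt")
|   with torch.no_grad():
|       prefix_out = model(prefix.input_ids, use_cache=True)
|
|   def continue_branch(branch_text, max_new_tokens=20):
|       # Copy the cache so branches don't clobber each other's state.
|       cache = copy.deepcopy(prefix_out.past_key_values)
|       ids = tok(branch_text, return_tensors="pt").input_ids
|       new_tokens = []
|       with torch.no_grad():
|           out = model(ids, past_key_values=cache, use_cache=True)
|           for _ in range(max_new_tokens):
|               next_id = out.logits[:, -1].argmax(-1, keepdim=True)
|               new_tokens.append(next_id)
|               out = model(next_id, past_key_values=out.past_key_values,
|                           use_cache=True)
|       return tok.decode(torch.cat(new_tokens, dim=-1)[0])
|
|   # Each branch only pays for its own tokens; the shared prefix is
|   # never re-run, which is the saving tree-of-thoughts search wants.
|   print(continue_branch(" Branch A:"))
|   print(continue_branch(" Branch B:"))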
| motoboi wrote:
| I suppose the caching is at a layer before the LLM.
| verdverm wrote:
| Another advantage may be to let the big library providers
| (LangChain / LlamaIndex) implement this before releasing to
| developers at large.
| _pdp_ wrote:
| We envisioned a system like this over a year ago, but since we
| lacked technical capabilities in this area, we couldn't solve it.
| To me, it is clear that a lot of information is being transmitted
| needlessly. There is no need to send an entire book every single
| time an interaction occurs. Instead, you can start from a
| checkpoint, remap from memory, and continue from there. Perhaps
| none of the current LLM service providers have an architecture
| that allows this, but it seems like something that should be
| possible and is likely to emerge in the near future.
| gradys wrote:
| The size of the cached internal state of the network processing
| the book is much larger than the size of the book. The resource
| that is preserved with caching is the compute required to
| recreate that state.
| dfgtyu65r wrote:
| Sure, but a direct forward pass of the book would surely require
| more compute than simply loading and setting the hidden state?
|
| The second doesn't require any matrix operations; it's just
| setting some values.
| nancarrow wrote:
| "some" is doing a lot of lifting. # of tokens * # of layers
| * head dimension * # of heads * 2 (K+V vectors) * 4-16bits
| (depending on quantization)
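|
| Plugging illustrative numbers into that formula (the dimensions
| below are assumptions for a generic 7B-class model, not anything
| stated in the thread):
|
|   n_tokens  = 100_000  # roughly a book-length context
|   n_layers  = 32
|   n_heads   = 32       # full MHA; GQA would shrink this a lot
|   head_dim  = 128
|   bytes_per = 2        # fp16/bf16 per value
|
|   # tokens * layers * heads * head_dim * 2 (K+V) * bytes per value
|   cache_bytes = n_tokens * n_layers * n_heads * head_dim * 2 * bytes_per
|   print(cache_bytes / 1e9, "GB")  # ~52 GB, vs. well under 1 MB of text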
| gradys wrote:
| I don't know of any public details on how they implement
| Context Caching, but that is presumably exactly what they
| are doing. Just caching the text would be a minimal
| savings.
| rfoo wrote:
| > it's just setting some values
|
| But it may very well be slower than just recomputing it, at
| least for ordinary MHA and even GQA.
|
| So it takes either model-architecture voodoo that significantly
| reduces KV cache size (while keeping roughly the same compute
| cost), or some really careful implementation that moves the KV
| cache of upcoming requests to devices in the background [0].
|
| [0] My back-of-envelope calc shows that even then it still does
| not make sense for, say, Llama 3 70B on H100s. Time to stare at
| the TPU spec harder and try to make sense of it, I guess.
| sshumaker wrote:
| It depends on how large the input prompt (previous context) is.
| Also, if you can keep the cache on the GPU with an LRU mechanism,
| it's very efficient for certain workloads.
|
| You can also design an API optimized for batch workloads (say, the
| same core prompt with different data for instruct-style
| reasoning); that can result in large savings in those scenarios.
| ethbr1 wrote:
| If you can pipeline upcoming requests and tie state to a
| specific request, doesn't that allow you to change how
| you design physical memory? (at least for inference)
|
| Stupid question, but why wouldn't {extremely large slow-
| write, fast-read memory} + {smaller, very fast-write
| memory} be a feasible hardware architecture?
|
| If you know many, many cycles ahead what you'll need to
| have loaded at a specific time.
|
| Or hell, maybe it's time to go back to memory bank
| switching.
| objektif wrote:
| But isn't the information somehow cached when you start a new
| chat and build context with, say, GPT-4? If the cache were as
| large as you say, so many chat sessions in parallel would not be
| possible.
| dosinga wrote:
| That's not my understanding. We can't be sure how OpenAI does
| things internally, but adding messages to a conversation in the
| API means just rerunning the whole history through the model as
| the prompt every time.
| jsemrau wrote:
| >The size of the cached internal state of the network
| processing the book is much larger than the size of the book
|
| It's funny that people sometimes consider LLMs to be compression
| engines, even though a lot of information gets lost in each
| direction (through the neural net).
| shwaj wrote:
| Why is that funny? Sometimes compression is lossy, like
| JPEG and H.265
| pornel wrote:
| And the internal state of a JPEG decoder can be an order
| of magnitude larger than the JPEG file (especially
| progressive JPEG that can't stream its output).
| okdood64 wrote:
| I don't lose anything with gzip or rar.
| giancarlostoro wrote:
| And just as fast? The issue here is how you do these things both
| accurately and at a reasonable speed.
| fwip wrote:
| You can make any lossy compression scheme into a lossless
| scheme by appending the diff between the original and the
| compressed. In many cases, this still results in a size
| savings over the original.
|
| You can think of this as a more detailed form of "I
| before E, except after C, except for species and science
| and..." Or, if you prefer, as continued terms of a
| Taylor-series expansion. The more terms you add, the more
| closely you approximate the original.
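|
| A toy version of that trick, with coarse quantization standing in
| for the lossy codec (purely illustrative):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   original = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
|
|   # Lossy stage: keep only the top 4 bits of each value.
|   lossy = (original // 16) * 16
|
|   # Residual: exactly what the lossy stage threw away (small
|   # values, so it compresses well on its own).
|   residual = original - lossy
|
|   # Lossless reconstruction: lossy part + residual is bit-exact.
|   assert np.array_equal(lossy + residual, original)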
| sshumaker wrote:
| They are almost certainly doing this internally for their own
| chat products.
|
| The simple version of this just involves saving off the KV cache
| in the attention layers and restoring it instead of recomputing.
| It only requires small changes to inference and the attention
| layers.
|
| The main challenge is being able to do this at scale, e.g. dump
| the cached state out of GPU memory, persist it, and have a system
| to rapidly reload it as needed (or just regenerate it).
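|
| A minimal sketch of the bookkeeping that implies (names are made
| up; the stored value would be the per-prefix KV state, evicted
| least-recently-used when memory fills up):
|
|   import hashlib
|   from collections import OrderedDict
|
|   class PrefixCache:
|       """Keep recently used prefix states; recompute on a miss."""
|
|       def __init__(self, max_entries=8):
|           self.max_entries = max_entries
|           self._entries = OrderedDict()  # prefix hash -> KV state
|
|       @staticmethod
|       def _key(prefix_text):
|           return hashlib.sha256(prefix_text.encode()).hexdigest()
|
|       def get(self, prefix_text):
|           key = self._key(prefix_text)
|           if key in self._entries:
|               self._entries.move_to_end(key)  # mark as recently used
|               return self._entries[key]
|           return None  # miss: caller re-runs the prefix, then put()
|
|       def put(self, prefix_text, kv_state):
|           key = self._key(prefix_text)
|           self._entries[key] = kv_state
|           self._entries.move_to_end(key)
|           if len(self._entries) > self.max_entries:
|               self._entries.popitem(last=False)  # evict the oldest
|
| Persisting evicted entries to host memory or disk and reloading
| them quickly is the "at scale" part.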
| ethbr1 wrote:
| 2024 is the year of serverless LLM?
| twobitshifter wrote:
| I remember hearing that the reason Claude is expensive is that
| every interaction makes it reread the entire conversation.
| jonplackett wrote:
| This is the case with all LLMs as far as I know.
|
| With the ChatGPT API you just send everything up to that point
| plus the new input to get the new output.
|
| I think the benefit for the service is that it's stateless.
| They just have requests in and out and don't have to worry
| about anything else.
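|
| For example, with the OpenAI Python SDK the client holds the whole
| conversation and resends it on every call (model name and prompts
| here are just placeholders):
|
|   from openai import OpenAI
|
|   client = OpenAI()
|   history = [{"role": "system", "content": "You are a helpful assistant."}]
|
|   def ask(user_message):
|       # The entire conversation so far goes over the wire each time;
|       # the server keeps no state between requests.
|       history.append({"role": "user", "content": user_message})
|       resp = client.chat.completions.create(model="gpt-4o", messages=history)
|       answer = resp.choices[0].message.content
|       history.append({"role": "assistant", "content": answer})
|       return answer
|
|   ask("Summarise this document: ...")  # pays for the document
|   ask("Now list its main points.")     # pays for the document again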
| conradev wrote:
| To me, context caching is only a subset of what is possible
| with full control over the model. I consider this a more
| complete list: https://github.com/microsoft/aici?tab=readme-ov-
| file#flexibi...
|
| Context caching only gets you "forking generation into multiple
| branches" (i.e. sharing work between multiple generations)
| sshumaker wrote:
| This is a pretty standard technique if you're running the models
| yourself; e.g., ChatGPT almost certainly does this.
|
| There's even more sophisticated work in this domain that allows
| 'template'-style partial caching:
| https://arxiv.org/abs/2311.04934
| lolpanda wrote:
| i think llama.cpp has context caching with "--prompt-cache" but
| it will result in a very large cache file. i guess it's also very
| expensive for any inference api provider to support caching as
| they have to persist the file and load/unload it each time.
___________________________________________________________________
(page generated 2024-05-16 23:00 UTC)