[HN Gopher] Google Gemini: Context Caching
       ___________________________________________________________________
        
       Google Gemini: Context Caching
        
       Author : tosh
       Score  : 125 points
        Date   : 2024-05-15 07:56 UTC (1 day ago)
        
 (HTM) web link (ai.google.dev)
 (TXT) w3m dump (ai.google.dev)
        
       | mritchie712 wrote:
        | I really want this (in the OpenAI API). Does Google not care
        | about the reputation it's getting with this shit:
        | 
        | > We'll be launching context caching soon, along with technical
        | documentation and SDK support.
        | 
        | It's not live, and god knows when it will be.
        
         | nextworddev wrote:
          | They have adopted a strategy of mainly releasing to, and
          | iterating with, large enterprise customers, because they
          | (rightly) realized that GA'ing to every developer makes no
          | monetary sense unless they are trying to learn something from
          | that launch or need to do so for competitive reasons.
        
           | mritchie712 wrote:
           | fair point
        
           | refulgentis wrote:
           | Presumably, this would be a competitive reason, no? It would
           | further the cost savings they GA'd with Gemini Flash, and
           | it's a differentiator from every other provider.
        
         | aiauthoritydev wrote:
          | This actually is a smarter strategy IMO.
          | 
          | Throwing your hat in the ring forces your competitors to think
          | about it and makes them work too. It also gives you a first-
          | mover advantage by making people think you did it first.
          | 
          | A lot of AI tools make more sense for enterprise customers
          | than for people like us. People like us are only good for
          | hype, not for making money.
        
           | flipbrad wrote:
           | Might help show prior art to defeat software patents too.
        
         | hidelooktropic wrote:
          | Isn't this what the Assistants API is meant for? Honestly not
          | sure; I haven't used it before, but their documentation seems
          | to suggest you can set up the assistant to already have this
          | context, then just send API calls to it without said context.
        
           | swyx wrote:
            | But they charge you for the full context every time, at
            | least for now. The latency would suggest that no caching is
            | happening.
        
       | sebzim4500 wrote:
       | Makes sense.
       | 
        | It isn't in the list of suggested use cases, but I wonder if this
        | can be used to speed up tree-of-thoughts or similar prompt/search
        | techniques.
        | 
        | It could also speed up restricted generation, e.g. when you force
        | the model to output valid JSON.
        
         | motoboi wrote:
         | I suppose the caching is at a layer before the LLM.
        
         | verdverm wrote:
          | Another advantage may be to let the big library providers
          | (LangChain | LlamaIndex) implement this before releasing to
          | developers at large.
        
       | _pdp_ wrote:
       | We envisioned a system like this over a year ago, but since we
       | lacked technical capabilities in this area, we couldn't solve it.
       | To me, it is clear that a lot of information is being transmitted
       | needlessly. There is no need to send an entire book every single
       | time an interaction occurs. Instead, you can start from a
       | checkpoint, remap from memory, and continue from there. Perhaps
       | none of the current LLM service providers have an architecture
       | that allows this, but it seems like something that should be
       | possible and is likely to emerge in the near future.
        
         | gradys wrote:
         | The size of the cached internal state of the network processing
         | the book is much larger than the size of the book. The resource
         | that is preserved with caching is the compute required to
         | recreate that state.
        
           | dfgtyu65r wrote:
            | Sure, but a direct forward pass over the book would surely
            | require more compute than simply loading and setting the
            | hidden state?
            | 
            | The second doesn't require any matrix operations; it's just
            | setting some values.
        
             | nancarrow wrote:
             | "some" is doing a lot of lifting. # of tokens * # of layers
             | * head dimension * # of heads * 2 (K+V vectors) * 4-16bits
             | (depending on quantization)
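              | 
              | Plugging in illustrative numbers, just to show the scale
              | (these dimensions are assumptions, roughly a 7B-class
              | model, not anything Google has published):
              | 
              |   # rough KV-cache sizing; every figure here is assumed
              |   n_layers, n_heads, head_dim = 32, 32, 128
              |   bytes_per_val = 2                       # fp16
              |   per_tok = n_layers * n_heads * head_dim * 2 * bytes_per_val
              |   print(per_tok)            # 524288 B, i.e. 512 KiB/token
              |   print(per_tok * 100_000)  # ~52 GB for a 100k-token book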
        
             | gradys wrote:
              | I don't know of any public details on how they implement
              | Context Caching, but that is presumably exactly what they
              | are doing. Just caching the text would yield only minimal
              | savings.
        
             | rfoo wrote:
             | > it's just setting some values
             | 
              | But it may very well be slower than just recomputing it,
              | at least for ordinary MHA and even GQA.
              | 
              | So you need either model-architecture voodoo that
              | significantly reduces KV cache size (while keeping roughly
              | the same compute cost), or a really careful implementation
              | that moves the KV cache of upcoming requests to devices in
              | the background [0].
              | 
              | [0] My back-of-envelope calc shows that even then it still
              | does not make sense for, say, Llama 3 70B on H100s. Time to
              | stare at the TPU spec harder and try to make sense of it, I
              | guess.
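              | 
              | For a sense of the trade-off, a toy version of that
              | back-of-envelope calc (every figure below is my own round
              | assumption):
              | 
              |   params, ctx = 70e9, 100_000
              |   flops = 2 * params * ctx            # ~prefill FLOPs
              |   gpu_flops = 8 * 1e15 * 0.4          # 8 GPUs, 40% util
              |   recompute_s = flops / gpu_flops     # ~4.4 s
              |   # MHA-ish KV: 80 layers * 64 heads * 128 dim * 2 * fp16
              |   kv_bytes = ctx * 80 * 64 * 128 * 2 * 2   # ~262 GB
              |   load_s = kv_bytes / 64e9            # PCIe-ish, ~4.1 s
              |   print(recompute_s, load_s)  # comparable; slower storage
              |                               # makes loading a clear loss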
        
               | sshumaker wrote:
                | It depends on how large the input prompt (previous
                | context) is. Also, if you can keep the cache on the GPU
                | with an LRU mechanism, it's very efficient for certain
                | workloads.
               | 
               | You can also design an API optimized for batch workloads
               | (say the same core prompt with different data for
               | instruct-style reasoning) - that can result in large
               | savings in those scenarios.
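                | 
                | The bookkeeping side of that LRU is simple enough to
                | sketch (hypothetical names; the KV tensors live wherever
                | the serving stack puts them):
                | 
                |   from collections import OrderedDict
                |   from hashlib import sha256
                | 
                |   class PrefixKVCache:
                |       # tiny LRU keyed by a hash of the shared prefix
                |       def __init__(self, max_entries=32):
                |           self.entries = OrderedDict()
                |           self.max_entries = max_entries
                | 
                |       def get(self, prefix):
                |           key = sha256(prefix.encode()).hexdigest()
                |           if key not in self.entries:
                |               return None
                |           self.entries.move_to_end(key)  # recently used
                |           return self.entries[key]
                | 
                |       def put(self, prefix, past_key_values):
                |           key = sha256(prefix.encode()).hexdigest()
                |           self.entries[key] = past_key_values
                |           self.entries.move_to_end(key)
                |           if len(self.entries) > self.max_entries:
                |               self.entries.popitem(last=False)  # evict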
        
               | ethbr1 wrote:
               | If you can pipeline upcoming requests and tie state to a
               | specific request, doesn't that allow you to change how
               | you design physical memory? (at least for inference)
               | 
               | Stupid question, but why wouldn't {extremely large slow-
               | write, fast-read memory} + {smaller, very fast-write
               | memory} be a feasible hardware architecture?
               | 
               | If you know many, many cycles ahead what you'll need to
               | have loaded at a specific time.
               | 
               | Or hell, maybe it's time to go back to memory bank
               | switching.
        
           | objektif wrote:
            | But isn't the information somehow cached when you start a new
            | chat and build context with, say, GPT-4? If the cache were as
            | large as you say, so many parallel chat sessions would not be
            | possible.
        
             | dosinga wrote:
              | That's not my understanding. We can't be sure how OpenAI
              | does things internally, but adding messages to a
              | conversation in the API just means rerunning the whole
              | history as part of the prompt every time.
        
           | jsemrau wrote:
            | > The size of the cached internal state of the network
            | processing the book is much larger than the size of the book
            | 
            | It's funny that people sometimes consider LLMs to be
            | compression engines, even though a lot of information gets
            | lost in each direction (through the neural net).
        
             | shwaj wrote:
             | Why is that funny? Sometimes compression is lossy, like
             | JPEG and H.265
        
               | pornel wrote:
               | And the internal state of a JPEG decoder can be an order
               | of magnitude larger than the JPEG file (especially
               | progressive JPEG that can't stream its output).
        
               | okdood64 wrote:
               | I don't lose anything with gzip or rar.
        
               | giancarlostoro wrote:
                | And just as fast? The issue here is how you do these
                | things accurately while maintaining reasonable speed.
        
               | fwip wrote:
               | You can make any lossy compression scheme into a lossless
               | scheme by appending the diff between the original and the
               | compressed. In many cases, this still results in a size
               | savings over the original.
               | 
               | You can think of this as a more detailed form of "I
               | before E, except after C, except for species and science
               | and..." Or, if you prefer, as continued terms of a
               | Taylor-series expansion. The more terms you add, the more
               | closely you approximate the original.
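                | 
                | As a toy illustration of the idea (nothing LLM-specific;
                | the "lossy" step here is just dropping low bits):
                | 
                |   import zlib
                | 
                |   def lossy(data: bytes) -> bytes:
                |       return bytes(b & 0xF0 for b in data)  # drop low bits
                | 
                |   def compress(data: bytes):
                |       approx = lossy(data)
                |       diff = bytes(a ^ b for a, b in zip(data, approx))
                |       return zlib.compress(approx), zlib.compress(diff)
                | 
                |   def decompress(approx_z, diff_z) -> bytes:
                |       approx = zlib.decompress(approx_z)
                |       diff = zlib.decompress(diff_z)
                |       return bytes(a ^ d for a, d in zip(approx, diff))
                | 
                |   data = b"any payload at all" * 100
                |   assert decompress(*compress(data)) == data  # lossless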
        
         | sshumaker wrote:
         | They are almost certainly doing this internally for their own
         | chat products.
         | 
          | The simple version of this just involves saving off the KV
          | cache in the attention layers and restoring it instead of
          | recomputing. It only requires small changes to inference and
          | the attention layers.
          | 
          | The main challenge is being able to do this at scale, e.g.
          | dump the cached state out of GPU memory, persist it, and have
          | a system to rapidly reload it as needed (or just regenerate).
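          | 
          | A minimal single-process sketch of that simple version, using
          | Hugging Face transformers (toy model, no persistence, full
          | decoding loop omitted; an illustration, not how Google
          | implements it):
          | 
          |   import copy, torch
          |   from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |   tok = AutoTokenizer.from_pretrained("gpt2")
          |   model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
          | 
          |   book_ids = tok("...long shared context...",
          |                  return_tensors="pt").input_ids
          | 
          |   with torch.no_grad():
          |       # the expensive prefill over the "book", done once
          |       cached = model(book_ids, use_cache=True).past_key_values
          | 
          |   def next_token_logits(question):
          |       ids = tok(question, return_tensors="pt").input_ids
          |       with torch.no_grad():
          |           # reuse a copy of the cached prefix instead of
          |           # re-reading the whole book each time
          |           out = model(ids,
          |                       past_key_values=copy.deepcopy(cached),
          |                       use_cache=True)
          |       return out.logits[:, -1]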
        
           | ethbr1 wrote:
           | 2024 is the year of serverless LLM?
        
         | twobitshifter wrote:
         | I remember hearing that the reason Claude is expensive is that
         | every interaction makes it reread the entire conversation.
        
           | jonplackett wrote:
           | This is the case with all LLMs as far as I know.
           | 
            | With the ChatGPT API you just send everything up to that
            | point plus the new input to get the new output.
           | 
           | I think the benefit for the service is that it's stateless.
           | They just have requests in and out and don't have to worry
           | about anything else.
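            | 
            | In client code it looks roughly like this (sketch using the
            | OpenAI Python SDK; the model name is just an example):
            | 
            |   from openai import OpenAI
            | 
            |   client = OpenAI()
            |   history = [{"role": "system", "content": "Be helpful."}]
            | 
            |   def ask(user_msg):
            |       history.append({"role": "user", "content": user_msg})
            |       # the full history is re-sent (and re-billed) each call
            |       resp = client.chat.completions.create(
            |           model="gpt-4o", messages=history)
            |       reply = resp.choices[0].message.content
            |       history.append({"role": "assistant", "content": reply})
            |       return reply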
        
         | conradev wrote:
         | To me, context caching is only a subset of what is possible
         | with full control over the model. I consider this a more
         | complete list: https://github.com/microsoft/aici?tab=readme-ov-
         | file#flexibi...
         | 
         | Context caching only gets you "forking generation into multiple
         | branches" (i.e. sharing work between multiple generations)
        
       | sshumaker wrote:
        | This is a pretty standard technique if you're running the models
        | yourself; e.g., ChatGPT almost certainly does this.
        | 
        | There's even more sophisticated work in this domain that allows
        | 'template'-style partial caching:
       | https://arxiv.org/abs/2311.04934
        
       | lolpanda wrote:
        | I think llama.cpp has context caching with "--prompt-cache", but
        | it results in a very large cache file. I guess it's also very
        | expensive for any inference API provider to support caching, as
        | they have to persist the file and load/unload it each time.
        
       ___________________________________________________________________
       (page generated 2024-05-16 23:00 UTC)