[HN Gopher] Claude just slashed the cost of building AI applicat...
       ___________________________________________________________________
        
       Claude just slashed the cost of building AI applications
        
       Author : fallinditch
       Score  : 67 points
       Date   : 2024-08-18 19:14 UTC (3 hours ago)
        
 (HTM) web link (www.indiehackers.com)
 (TXT) w3m dump (www.indiehackers.com)
        
       | NBJack wrote:
        | Sounds kinda useless, TBH. This sounds as if it assumes the exact
        | same context window across requests. If so, given the 5-minute
        | window, unless, for example, your entire team is operating in the
        | same codebase at the same time, you won't really see any savings
        | beyond simple prompts.
       | 
       | Are contexts included in the prompt cache? Are they identified as
       | the same or not? What happens if we approach the 10k token range?
       | 128k? 1M?
        
         | minimaxir wrote:
         | The documentation is here and has better examples:
         | https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
         | 
          | tl;dr: you can cache system prompts, tools, and user messages
          | (up to 4 cache breakpoints total), with better returns on
          | massive inputs such as documents.
         | 
         | The use case is more for client-facing applications that would
         | hit the cache frequently rather than internal copilots.
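
          A minimal sketch of what the linked docs describe, using the
          anthropic Python SDK: the large reusable block is marked with
          cache_control, and (at launch) the beta feature is enabled via
          an anthropic-beta header. The model name and document here are
          placeholders.

            import anthropic

            client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

            large_doc = "<contents of a large reference document>"

            response = client.messages.create(
                model="claude-3-5-sonnet-20240620",
                max_tokens=512,
                # The big, stable block is marked cacheable; requests that
                # reuse the exact same prefix within the TTL read it back
                # at a fraction of the normal input price.
                system=[{
                    "type": "text",
                    "text": large_doc,
                    "cache_control": {"type": "ephemeral"},
                }],
                messages=[{"role": "user",
                           "content": "Summarize section 3."}],
                # Beta header required when the feature launched.
                extra_headers={
                    "anthropic-beta": "prompt-caching-2024-07-31"},
            )
            print(response.content[0].text)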
        
         | calendarsnack wrote:
         | Here is a simple use-case that comes to mind. Let's say you
         | have a medium-size repository, such that all the source files
         | can fit in the context. You want to use Claude as an advanced
          | auto-complete in a certain file, but it's important that it
          | has visibility into the other files in the repo.
         | 
         | You can put all the other source files in an initial system
         | message and the current file in the user message. Then, if you
         | call multiple autocompletes within 5 minutes of each other, you
         | pay a drastically reduced price for including all of the other
         | files in the context. Also, the latency is much reduced!
         | 
          | Yes, you could probably get a similar outcome by incorporating
          | RAG, search tools, etc. into the autocomplete, but as a simpler
          | approach with fewer moving parts, caching will reduce costs for
          | this setup.
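
          As a sketch of that setup (the paths, glob pattern, and model
          name are made up for illustration), the rest of the repo goes
          into a cached system block and each autocomplete call only adds
          the current file:

            import glob
            import anthropic

            client = anthropic.Anthropic()

            # Concatenate the rest of the repo into one stable block.
            repo_text = "\n\n".join(
                f"// {path}\n{open(path).read()}"
                for path in glob.glob("src/**/*.ts", recursive=True)
            )

            def autocomplete(current_file: str, prefix: str) -> str:
                """Calls made within ~5 minutes of each other reuse the
                cached repo context instead of paying for it again."""
                resp = client.messages.create(
                    model="claude-3-5-sonnet-20240620",
                    max_tokens=256,
                    system=[{
                        "type": "text",
                        "text": repo_text,
                        "cache_control": {"type": "ephemeral"},
                    }],
                    messages=[{
                        "role": "user",
                        "content": f"File {current_file} so far:\n"
                                   f"{prefix}\nContinue the code.",
                    }],
                    extra_headers={
                        "anthropic-beta": "prompt-caching-2024-07-31"},
                )
                return resp.content[0].text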
        
         | config_yml wrote:
          | I have a system prompt of around 100k tokens, which includes a
          | couple of internal schema definitions. Having this cached could
          | be super useful to us.
        
           | BonoboIO wrote:
            | 100k... I hear about and use these long-context LLMs, but
            | they often fail to make use of all the information that
            | should be in their context.
            | 
            | Does your approach work well for you?
        
           | social_quotient wrote:
            | Would love it if you could de-identify and share a bit more
            | detail on a prompt that big.
        
         | Sakos wrote:
          | It says the timeout is refreshed if there's a cache hit within
          | the 5 minutes, and reading from the cache costs 10% of the
          | base input price. Seems pretty damn useful to me. What seems
          | useless to you exactly?
         | 
         | I'm primarily limited by how much context I need for my
         | queries, and for the majority of the time, the context can
         | often largely be the same across multiple queries over periods
         | of 1-60 minutes. This is the case whether it's a codebase I'm
         | working with or a PDF (or other form of text documentation).
         | 
         | Simple queries are where I expect there to be the least gain
         | for this kind of thing.
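
          For a sense of scale, a back-of-envelope calculation using the
          Claude 3.5 Sonnet prices from the announcement ($3/MTok input,
          a 25% premium to write the cache, 10% of base to read it); the
          100k-token context and 20 queries are just example numbers:

            base = 3.00 / 1_000_000   # $ per input token (3.5 Sonnet)
            write = base * 1.25       # first request writes the cache
            read = base * 0.10        # later hits read it at 10% of base

            context = 100_000         # tokens of shared context
            queries = 20              # requests while the cache is warm

            without = queries * context * base
            with_cache = context * write + (queries - 1) * context * read
            print(f"${without:.2f} vs ${with_cache:.2f}")  # ~$6 vs <$1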
        
         | Tiberium wrote:
          | > it assumes the exact same context window across requests
          | 
          | That is not true: caching works across multiple requests,
          | which is why it's so good. You can make 5 different concurrent
          | requests and they'll all be written to and read from the
          | cache, as long as the cache is still warm for them.
        
         | knallfrosch wrote:
          | Think beyond AI coding assistants: JSON schema definitions,
          | FAQs, product manuals, game instructions, game state, querying
          | a student's thesis... anything where users ask a chatbot about
          | something specific.
        
       | mathgeek wrote:
       | Direct link to the feature referenced:
       | https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
        
       | Scene_Cast2 wrote:
       | Why does prompt caching reduce costs? I'm assuming that the
       | primary cost driver is GPU/TPU FLOPS, as opposed to any network /
       | storage / etc costs.
       | 
       | My understanding is that an LLM will take in the stream of text,
       | tokenize it (can be faster with caching, sure, but it's a minor
       | drop in the bucket), then run a transformer on the entire
       | sequence. You can't just cache the output of a transformer on a
       | prefix to reduce workload.
        
         | maged wrote:
         | Why not? It's caching the state of the model after the cached
         | prefix, so that inference workload doesn't need to be run
         | again.
        
         | pclmulqdq wrote:
         | You actually can cache the "output" of a transformer on the
         | prefix by caching what happens in the attention layer for that
         | text string (specifically the "K" and "V" tensors). Since the
         | attention layer is a big part of the compute cost of the
         | transformer, this does cut down FLOPs dramatically.
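
          A toy single-head illustration of that idea (not Anthropic's
          actual implementation): cache the prefix's K and V, process
          only the newly appended token, and check that the result
          matches a full recompute.

            import numpy as np

            rng = np.random.default_rng(0)
            d = 8                                  # head dimension
            Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

            def attend(q, K, V):
                """Attention output for one query over keys K, values V."""
                scores = K @ q / np.sqrt(d)
                w = np.exp(scores - scores.max())
                w /= w.sum()
                return w @ V

            prefix = rng.standard_normal((5, d))   # cached prompt states
            new = rng.standard_normal(d)           # one new token

            # Cache write: K/V for the prefix are computed once, stored.
            K_cache, V_cache = prefix @ Wk, prefix @ Wv

            # Cache read: only the new token's projections are computed.
            K = np.vstack([K_cache, new @ Wk])
            V = np.vstack([V_cache, new @ Wv])
            out_cached = attend(new @ Wq, K, V)

            # Recomputing over the whole sequence gives the same answer.
            full = np.vstack([prefix, new])
            out_full = attend(full[-1] @ Wq, full @ Wk, full @ Wv)
            assert np.allclose(out_cached, out_full)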
        
           | Scene_Cast2 wrote:
           | Oh interesting, didn't know. How does this work past the
           | first transformer in the stack?
        
             | lonk11 wrote:
              | My understanding is that the attention in all transformer
              | layers is "causal" - that is, the output of a transformer
              | layer for token N depends only on tokens 0 through N.
              | 
              | This means that every attention layer can reuse previously
              | calculated outputs for the same prompt prefix. So it only
              | needs to calculate from scratch starting from the first
              | token that differs in the prompt sequence.
        
             | danielmarkbruce wrote:
              | I had the same question... my guess is you can do a layer-
              | by-layer cache, i.e. a cache in the first layer, then
              | another independent second-layer cache, and so on.
        
         | logicchains wrote:
         | The transformer only looks backwards, so if the first part of
         | the sequence (the prompt) doesn't change, you don't need to
         | rerun it again on that part, just on the part after it that
         | changed. For use cases with large prompts relative to the
         | output size (e.g. lots of examples in the prompt), this can
         | significantly speed up the workload.
        
           | jncfhnb wrote:
            | I think most architectures do a layer of normalization on the
            | whole text embeddings before calculating attention, which
            | makes this infeasible.
            | 
            | Shouldn't be a huge deal to adjust, imo.
            | 
            | One of the bigger problems is that closed-model providers
            | don't want to expose the embedding space and let users see
            | what they have.
        
             | danielmarkbruce wrote:
             | I don't think the normalization makes it infeasible. They
             | should be able to make an adjustment (the reverse of the
             | normalization) in one operation. I think they are caching
             | the attention calcs.
             | 
            | The hard thing (I think) is what to keep in the cache and
            | where to keep it, given that you are serving lots of
            | customers and the attention calc can become a large set of
            | numbers pretty quickly.
        
         | burtonator wrote:
          | Autoregressive models can't just resume, so they have to
          | re-parse the entire prompt for each execution.
          | 
          | By caching the intermediate state, they can resume from where
          | they left off, completely bypassing all that computation.
          | 
          | For large contexts this can save a ton of compute!
          | 
          | I think this feature and structured outputs are two of the
          | biggest developments in LLMs this year.
        
           | minimaxir wrote:
            | Prompt caching has been a thing for LLMs since GPT-2 (e.g.
            | transformers' `use_cache=True`); it's more of a surprise
            | that it took this long for the major LLM providers to offer
            | a good implementation.
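
            For the open-source analogue, a minimal sketch with Hugging
            Face transformers and GPT-2 (the relevant kwargs are
            use_cache=True and past_key_values); the prompt text is
            arbitrary:

              import torch
              from transformers import GPT2LMHeadModel, GPT2Tokenizer

              tok = GPT2Tokenizer.from_pretrained("gpt2")
              model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

              # Run the reusable prompt prefix once, keep its K/V state.
              prefix = tok("A long shared prompt prefix.",
                           return_tensors="pt")
              with torch.no_grad():
                  out = model(**prefix, use_cache=True)
              past = out.past_key_values   # cached K/V for every layer

              # Later calls feed only the new tokens plus cached state.
              new = tok(" Now the new part.",
                        return_tensors="pt").input_ids
              with torch.no_grad():
                  out2 = model(new, past_key_values=past, use_cache=True)
              print(tok.decode(out2.logits[0, -1].argmax().item()))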
        
             | brylie wrote:
             | I'm building an app with OpenAI, using structured outputs.
             | Does OpenAI also support prompt caching?
        
               | minimaxir wrote:
               | Not currently.
        
               | cma wrote:
               | I'm sure internally they use it for the system prompt at
               | least, probably since launch. And maybe for common
               | initial user queries that exactly match.
        
               | Onavo wrote:
               | They are certainly not passing the savings on to the
               | users.
        
               | minimaxir wrote:
               | Yet. I suspect OpenAI will release a similar offering
               | soon. (hooray, free market competition!)
        
         | danielmarkbruce wrote:
         | They cache the results of the attention calc. For certain
         | subsets which are common this makes a lot of sense. I'm
         | surprised they can make it work though, given they are serving
         | so many different users. Someone somewhere did some very clever
         | engineering.
        
       | nprateem wrote:
       | I just tried Claude the other day. What a breath of fresh air
       | after fighting the dogshit that is OpenAI.
       | 
       | Far less "in the realm of", "in today's fast-moving...",
       | multifaceted, delve or other pretentious wank.
       | 
        | There is still some, though, so they obviously used the same
        | dataset that's overweighted toward academic papers. Still, I'm
        | hopeful I can finally get it to write stuff that doesn't sound
        | like AI garbage.
       | 
       | Kind of weird there's no moderation API though. Will they just
       | cut me off if my customers try to write about things they don't
       | like?
        
         | vellum wrote:
         | You can try AWS Bedrock or Openrouter if that happens. They
         | both have the Claude API.
        
         | bastawhiz wrote:
         | > Will they just cut me off if my customers try to write about
         | things they don't like?
         | 
          | The response you get back will contain a refusal, which is
          | pretty standard.
        
       | verdverm wrote:
        | FWIW, Gemini / Vertex has this as well and lets you control the
        | TTL. Billing is based on how long you keep the context cached.
        | 
        | https://ai.google.dev/gemini-api/docs/caching?lang=python
        | 
        | Costs $1 per 1M tokens per hour.
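
        A rough sketch of the linked API using the google-generativeai
        Python SDK as documented at the time; the model name, document,
        and TTL are only examples (note the API enforces a minimum cached
        token count):

          import datetime
          import google.generativeai as genai
          from google.generativeai import caching

          genai.configure(api_key="...")

          # The cache is billed per token per hour for as long as it lives.
          cache = caching.CachedContent.create(
              model="models/gemini-1.5-flash-001",
              system_instruction="Answer questions about the manual.",
              contents=["<contents of a large product manual>"],
              ttl=datetime.timedelta(hours=1),
          )

          model = genai.GenerativeModel.from_cached_content(
              cached_content=cache)
          print(model.generate_content("How do I reset the device?").text)

          cache.delete()   # stop paying for storage when done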
        
         | bastawhiz wrote:
          | That pricing is ridiculous. A token is essentially a 32-bit
          | integer: four bytes. A million tokens is 4 MB. Imagine paying
          | $1/hr for less than the storage of three floppies. That's two
          | million times more expensive than the storage cost of standard
          | S3 (720 hours x 256M tokens (1 GB) x $1, vs $0.09), or 2000
          | times more expensive than the storage cost of ElastiCache
          | serverless.
          | 
          | (Yes, I realize it's probably more than 4 MB, but it's still
          | an outrageously high markup. They could do their own caching,
          | not tell you they're doing it, keep the difference, and make
          | even more money.)
        
           | IanCal wrote:
           | Well that really depends where you're caching the data.
           | 
           | Is it a lot for caching in L1 on a chip somewhere? No that'd
           | be wildly cheap.
           | 
           | Is it a lot for "caching" on a tape somewhere? No.
           | 
            | So where on this scale does keeping it quickly accessible to
            | GPU memory lie?
           | 
           | > That's two million times more expensive than the storage
           | cost of standard S3 (
           | 
            | You're not comparing to S3 at all.
        
             | bastawhiz wrote:
             | "RAM near a GPU" is ~the same cost as "RAM near literally
             | any other piece of hardware". Even if it has to traverse
             | the network, that's a fairly low, fairly fixed (in, say,
             | the same rack) cost. Hell, it's probably even fast enough
             | to use an NVME disk.
             | 
             | Google can search the entire Internet in a fraction of a
             | second, they can keep a million tokens within a few dozen
             | milliseconds of a GPU for less than a dollar an hour.
        
               | IanCal wrote:
                | Is that fast enough? And how much data is being stored?
                | They're not storing the tokens you pass in but the
                | activations after processing them. I'll take a wild stab
                | that the activations for Claude 3.5 are nowhere near as
                | small as 4 MB.
        
           | GaggiX wrote:
           | What is stored is not the tokens, but all keys and values of
           | all attention layers for each token.
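
            Which is why the footprint is far larger than 4 bytes per
            token. A back-of-envelope estimate with made-up but plausible
            model dimensions (Claude's architecture isn't public):

              layers, kv_heads, head_dim = 80, 8, 128  # assumed dims
              bytes_per_value = 2                      # fp16/bf16
              per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
              print(per_token)                    # 327,680 B ~= 320 KB/token
              print(per_token * 1_000_000 / 1e9)  # ~328 GB for 1M tokens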
        
       | rglover wrote:
       | This is great news. Using Claude to build a new SaaS [1] and this
       | will likely save me quite a bit on API costs.
       | 
       | [1] https://x.com/codewithparrot
        
       | MaximusLegroom wrote:
        | I guess they got tired of losing customers to DeepSeek, which
        | introduced this feature a while ago and whose prices were
        | already minuscule, given that they only have to compute 20B
        | active parameters.
        
       ___________________________________________________________________
       (page generated 2024-08-18 23:00 UTC)