[HN Gopher] Claude just slashed the cost of building AI applicat...
___________________________________________________________________
Claude just slashed the cost of building AI applications
Author : fallinditch
Score : 67 points
Date : 2024-08-18 19:14 UTC (3 hours ago)
(HTM) web link (www.indiehackers.com)
(TXT) w3m dump (www.indiehackers.com)
| NBJack wrote:
| Sounds kinda useless, TBH. It seems to assume the exact same
| context window across requests. If so, given the 5-minute window,
| you won't really see any savings beyond simple prompts unless,
| say, your entire team is operating in the same codebase at the
| same time.
|
| Are contexts included in the prompt cache? Are they identified as
| the same or not? What happens if we approach the 10k token range?
| 128k? 1M?
| minimaxir wrote:
| The documentation is here and has better examples:
| https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
|
| tl;dr: you can cache system prompts, tools, and user messages (up
| to 4 cache breakpoints total), with better returns on massive
| inputs such as documents.
|
| The use case is more for client-facing applications that would
| hit the cache frequently rather than internal copilots.
| calendarsnack wrote:
| Here is a simple use-case that comes to mind. Let's say you
| have a medium-size repository, such that all the source files
| can fit in the context. You want to use Claude as an advanced
| auto-complete in a certain file, but it's important that it has
| visibility to other files in the repo.
|
| You can put all the other source files in an initial system
| message and the current file in the user message. Then, if you
| call multiple autocompletes within 5 minutes of each other, you
| pay a drastically reduced price for including all of the other
| files in the context. Also, the latency is much reduced!
|
| Yes, you could probably get a similar outcome by incorporating
| RAG, search tools, etc. into the autocomplete, but as a simple
| approach with fewer moving parts, caching will reduce costs for
| this setup.
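| A minimal sketch of that setup with the Anthropic Python SDK,
| using the cache_control block from the docs linked upthread (the
| model string, beta header, and repo-loading helper are
| illustrative, not prescriptive):
|
|   import anthropic
|   from pathlib import Path
|
|   client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
|
|   def load_repo_as_text(root=".", exclude=""):
|       """Concatenate every source file except the one being edited."""
|       parts = []
|       for path in sorted(Path(root).rglob("*.py")):
|           if str(path) != exclude:
|               parts.append(f"### {path}\n{path.read_text()}")
|       return "\n\n".join(parts)
|
|   repo_context = load_repo_as_text(exclude="app/editor_target.py")
|
|   def autocomplete(current_file_text: str) -> str:
|       response = client.messages.create(
|           model="claude-3-5-sonnet-20240620",
|           max_tokens=256,
|           system=[
|               {"type": "text",
|                "text": "You are a code autocomplete engine."},
|               {"type": "text",
|                "text": repo_context,  # the big, stable prefix
|                # cached for ~5 minutes, refreshed on each hit
|                "cache_control": {"type": "ephemeral"}},
|           ],
|           messages=[{"role": "user", "content": current_file_text}],
|           # prompt caching shipped as a beta; older SDKs need this header
|           extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
|       )
|       return response.content[0].text
|
| Repeated calls inside the cache window pay the reduced cached-read
| rate for repo_context instead of the full input-token price.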
| config_yml wrote:
| I have a system prompt of around 100k tokens, which includes a
| couple of internal schema definitions. Having this cached could
| be super useful to us.
| BonoboIO wrote:
| 100k ... I hear about and use all of these high-token-context
| LLMs, but they often fail to make use of all the information
| that should be in their context.
|
| Does your approach work for you?
| social_quotient wrote:
| Would love it if you could de-identify and share a bit more
| detail on a prompt that big.
| Sakos wrote:
| It says the timeout is refreshed if there's a cache hit within
| the 5 minutes, and the base cost is 10% of what it would be
| otherwise. Seems pretty damn useful to me. What seems useless
| to you exactly?
|
| I'm primarily limited by how much context I need for my queries,
| and that context can often be largely the same across multiple
| queries over periods of 1-60 minutes. This is the case whether
| it's a codebase I'm working with or a PDF (or other form of text
| documentation).
|
| Simple queries are where I expect there to be the least gain
| for this kind of thing.
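| To put rough numbers on that, here is the break-even math using
| the Claude 3.5 Sonnet prices listed at launch (cache writes cost
| 25% more than normal input tokens, cached reads cost 10%; check
| current pricing before relying on these figures):
|
|   # Prices per million input tokens (launch pricing, illustrative).
|   BASE = 3.00          # normal input tokens
|   CACHE_WRITE = 3.75   # first request pays a 25% premium
|   CACHE_READ = 0.30    # later requests within the TTL pay 10%
|
|   def cost(prefix_mtok, n_requests, cached):
|       """Input-token cost of sending the same prefix n times."""
|       if not cached:
|           return prefix_mtok * BASE * n_requests
|       return prefix_mtok * (CACHE_WRITE + CACHE_READ * (n_requests - 1))
|
|   # A 100k-token context (0.1 MTok) reused 20 times in 5 minutes:
|   print(cost(0.1, 20, cached=False))  # 6.00 USD
|   print(cost(0.1, 20, cached=True))   # 0.945 USD, roughly 6x cheaper
|
| Caching comes out ahead from the second request onward; the more
| hits within the window, the closer you get to the 10x saving.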
| Tiberium wrote:
| > it assumes the exact same context window across requests
|
| That is not true: caching works across multiple requests, which
| is why it's so good. You can make 5 different concurrent requests
| and they'll all get cached and read from the cache as long as the
| cache is still warm for them.
| knallfrosch wrote:
| Think beyond AI coding assistants: JSON schema definitions, FAQs,
| product manuals, game instructions, game state, querying a
| student's thesis... anything where users query a chatbot for
| information about something specific.
| mathgeek wrote:
| Direct link to the feature referenced:
| https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
| Scene_Cast2 wrote:
| Why does prompt caching reduce costs? I'm assuming that the
| primary cost driver is GPU/TPU FLOPS, as opposed to any network /
| storage / etc costs.
|
| My understanding is that an LLM will take in the stream of text,
| tokenize it (can be faster with caching, sure, but it's a minor
| drop in the bucket), then run a transformer on the entire
| sequence. You can't just cache the output of a transformer on a
| prefix to reduce workload.
| maged wrote:
| Why not? It's caching the state of the model after the cached
| prefix, so that inference workload doesn't need to be run
| again.
| pclmulqdq wrote:
| You actually can cache the "output" of a transformer on the
| prefix by caching what happens in the attention layer for that
| text string (specifically the "K" and "V" tensors). Since the
| attention layer is a big part of the compute cost of the
| transformer, this does cut down FLOPs dramatically.
| Scene_Cast2 wrote:
| Oh interesting, didn't know. How does this work past the
| first transformer in the stack?
| lonk11 wrote:
| My understanding is that the attention in all transformer layers
| is "causal": the output of a transformer layer for token N
| depends only on tokens 0 to N.
|
| This means that every attention layer can reuse previously
| calculated outputs for the same prompt prefix. So it only needs
| to calculate from scratch starting from the first token that
| differs in the prompt sequence.
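| A toy single-head attention check of that property (dimensions
| and weights made up). Because of the causal mask, the rows for
| the prefix tokens are identical with or without the new token, so
| the prefix's K/V can be cached and reused, and the same argument
| applies layer by layer:
|
|   import numpy as np
|
|   d = 8
|   rng = np.random.default_rng(0)
|   Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
|
|   def causal_attention(x):
|       q, k, v = x @ Wq, x @ Wk, x @ Wv
|       scores = q @ k.T / np.sqrt(d)
|       mask = np.triu(np.ones_like(scores), k=1).astype(bool)
|       scores[mask] = -np.inf               # token N only sees 0..N
|       w = np.exp(scores - scores.max(axis=-1, keepdims=True))
|       w /= w.sum(axis=-1, keepdims=True)
|       return w @ v
|
|   prefix = rng.normal(size=(5, d))         # 5 "cached" prompt tokens
|   new_token = rng.normal(size=(1, d))      # 1 new token
|
|   full = causal_attention(np.vstack([prefix, new_token]))
|   prefix_only = causal_attention(prefix)
|
|   # The prefix rows are unchanged by the new token, so their K/V
|   # (and everything downstream) can come from a cache.
|   assert np.allclose(full[:5], prefix_only)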
| danielmarkbruce wrote:
| I had the same question... my guess is you can do a layer-by-
| layer cache, i.e. a cache in the first layer, then another
| independent second-layer cache, and so on.
| logicchains wrote:
| The transformer only looks backwards, so if the first part of
| the sequence (the prompt) doesn't change, you don't need to
| rerun it again on that part, just on the part after it that
| changed. For use cases with large prompts relative to the
| output size (e.g. lots of examples in the prompt), this can
| significantly speed up the workload.
| jncfhnb wrote:
| I think most architectures do a layer of normalization on the
| whole text embeddings before calculating attention, which makes
| this infeasible.
|
| Shouldn't be a huge deal to adjust, imo.
|
| One of the bigger problems is that closed-model providers don't
| want to expose the embedding space and let users see what they
| have.
| danielmarkbruce wrote:
| I don't think the normalization makes it infeasible. They
| should be able to make an adjustment (the reverse of the
| normalization) in one operation. I think they are caching
| the attention calcs.
|
| The hard thing (I think) is what to keep in the cache and where
| to keep it, given that you are serving lots of customers and the
| attention state becomes a large set of numbers pretty quickly.
| burtonator wrote:
| Without a cache, autoregressive models can't just resume, so they
| have to re-process the entire prompt for each request.
|
| By caching that prefix state, they can pick up where they left
| off, completely bypassing all of that computation.
|
| For large contexts this could save a ton of compute!
|
| I think this feature and structured outputs are some of the
| biggest inventions in LLMs this year.
| minimaxir wrote:
| Prompt caching has been a thing for LLMs since GPT-2 (e.g.
| transformers' `use_cache=True` / `past_key_values`); it's more of
| a surprise that it took this long for the main LLM providers to
| ship a good implementation.
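| For reference, the local-model version of this with Hugging Face
| transformers looks roughly like the sketch below (GPT-2 only as a
| small stand-in; prompt and question strings are placeholders):
|
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   tok = AutoTokenizer.from_pretrained("gpt2")
|   model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
|
|   # Run the shared prefix once and keep the K/V cache it produces.
|   prefix_ids = tok("Shared system prompt and context go here.",
|                    return_tensors="pt").input_ids
|   with torch.no_grad():
|       out = model(prefix_ids, use_cache=True)
|   cache = out.past_key_values  # per-layer K/V tensors for the prefix
|
|   # A later request feeds only the *new* tokens plus the cached state.
|   new_ids = tok(" The user question goes here.",
|                 return_tensors="pt").input_ids
|   with torch.no_grad():
|       out2 = model(new_ids, past_key_values=cache, use_cache=True)
|   # out2.logits covers just the new tokens; the prefix was never
|   # re-processed.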
| brylie wrote:
| I'm building an app with OpenAI, using structured outputs.
| Does OpenAI also support prompt caching?
| minimaxir wrote:
| Not currently.
| cma wrote:
| I'm sure internally they use it for the system prompt at
| least, probably since launch. And maybe for common
| initial user queries that exactly match.
| Onavo wrote:
| They are certainly not passing the savings on to the
| users.
| minimaxir wrote:
| Yet. I suspect OpenAI will release a similar offering
| soon. (hooray, free market competition!)
| danielmarkbruce wrote:
| They cache the results of the attention calc. For certain
| subsets which are common this makes a lot of sense. I'm
| surprised they can make it work though, given they are serving
| so many different users. Someone somewhere did some very clever
| engineering.
| nprateem wrote:
| I just tried Claude the other day. What a breath of fresh air
| after fighting the dogshit that is OpenAI.
|
| Far less "in the realm of", "in today's fast-moving...",
| multifaceted, delve or other pretentious wank.
|
| There is still some, though, so they obviously used a similar
| dataset that's overweighted with academic papers. Still, I'm
| hopeful I can finally get it to write stuff that doesn't sound
| like AI garbage.
|
| Kind of weird there's no moderation API though. Will they just
| cut me off if my customers try to write about things they don't
| like?
| vellum wrote:
| You can try AWS Bedrock or OpenRouter if that happens. They both
| serve the Claude models through their own APIs.
| bastawhiz wrote:
| > Will they just cut me off if my customers try to write about
| things they don't like?
|
| The response you get back will have a refusal, which is pretty
| standard
| verdverm wrote:
| FWIW, Gemini / Vertex AI has this as well and lets you control
| the TTL. Billing is based on how long you keep the context
| cached.
|
| https://ai.google.dev/gemini-api/docs/caching?lang=python
|
| Costs $1 per 1M tokens per hour.
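| A rough sketch of that flow, based on the linked docs (model and
| file names are illustrative, and the exact API surface should be
| checked against the current documentation; note the API also
| enforces a minimum cached-token count):
|
|   import datetime
|   import google.generativeai as genai
|   from google.generativeai import caching
|
|   genai.configure(api_key="...")
|
|   # Upload the big, stable context once and pin it for an hour.
|   cache = caching.CachedContent.create(
|       model="models/gemini-1.5-flash-001",
|       system_instruction="Answer questions about the attached manual.",
|       contents=[open("manual.txt").read()],
|       ttl=datetime.timedelta(hours=1),  # billed per token-hour kept
|   )
|
|   # Later requests reference the cache instead of resending it all.
|   model = genai.GenerativeModel.from_cached_content(cached_content=cache)
|   response = model.generate_content("How do I reset the device?")
|   print(response.text)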
| bastawhiz wrote:
| That pricing is ridiculous. A token is essentially a 32-bit
| integer: four bytes. A million tokens is 4MB. Imagine paying
| $1/hr for less than the storage of three floppies. That's about
| two million times more expensive than standard S3 storage (720
| hours x 256 x $1 per 1M tokens per hour is roughly $184k per
| GB-month, vs about $0.09 for S3), or about 2000 times more
| expensive than ElastiCache Serverless storage.
|
| (Yes, I realize it's probably more than 4MB, but it's still an
| outrageously high markup. They could do their own caching, not
| tell you they're doing it, keep the difference, and make even
| more money.)
| IanCal wrote:
| Well that really depends where you're caching the data.
|
| Is it a lot for caching in L1 on a chip somewhere? No that'd
| be wildly cheap.
|
| Is it a lot for "caching" on a tape somewhere? No.
|
| So where on this scale does keeping it quick to get to gpu
| memory lie?
|
| > That's two million times more expensive than the storage cost
| > of standard S3
|
| You're not comparing to S3 at all.
| bastawhiz wrote:
| "RAM near a GPU" is ~the same cost as "RAM near literally
| any other piece of hardware". Even if it has to traverse
| the network, that's a fairly low, fairly fixed (in, say,
| the same rack) cost. Hell, it's probably even fast enough
| to use an NVME disk.
|
| Google can search the entire Internet in a fraction of a
| second, they can keep a million tokens within a few dozen
| milliseconds of a GPU for less than a dollar an hour.
| IanCal wrote:
| Is that fast enough? And how much data is being stored?
| They're not storing the tokens you pass on but the
| activations after processing them. I'll take a wild stab
| that the activations for Claude 3.5 aren't anywhere near
| 4 meg.
| GaggiX wrote:
| What is stored is not the tokens, but all keys and values of
| all attention layers for each token.
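| To put rough numbers on that: per token, the cache holds 2 (K and
| V) x layers x KV heads x head dimension values. Claude's
| architecture is not public, so the figures below are made-up but
| plausible numbers for a large grouped-query-attention model in
| fp16:
|
|   layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
|
|   per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
|   print(per_token)                     # 327,680 bytes, ~320 KB/token
|   print(per_token * 1_000_000 / 1e9)   # ~328 GB for a 1M-token cache
|
| So a warm 1M-token cache is hundreds of gigabytes of activations
| that have to stay close to the accelerators, not 4MB of token ids
| in S3.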
| rglover wrote:
| This is great news. Using Claude to build a new SaaS [1] and this
| will likely save me quite a bit on API costs.
|
| [1] https://x.com/codewithparrot
| MaximusLegroom wrote:
| I guess they got tired of losing customers to DeepSeek, which
| introduced this feature a while ago and whose prices were already
| minuscule, given that it only has to compute ~20B active
| parameters.
___________________________________________________________________
(page generated 2024-08-18 23:00 UTC)