[HN Gopher] Batch Mode in the Gemini API: Process More for Less
___________________________________________________________________
Batch Mode in the Gemini API: Process More for Less
Author : xnx
Score : 159 points
Date : 2025-07-07 16:30 UTC (4 days ago)
(HTM) web link (developers.googleblog.com)
(TXT) w3m dump (developers.googleblog.com)
| tripplyons wrote:
| For those who aren't aware, OpenAI has a very similar batch mode
| (50% discount if you wait up to 24 hours):
| https://platform.openai.com/docs/api-reference/batch
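|
| A minimal sketch of that flow with the openai Python SDK (the
| model name and file path here are just placeholders):
|
| ```python
| from openai import OpenAI
|
| client = OpenAI()
|
| # requests.jsonl: one request per line, e.g.
| # {"custom_id": "r1", "method": "POST", "url": "/v1/chat/completions",
| #  "body": {"model": "gpt-4o-mini",
| #           "messages": [{"role": "user", "content": "Hi"}]}}
| batch_file = client.files.create(
|     file=open("requests.jsonl", "rb"), purpose="batch"
| )
|
| batch = client.batches.create(
|     input_file_id=batch_file.id,
|     endpoint="/v1/chat/completions",
|     completion_window="24h",  # results promised within 24 hours
| )
| print(batch.id, batch.status)  # poll until status == "completed"
| ```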
|
| It's nice to see competition in this space. AI is getting cheaper
| and cheaper!
| fantispug wrote:
| Yes, this seems to be a common capability - Anthropic and
| Mistral have something very similar, as do resellers like AWS
| Bedrock.
|
| I guess it lets them better utilise their hardware in quiet
| times throughout the day. It's interesting they all picked a
| 50% discount.
| qrian wrote:
| Bedrock has a batch mode, but only for Claude 3.5, which is
| about a year old, so it isn't very useful.
| calaphos wrote:
| Inference throughput scales really well with larger batch
| sizes (at the cost of latency) due to rising arithmetic
| intensity and the fact that it's almost always memory BW
| limited.
| briangriffinfan wrote:
| 50% is my personal threshold for a discount going from not
| worth it to worth it.
| bayesianbot wrote:
| DeepSeek has gone a slightly different route - they give an
| automatic 75% discount between 16:30-00:30 UTC
|
| https://api-docs.deepseek.com/quick_start/pricing
| dlvhdr wrote:
| The latest price increases beg to differ
| dmos62 wrote:
| What price increases?
| rvnx wrote:
| I guess the Gemini price increase
| dmos62 wrote:
| Ah, 2.5 flash non-thinking price was increased to match
| the price of 2.5 flash thinking.
| Workaccount2 wrote:
| No, 2.5 flash non-thinking was replaced with 2.5 flash
| lite, and 2.5 flash thinking had its cost rebalanced
| (input price increased/output price decreased)
|
| 2.5 flash non-thinking doesn't exist anymore. People call
| it a price increase but it's just confusion about what
| Google did.
| sunaookami wrote:
| They try to frame it as such but 2.5 Flash Lite is not
| the same as 2.5 Flash without thinking. It's worse.
| dist-epoch wrote:
| Only because Flash was mispriced to start with. It was set
| too cheap compared with its capabilities. They didn't raise
| the price of Pro.
| laborcontract wrote:
| One open secret is that batch mode generations often take much
| less than 24 hours. I've done a lot of generations where I get
| my results within 5ish minutes.
| ridgewell wrote:
| To my understanding it can depend a lot on the shape of your
| batch. A small batch job can be slotted in a lot quicker than
| a large batch job that has to wait for just the right moment
| when capacity fits.
| dsjoerg wrote:
| We used the previous version of this batch mode, which went
| through BigQuery. It didn't work well for us at the time because
| we were in development mode and we needed faster cycle time to
| iterate and learn. Sometimes the response would come back much
| faster than 24 hours, but sometimes not. There was no visibility
| offered into what response time you would get; just submit and
| wait.
|
| You have to be pretty darn sure that your job is going to do
| exactly what you want to be able to wait 24 hours for a response.
| It's like going back to the punched-card era. If I could get even
| 1% of the batch in a quicker response and then the rest more
| slowly, that would have made a big difference.
| cpard wrote:
| It seems that the 24h SLA is standard for batch inference among
| the vendors and I wonder how useful it can be when you have no
| visibility on when the job will be delivered.
|
| I wonder why they do that and who is actually getting value out
| of these batch APIs.
|
| Thanks for sharing your experience!
| vineyardmike wrote:
| It's like most batch processes: it's not useful if you don't
| know what the response will be and you're iterating
| interactively. But for data pipelines, analytics workloads,
| etc, you can handle that delay because no one is waiting on
| the response.
|
| I'm a developer working on a product that lets users upload
| content. This upload is not time sensitive. We pass the
| content through a review pipeline, where we do moderation
| and analysis, plus some business-specific checks that the user
| uploaded relevant content. We're migrating some of that to an
| LLM based approach because (in testing) the results are just
| as good, and tweaking a prompt is easier than updating code.
| We'll probably use a batch API for this and accept that
| content can take 24 hours to be audited.
| cpard wrote:
| yeah I get that part of batch, but even with batch
| processing, you usually want to have some kind of sense of
| when the data will be done. Especially when downstream
| processes depend on that.
|
| The other part that I think makes batch LLM inference unique
| is that the results are not deterministic. That's why the
| parent's point resonates: at least some of the data should be
| available earlier, even if the rest only arrives within 24h.
| 3eb7988a1663 wrote:
| Think of it like you have a large queue of work to be done
| (eg summarize N decades of historical documents). There is
| little urgency to the outcome because the bolus is so large.
| You just want to maintain steady progress on the backlog
| where cost optimization is more important than timing.
| cpard wrote:
| yes, what you describe feels like a one off job that you
| want to run, which is big and also not time critical.
|
| Here's an example:
|
| If you are a TV broadcaster and you want to summarize and
| annotate the content generated in the past 12 hours you
| most probably need to have access to the summaries of the
| previous 12 hours too.
|
| Now if you submit a batch job for the first 12 hours of
| content, you might end up in a situation where you want to
| process the next batch but the previous one is not
| delivered yet.
|
| And imo that's fine as long as you somehow know it will take
| more than 12h to complete; the problem is it might be
| delivered to you in 1h or in 23h.
|
| That's the part of these batch APIs that I find hard to
| understand: how do you use them in a production environment
| outside of one-off jobs?
| YetAnotherNick wrote:
| Contrary to other comments, it's likely not because of
| queueing or general batch reasons. I think it is because LLMs
| are unique in the sense that they require a lot of fixed
| nodes because of VRAM requirements, and are hence harder to
| autoscale. So the batch jobs are likely executed when there
| are free resources left over from the interactive servers.
| cpard wrote:
| That makes total sense, and it implies that interactive
| inference >>> batch inference in today's market in terms of
| demand.
| dekhn wrote:
| Yes, almost certainly in this case Google sees traffic die
| off when a data center is in the dark. Specifically, there
| is a diurnal cycle of traffic, and Google usually routes
| users to close-by resources. So, late at night, all those
| backends which were running hot doing low-latency replies
| to users in near-real-time can instead switch over to
| processing batches. When I built an idle-cycle harvester at
| Google, I thought most of the free cycles would come from
| low-usage periods, but it turned out that some clusters
| were just massively underutilized and had free resources
| all 24 hours.
| jampa wrote:
| > who is actually getting value out of these batch APIs
|
| I used the batch API extensively for my side project, where I
| wanted to ingest a large number of images, extract
| descriptions, and create tags for searching. Once you've got
| the right prompt and the output is good, you can just use the
| Batch API for your pipeline. For any non-time-sensitive
| operations, it is excellent.
| cpard wrote:
| What you describe makes total sense. I think that the
| tricky part is the "non-time-sensitive operations", in an
| environment where even if you don't care to have results in
| minutes, you have pipelines that run regularly and there
| are dependencies on them.
|
| Maybe I'm just thinking too much in data engineering terms
| here.
| dist-epoch wrote:
| > you have no visibility on when the job will be delivered
|
| You do have - within 24 hours. So don't submit requests you
| need in 10 hours.
| serjester wrote:
| We've submitted tens of millions of requests at a time and
| never had it take longer than a couple hours - I think the zone
| you submit to plays a role.
| Jensson wrote:
| > If I could get even 1% of the batch in a quicker response and
| then the rest more slowly, that would have made a big
| difference.
|
| You can do this, just send 1% using the regular API.
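|
| A rough sketch of that split (send_now and queue_for_batch stand
| in for whatever client calls you already have):
|
| ```python
| import random
|
| def split_prompts(prompts, sample_rate=0.01):
|     """Route ~1% of prompts to the interactive API as a quick
|     sanity check and defer the rest to the cheaper batch API."""
|     quick, deferred = [], []
|     for p in prompts:
|         (quick if random.random() < sample_rate else deferred).append(p)
|     return quick, deferred
|
| quick, deferred = split_prompts(
|     [f"Summarize review #{i}" for i in range(10_000)]
| )
| # send_now(quick)            # regular API: feedback in seconds
| # queue_for_batch(deferred)  # batch API: results within 24h, 50% off
| ```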
| Implicated wrote:
| I was also rather puzzled by this comment - why not develop
| against the real-time endpoints and switch to batch once
| you've got things where you need them?
| lazharichir wrote:
| You can also do gemini flash lite for a subset and then batch
| the rest with flash or pro
| nnx wrote:
| It would be nice if OpenRouter supported batch mode too, sending
| a batch and letting OpenRouter find the best provider for the
| batch within a given price and response time.
| pugio wrote:
| Hah, I've been wrestling with this ALL DAY. Another example of
| Phenomenal Cosmic Powers (AI) combined with itty bitty docs
| (typical of Google). The main endpoint
| ("https://generativelanguage.googleapis.com/v1beta/models/gemi...")
| doesn't even have
| actual REST documentation in the API. The Python API has 3
| different versions of the same types. One of the main ones
| (`GenerateContentRequest`) isn't available in the newest path
| (`google.genai.types`) so you need to find it in an older
| version, but then you start getting version mismatch errors, and
| then pydantic errors, until you finally decide to just cross your
| fingers and submit raw JSON, only to get opaque API errors.
|
| So, if anybody else is frustrated and not finding anything online
| about this, here are a few things I learned, specifically for
| structured output generation (which is a main use case for
| batching) - the individual request JSON should resolve to this:
|
| ```json
| {
|   "request": {
|     "contents": [
|       { "parts": [ { "text": "Give me the main output please" } ] }
|     ],
|     "system_instruction": {
|       "parts": [ { "text": "You are a main output maker." } ]
|     },
|     "generation_config": {
|       "response_mime_type": "application/json",
|       "response_json_schema": {
|         "type": "object",
|         "properties": {
|           "output1": { "type": "string" },
|           "output2": { "type": "string" }
|         },
|         "required": [ "output1", "output2" ]
|       }
|     }
|   },
|   "metadata": { "key": "my_id" }
| }
| ```
|
| To get actual structured output, don't just do
| `generation_config.response_schema`, you need to include the
| mime-type, and the key should be `response_json_schema`. Any
| other combination will either throw opaque errors or won't
| trigger Structured Output (and will contain the usual LLM intros
| "I'm happy to do this for you...").
|
| So you upload a .jsonl file with the above JSON, and then you try
| to submit it for a batch job. If something is wrong with your
| file, you'll get a "400" and no other info. If something is wrong
| with the request submission you'll get a 400 with "Invalid JSON
| payload received. Unknown name \"file_name\" at
| 'batch.input_config.requests': Cannot find field."
|
| I got the above error endless times when trying _their exact
| sample code_:
|
| ```
| BATCH_INPUT_FILE='files/123456' # File ID
| curl https://generativelanguage.googleapis.com/v1beta/models/gemi... \
|   -X POST \
|   -H "x-goog-api-key: $GEMINI_API_KEY" \
|   -H "Content-Type:application/json" \
|   -d "{
|     'batch': {
|       'display_name': 'my-batch-requests',
|       'input_config': {
|         'requests': { 'file_name': ${BATCH_INPUT_FILE} }
|       }
|     }
|   }"
| ```
|
| Finally got the job submission working via the Python API
| (`file_batch_job = client.batches.create()`), but remember: if
| something is wrong with the file you're submitting, they won't
| tell you what, or how.
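|
| A sketch of the submission step with the google-genai SDK; the
| parameter names below are my reading of the docs and may not
| match every SDK version:
|
| ```python
| from google import genai
| from google.genai import types
|
| client = genai.Client()  # reads GEMINI_API_KEY from the environment
|
| # Upload the .jsonl built from the request JSON shown above.
| uploaded = client.files.upload(
|     file="batch_requests.jsonl",
|     config=types.UploadFileConfig(
|         display_name="my-batch-requests", mime_type="jsonl"
|     ),
| )
|
| # Submit the batch job against the uploaded file.
| file_batch_job = client.batches.create(
|     model="models/gemini-2.5-flash",
|     src=uploaded.name,
|     config={"display_name": "my-batch-requests"},
| )
| print(file_batch_job.name, file_batch_job.state)
| ```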
| TheTaytay wrote:
| Thank you for posting this! (When I run into errors with posted
| sample code, I spend WAY too long assuming it's my fault.)
| great_psy wrote:
| Is this an indication of the peak of the AI bubble?
|
| In a way this is saying that there are some GPUs just sitting
| around so they would rather get 50% than nothing for their use.
| graeme wrote:
| Seems more like electricity pricing, which has peak and offpeak
| pricing for most business customers.
|
| To handle peak daily load you _need_ capacity that goes unused
| in offpeak hours.
| reasonableklout wrote:
| Why do you think that this means "idle GPU" rather than a
| company recognizing a growing need and allocating resources
| toward it?
|
| It's cheaper because it's a different market with different
| needs, one that can be served by systems optimizing for
| throughput instead of latency. Feels like you're looking for
| something that's not there.
| dmitry-vsl wrote:
| Is it possible to use batch mode with fine-tuned models?
| segalord wrote:
| Man, Google's offerings are so inconsistent. Batch processing
| has been available on Vertex for a while now. I don't really
| get why they have two different offerings in Vertex and
| Gemini; both are equally inaccessible.
| nikolayasdf123 wrote:
| omg I realized this is not Vertex AI _face-palm_
| rockwotj wrote:
| It's because Vertex is the "enterprise" offering that is HIPAA
| compliant, etc. That is why Vertex only has explicit prompt
| caching and not implicit, etc. Vertex usage is never used for
| training or model feedback, but Gemini API usage is. Basically
| the Gemini API is Google's way of being able to move faster
| like OpenAI and the other foundation model providers, while
| still having an enterprise offering. Go check Anthropic's
| documentation: they even say if you have enterprise or
| regulatory needs, go use Bedrock or Vertex.
| Deathmax wrote:
| Vertex's offering of Gemini very much does implicit caching,
| and that has always been the case [1]. The recent addition of
| applying implicit cache hit discounts also works on Vertex,
| as long as you don't use the `global` endpoint and hit one of
| the regional endpoints.
|
| [1]: http://web.archive.org/web/20240517173258/https://cloud.
| goog..., "By default Google caches a customer's inputs and
| outputs for Gemini models to accelerate responses to
| subsequent prompts from the customer. Cached contents are
| stored for up to 24 hours."
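|
| For what it's worth, with the google-genai SDK the regional vs
| `global` choice is just the client's location (project ID and
| region below are placeholders):
|
| ```python
| from google import genai
|
| # Regional endpoint; per the above, implicit cache-hit discounts
| # apply here, but not on location="global".
| client = genai.Client(
|     vertexai=True, project="my-project", location="us-central1"
| )
|
| resp = client.models.generate_content(
|     model="gemini-2.5-flash", contents="Hello"
| )
| print(resp.text)
| ```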
| druskacik wrote:
| I had been using OpenAI's batch API for some time, then replaced
| it with Mistral's batch API because it was cheaper (Mistral Small
| with $0.10 / $0.20 per million tokens was perfect for my use
| case). This makes me rethink my choice, e.g. Gemini 2.5 Flash-
| Lite seems to be a better model[0] with only a slight price
| increase.
|
| [0] https://artificialanalysis.ai/leaderboards/models
| tucnak wrote:
| I really hope it means that 2.5 models will be available for
| batching in Vertex, too. We spent quite a bit of effort making
| it work with BigQuery, and it's really cool when it works.
| There's an edge case, though, where it doesn't work: when the
| batch also refers to a cached prompt. We reported this a few
| months ago.
| anupj wrote:
| Batch Mode for the Gemini API feels like Google's way of asking,
| "What if we made AI more affordable and slower, but at massive
| scale?" Now you can process 10,000 prompts like "Summarize each
| customer review in one line" for half the cost, provided you're
| willing to wait until tomorrow for the results.
| dist-epoch wrote:
| Most LLM providers have batch mode. Not sure why you are
| calling them out.
| okdood64 wrote:
| I'll take it further: regular cloud compute has had batch
| workload capabilities at cheaper rates since forever.
| diggan wrote:
| > Now you can process 10,000 prompts like "Summarize each
| customer review in one line" for half the cost, provided you're
| willing to wait until tomorrow for the results.
|
| Sounds like a great option to have available? Not every task I
| use LLMs for needs an immediate response, and if I wasn't using
| local models for those things, getting a 50% discount and
| having to wait a day sounds like a fine tradeoff.
| XTXinverseXTY wrote:
| This is an extremely common use case.
|
| Reading your comment history: are you an LLM?
|
| https://news.ycombinator.com/item?id=44531907
|
| https://news.ycombinator.com/item?id=44531868
| okdood64 wrote:
| I don't understand the point you're making. This has been a
| commonly used offering since cloud blew up.
|
| https://aws.amazon.com/ec2/spot/
| kerisi wrote:
| I've been using this with nothing notable to mention, besides
| what seems to be a common bug where you receive an empty text
| response.
|
| https://discuss.ai.google.dev/t/gemini-2-5-pro-with-empty-re...
| lopuhin wrote:
| I find OpenAI's new flex processing more attractive, as it has
| the same 50% discount but lets you use the same API as regular
| chat mode, so you can still do things the Batch API won't handle
| (e.g. evaluating agents). In practice I found it to work well
| enough when paired with client-side request caching:
| https://platform.openai.com/docs/guides/flex-processing?api-...
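|
| A minimal sketch of a flex request, assuming the current openai
| Python SDK (the model name is only an example; flex is limited
| to certain models):
|
| ```python
| from openai import OpenAI
|
| # Flex requests can queue, so a generous client timeout helps.
| client = OpenAI(timeout=900.0)
|
| resp = client.chat.completions.create(
|     model="o3",            # example; flex supports only some models
|     messages=[{"role": "user", "content": "Summarize this review: ..."}],
|     service_tier="flex",   # same chat API, ~50% cheaper
| )
| print(resp.choices[0].message.content)
| ```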
| irthomasthomas wrote:
| It's nice that they stack the batch pricing and caching
| discounts. I asked the Google guy if they did the same but got
| no reply, so probably not.
|
| Edit: Anthropic also stacks batching and caching discounts.
___________________________________________________________________
(page generated 2025-07-11 23:01 UTC)