[HN Gopher] Batch Mode in the Gemini API: Process More for Less
       ___________________________________________________________________
        
       Batch Mode in the Gemini API: Process More for Less
        
       Author : xnx
       Score  : 159 points
       Date   : 2025-07-07 16:30 UTC (4 days ago)
        
 (HTM) web link (developers.googleblog.com)
 (TXT) w3m dump (developers.googleblog.com)
        
       | tripplyons wrote:
       | For those who aren't aware, OpenAI has a very similar batch mode
       | (50% discount if you wait up to 24 hours):
       | https://platform.openai.com/docs/api-reference/batch
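        | 
        | A minimal sketch of that flow (assuming the openai Python SDK;
        | the model name and file contents here are illustrative): you
        | upload a JSONL file of requests, then create a batch with a 24h
        | completion window.
        | 
        | ```python
        | from openai import OpenAI
        | 
        | client = OpenAI()
        | 
        | # requests.jsonl holds one request per line, e.g.
        | # {"custom_id": "r1", "method": "POST", "url": "/v1/chat/completions",
        | #  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi"}]}}
        | batch_file = client.files.create(file=open("requests.jsonl", "rb"),
        |                                  purpose="batch")
        | 
        | job = client.batches.create(
        |     input_file_id=batch_file.id,
        |     endpoint="/v1/chat/completions",
        |     completion_window="24h",  # results are often ready much sooner
        | )
        | print(job.id, job.status)
        | ```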
       | 
       | It's nice to see competition in this space. AI is getting cheaper
       | and cheaper!
        
         | fantispug wrote:
         | Yes, this seems to be a common capability - Anthropic and
         | Mistral have something very similar as do resellers like AWS
         | Bedrock.
         | 
         | I guess it lets them better utilise their hardware in quiet
         | times throughout the day. It's interesting they all picked 50%
         | discount.
        
           | qrian wrote:
            | Bedrock has a batch mode, but only for Claude 3.5, which is
            | about a year old, so it isn't very useful.
        
           | calaphos wrote:
            | Inference throughput scales really well with larger batch
           | sizes (at the cost of latency) due to rising arithmetic
           | intensity and the fact that it's almost always memory BW
           | limited.
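            | 
            | A rough back-of-the-envelope sketch of that effect (the layer
            | size and numbers are illustrative):
            | 
            | ```python
            | # Arithmetic intensity (FLOPs per byte moved) for one fp16
            | # linear layer serving a batch of B tokens. At small B the
            | # weight traffic dominates, so intensity grows roughly
            | # linearly with batch size until the GPU turns compute-bound.
            | def arithmetic_intensity(b: int, d_in: int = 8192, d_out: int = 8192) -> float:
            |     flops = 2 * b * d_in * d_out
            |     bytes_moved = 2 * d_in * d_out + 2 * b * (d_in + d_out)  # weights + activations
            |     return flops / bytes_moved
            | 
            | for b in (1, 8, 64, 512):
            |     print(f"batch={b:4d}  ~{arithmetic_intensity(b):6.1f} FLOPs/byte")
            | ```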
        
           | briangriffinfan wrote:
           | 50% is my personal threshold for a discount going from not
           | worth it to worth it.
        
         | bayesianbot wrote:
          | DeepSeek has taken a slightly different route - they give an
          | automatic 75% discount between 16:30-00:30 UTC
         | 
         | https://api-docs.deepseek.com/quick_start/pricing
        
         | dlvhdr wrote:
         | The latest price increases beg to differ
        
           | dmos62 wrote:
           | What price increases?
        
             | rvnx wrote:
             | I guess the Gemini price increase
        
               | dmos62 wrote:
               | Ah, 2.5 flash non-thinking price was increased to match
               | the price of 2.5 flash thinking.
        
               | Workaccount2 wrote:
               | No, 2.5 flash non-thinking was replaced with 2.5 flash
                | lite, and 2.5 flash thinking had its cost rebalanced
               | (input price increased/output price decreased)
               | 
               | 2.5 flash non-thinking doesn't exist anymore. People call
               | it a price increase but it's just confusion about what
               | Google did.
        
               | sunaookami wrote:
               | They try to frame it as such but 2.5 Flash Lite is not
               | the same as 2.5 Flash without thinking. It's worse.
        
           | dist-epoch wrote:
           | Only because Flash was mispriced to start with. It was set
           | too cheap compared with its capabilities. They didn't raise
           | the price of Pro.
        
         | laborcontract wrote:
         | One open secret is that batch mode generations often take much
         | less than 24 hours. I've done a lot of generations where I get
         | my results within 5ish minutes.
        
           | ridgewell wrote:
            | To my understanding it can depend a lot on the shape of your
            | batch. A small batch job can be scheduled a lot quicker than
            | a large batch job that has to wait for just the right moment
            | where capacity fits.
        
       | dsjoerg wrote:
       | We used the previous version of this batch mode, which went
       | through BigQuery. It didn't work well for us at the time because
       | we were in development mode and we needed faster cycle time to
       | iterate and learn. Sometimes the response would come back much
       | faster than 24 hours, but sometimes not. There was no visibility
       | offered into what response time you would get; just submit and
       | wait.
       | 
       | You have to be pretty darn sure that your job is going to do
       | exactly what you want to be able to wait 24 hours for a response.
       | It's like going back to the punched-card era. If I could get even
       | 1% of the batch in a quicker response and then the rest more
       | slowly, that would have made a big difference.
        
         | cpard wrote:
         | It seems that the 24h SLA is standard for batch inference among
         | the vendors and I wonder how useful it can be when you have no
         | visibility on when the job will be delivered.
         | 
         | I wonder why they do that and who is actually getting value out
         | of these batch APIs.
         | 
         | Thanks for sharing your experience!
        
           | vineyardmike wrote:
            | Like most batch processes, it's not useful if you don't
            | know what the response will be and you're iterating
            | interactively. But for data pipelines, analytics workloads,
            | etc., you can handle that delay because no one is waiting on
            | the response.
           | 
           | I'm a developer working on a product that lets users upload
           | content. This upload is not time sensitive. We pass the
           | content through a review pipeline, where we did moderation
           | and analysis, and some business-specific checks that the user
           | uploaded relevant content. We're migrating some of that to an
           | LLM based approach because (in testing) the results are just
           | as good, and tweaking a prompt is easier than updating code.
           | We'll probably use a batch API for this and accept that
           | content can take 24 hours to be audited.
        
             | cpard wrote:
             | yeah I get that part of batch, but even with batch
             | processing, you usually want to have some kind of sense of
             | when the data will be done. Especially when downstream
             | processes depend on that.
             | 
              | The other part that I think makes batch LLM inference
              | unique is that the results are not deterministic. That's
              | why I think the parent's point stands: at least some of
              | the data should be available earlier, even if the rest
              | only arrives within 24h.
        
           | 3eb7988a1663 wrote:
           | Think of it like you have a large queue of work to be done
           | (eg summarize N decades of historical documents). There is
           | little urgency to the outcome because the bolus is so large.
           | You just want to maintain steady progress on the backlog
           | where cost optimization is more important than timing.
        
             | cpard wrote:
             | yes, what you describe feels like a one off job that you
             | want to run, which is big and also not time critical.
             | 
             | Here's an example:
             | 
             | If you are a TV broadcaster and you want to summarize and
             | annotate the content generated in the past 12 hours you
             | most probably need to have access to the summaries of the
             | previous 12 hours too.
             | 
             | Now if you submit a batch job for the first 12 hours of
             | content, you might end up in a situation where you want to
             | process the next batch but the previous one is not
             | delivered yet.
             | 
              | And imo that's fine as long as you somehow know whether it
              | will take more than 12h to complete - but it might be
              | delivered to you in 1h or in 23h.
              | 
              | That's the part of these batch APIs that I find hard to
              | understand: how do you use them in a production environment
              | outside of one-off jobs?
        
           | YetAnotherNick wrote:
            | Contrary to other comments, it's likely not because of queue
            | or general batch reasons. I think it is because LLMs are
            | unique in the sense that they require a lot of fixed nodes
            | because of VRAM requirements, and hence are harder to
            | autoscale. So likely the batch jobs are executed when there
            | are free resources from the interactive servers.
        
             | cpard wrote:
             | that makes total sense and what it entails is that
             | interactive inference >>> batch inference in the market
             | today in terms of demand.
        
             | dekhn wrote:
             | Yes, almost certainly in this case Google sees traffic die
             | off when a data center is in the dark. Specifically, there
             | is a diurnal cycle of traffic, and Google usually routes
             | users to close-by resources. So, late at night, all those
             | backends which were running hot doing low-latency replies
             | to users in near-real-time can instead switch over to
             | processing batches. When I built an idle cycle harvester at
              | Google, I thought most of the free cycles would come from
             | low-usage periods, but it turned out that some clusters
             | were just massively underutilized and had free resources
             | all 24 hours.
        
           | jampa wrote:
           | > who is actually getting value out of these batch APIs
           | 
           | I used the batch API extensively for my side project, where I
           | wanted to ingest a large amount of images, extract
           | descriptions, and create tags for searching. After you get
           | the right prompt, and the output is good, you can just use
           | the Batch API for your pipeline. For any non-time-sensitive
           | operations, it is excellent.
        
             | cpard wrote:
             | What you describe makes total sense. I think that the
             | tricky part is the "non-time-sensitive operations", in an
             | environment where even if you don't care to have results in
             | minutes, you have pipelines that run regularly and there
             | are dependencies on them.
             | 
             | Maybe I'm just thinking too much in data engineering terms
             | here.
        
           | dist-epoch wrote:
           | > you have no visibility on when the job will be delivered
           | 
           | You do have - within 24 hours. So don't submit requests you
           | need in 10 hours.
        
         | serjester wrote:
         | We've submitted tens of millions of requests at a time and
         | never had it take longer than a couple hours - I think the zone
         | you submit to plays a role.
        
         | Jensson wrote:
         | > If I could get even 1% of the batch in a quicker response and
         | then the rest more slowly, that would have made a big
         | difference.
         | 
         | You can do this, just send 1% using the regular API.
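          | 
          | A minimal sketch of that split (call_realtime and submit_batch
          | are hypothetical placeholders for the synchronous and batch
          | endpoints):
          | 
          | ```python
          | import random
          | 
          | def split_and_submit(requests, sample_frac=0.01):
          |     """Send a small random sample synchronously for quick
          |     feedback and file the rest as a discounted batch job."""
          |     random.shuffle(requests)
          |     cut = max(1, int(len(requests) * sample_frac))
          |     sample, rest = requests[:cut], requests[cut:]
          |     quick = [call_realtime(r) for r in sample]  # fast sanity check
          |     batch_job = submit_batch(rest)              # remaining ~99% at batch rates
          |     return quick, batch_job
          | ```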
        
           | Implicated wrote:
            | I was also rather puzzled by this comment - why not dev
            | against the real-time endpoints and switch to batch once
            | you've got things where you need them?
        
         | lazharichir wrote:
         | You can also do gemini flash lite for a subset and then batch
         | the rest with flash or pro
        
       | nnx wrote:
       | It would be nice if OpenRouter supported batch mode too, sending
       | a batch and letting OpenRouter find the best provider for the
       | batch within given price and response time.
        
       | pugio wrote:
       | Hah, I've been wrestling with this ALL DAY. Another example of
       | Phenomenal Cosmic Powers (AI) combined with itty bitty docs
        | (typical of Google). The main endpoint
        | ("https://generativelanguage.googleapis.com/v1beta/models/gemi...")
        | doesn't even have
       | actual REST documentation in the API. The Python API has 3
       | different versions of the same types. One of the main ones
       | (`GenerateContentRequest`) isn't available in the newest path
       | (`google.genai.types`) so you need to find it in an older
       | version, but then you start getting version mismatch errors, and
       | then pydantic errors, until you finally decide to just cross your
       | fingers and submit raw JSON, only to get opaque API errors.
       | 
       | So, if anybody else is frustrated and not finding anything online
       | about this, here are a few things I learned, specifically for
       | structured output generation (which is a main use case for
       | batching) - the individual request JSON should resolve to this:
       | 
        | ```json
        | {
        |   "request": {
        |     "contents": [
        |       { "parts": [ { "text": "Give me the main output please" } ] }
        |     ],
        |     "system_instruction": {
        |       "parts": [ { "text": "You are a main output maker." } ]
        |     },
        |     "generation_config": {
        |       "response_mime_type": "application/json",
        |       "response_json_schema": {
        |         "type": "object",
        |         "properties": {
        |           "output1": { "type": "string" },
        |           "output2": { "type": "string" }
        |         },
        |         "required": [ "output1", "output2" ]
        |       }
        |     }
        |   },
        |   "metadata": { "key": "my_id" }
        | }
        | ```
       | 
       | To get actual structured output, don't just do
       | `generation_config.response_schema`, you need to include the
       | mime-type, and the key should be `response_json_schema`. Any
       | other combination will either throw opaque errors or won't
       | trigger Structured Output (and will contain the usual LLM intros
       | "I'm happy to do this for you...").
       | 
       | So you upload a .jsonl file with the above JSON, and then you try
       | to submit it for a batch job. If something is wrong with your
       | file, you'll get a "400" and no other info. If something is wrong
       | with the request submission you'll get a 400 with "Invalid JSON
       | payload received. Unknown name \"file_name\" at
       | 'batch.input_config.requests': Cannot find field."
       | 
        | I got the above error endless times when trying _their exact
        | sample code_:
        | 
        | ```
        | BATCH_INPUT_FILE='files/123456' # File ID
        | curl https://generativelanguage.googleapis.com/v1beta/models/gemi... \
        |   -X POST \
        |   -H "x-goog-api-key: $GEMINI_API_KEY" \
        |   -H "Content-Type:application/json" \
        |   -d "{
        |     'batch': {
        |       'display_name': 'my-batch-requests',
        |       'input_config': {
        |         'requests': { 'file_name': ${BATCH_INPUT_FILE} }
        |       }
        |     }
        |   }"
        | ```
       | 
       | Finally got the job submission working via the python api
       | (`file_batch_job = client.batches.create()`), but remember, if
       | something is wrong with the file you're submitting, they won't
       | tell you what, or how.
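        | 
        | For reference, the Python path that eventually worked looks
        | roughly like this (a sketch assuming the google-genai SDK;
        | argument names may differ slightly between SDK versions):
        | 
        | ```python
        | from google import genai
        | 
        | client = genai.Client(api_key="GEMINI_API_KEY")
        | 
        | # batch_requests.jsonl contains one request object per line,
        | # shaped like the JSON above.
        | uploaded = client.files.upload(file="batch_requests.jsonl")
        | 
        | file_batch_job = client.batches.create(
        |     model="models/gemini-2.5-flash",
        |     src=uploaded.name,
        |     config={"display_name": "my-batch-requests"},
        | )
        | print(file_batch_job.name, file_batch_job.state)
        | ```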
        
         | TheTaytay wrote:
         | Thank you for posting this! (When I run into errors with posted
         | sample code, I spend WAY too long assuming it's my fault.)
        
       | great_psy wrote:
        | Is this an indication of the peak of the AI bubble?
       | 
       | In a way this is saying that there are some GPUs just sitting
       | around so they would rather get 50% than nothing for their use.
        
         | graeme wrote:
         | Seems more like electricity pricing, which has peak and offpeak
         | pricing for most business customers.
         | 
         | To handle peak daily load you _need_ capacity that goes unused
         | in offpeak hours.
        
         | reasonableklout wrote:
         | Why do you think that this means "idle GPU" rather than a
         | company recognizing a growing need and allocating resources
         | toward it?
         | 
         | It's cheaper because it's a different market with different
          | needs, which can be served by systems optimizing for
          | throughput instead of latency. Feels like you're looking for
          | something that's not there.
        
       | dmitry-vsl wrote:
       | Is it possible to use batch mode with fine-tuned models?
        
       | segalord wrote:
        | Man, Google's offerings are so inconsistent. Batch processing has
        | been available on Vertex for a while now; I don't really get why
        | they have two different offerings in Vertex and Gemini, both
        | equally inaccessible.
        
         | nikolayasdf123 wrote:
         | omg I realized this is not Vertex AI _face-palm_
        
         | rockwotj wrote:
          | It's because Vertex is the "enterprise" offering that is HIPAA
          | compliant, etc. That is why Vertex only has explicit prompt
          | caching and not implicit, etc. Vertex usage is never used for
          | training or model feedback, but Gemini API usage is. Basically
          | the Gemini API is Google's way of being able to move faster
          | like OpenAI and the other foundation model providers, while
          | still having an enterprise offering. Go check Anthropic's
          | documentation; they even say if you have enterprise or
          | regulatory needs, go use Bedrock or Vertex.
        
           | Deathmax wrote:
           | Vertex's offering of Gemini very much does implicit caching,
            | and that has always been the case [1]. The recent addition of
           | applying implicit cache hit discounts also works on Vertex,
           | as long as you don't use the `global` endpoint and hit one of
           | the regional endpoints.
           | 
            | [1]: http://web.archive.org/web/20240517173258/https://cloud.goog...,
            | "By default Google caches a customer's inputs and
           | outputs for Gemini models to accelerate responses to
           | subsequent prompts from the customer. Cached contents are
           | stored for up to 24 hours."
        
       | druskacik wrote:
       | I've been using OpenAI's batch API for some time, then replaced
       | it with Mistral's batch API because it was cheaper (Mistral Small
       | with $0.10 / $0.20 per million tokens was perfect for my use
       | case). This makes me rethink my choice, e.g. Gemini 2.5 Flash-
       | Lite seems to be a better model[0] with only a slight price
       | increase.
       | 
       | [0] https://artificialanalysis.ai/leaderboards/models
        
       | tucnak wrote:
       | I really hope it means that 2.5 models will be available for
       | batching in Vertex, too. We had spent quite a bit of effort
       | making it work with BigQuery, and it's really cool when it works.
        | There's an edge case, though, where it doesn't work: when the
        | batch also refers to a cached prompt. We did report this a few
        | months ago.
        
       | anupj wrote:
       | Batch Mode for the Gemini API feels like Google's way of asking,
       | "What if we made AI more affordable and slower, but at massive
       | scale?" Now you can process 10,000 prompts like "Summarize each
       | customer review in one line" for half the cost, provided you're
       | willing to wait until tomorrow for the results.
        
         | dist-epoch wrote:
         | Most LLM providers have batch mode. Not sure why you are
         | calling them out.
        
           | okdood64 wrote:
            | I'll take it further: regular cloud compute has had batch
            | workload capabilities at cheaper rates since forever.
        
         | diggan wrote:
         | > Now you can process 10,000 prompts like "Summarize each
         | customer review in one line" for half the cost, provided you're
         | willing to wait until tomorrow for the results.
         | 
         | Sounds like a great option to have available? Not every task I
         | use LLMs for need immediate responses, and if I wasn't using
         | local models for those things, getting a 50% discount and
         | having to wait a day sounds like a fine tradeoff.
        
         | XTXinverseXTY wrote:
         | This is an extremely common use case.
         | 
         | Reading your comment history: are you an LLM?
         | 
         | https://news.ycombinator.com/item?id=44531907
         | 
         | https://news.ycombinator.com/item?id=44531868
        
         | okdood64 wrote:
         | I don't understand the point you're making. This has been a
         | commonly used offering since cloud blew up.
         | 
         | https://aws.amazon.com/ec2/spot/
        
       | kerisi wrote:
        | I've been using this with nothing notable to mention, besides
        | that there seems to be a common bug where you receive an empty
        | text response.
       | 
       | https://discuss.ai.google.dev/t/gemini-2-5-pro-with-empty-re...
        
       | lopuhin wrote:
       | I find OpenAI's new flex processing more attractive, as it has
        | the same 50% discount, but lets you use the same API as regular
       | chat mode, so you can still do stuff where Batch API won't work
       | (e.g. evaluating agents), and in practice I found it to work well
       | enough when paired with client-side request caching:
       | https://platform.openai.com/docs/guides/flex-processing?api-...
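        | 
        | A minimal sketch of what that looks like (assuming the openai
        | Python SDK; the model name is illustrative, since flex is only
        | offered for certain models):
        | 
        | ```python
        | from openai import OpenAI
        | 
        | # Flex requests can queue for a while, so allow a generous timeout.
        | client = OpenAI(timeout=900.0)
        | 
        | resp = client.chat.completions.create(
        |     model="o4-mini",       # illustrative
        |     service_tier="flex",   # same chat API, ~50% cheaper, best-effort latency
        |     messages=[{"role": "user", "content": "Summarize this review: ..."}],
        | )
        | print(resp.choices[0].message.content)
        | ```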
        
         | irthomasthomas wrote:
         | It's nice that they stack the batch pricing and caching
         | discount. I asked the Google guy if they did the same but got
         | no reply, so probably not.
         | 
          | Edit: Anthropic also stacks batching and caching discounts
        
       ___________________________________________________________________
       (page generated 2025-07-11 23:01 UTC)