[HN Gopher] Is Gemini 2.5 good at bounding boxes?
       ___________________________________________________________________
        
       Is Gemini 2.5 good at bounding boxes?
        
       Author : simedw
       Score  : 248 points
       Date   : 2025-07-10 12:35 UTC (10 hours ago)
        
 (HTM) web link (simedw.com)
 (TXT) w3m dump (simedw.com)
        
       | EconomistFar wrote:
        | Really interesting piece; the bit about tight vs loose bounding
        | boxes got me thinking. Small inaccuracies can add up fast,
        | especially in edge cases or when training on limited data.
       | 
       | Has anyone here found good ways to handle bounding box quality in
       | noisy datasets? Do you rely more on human annotation or clever
       | augmentation?
        
         | simedw wrote:
         | Thank you! Better training data is often the key to solving
         | these issues, though it can be a costly solution.
         | 
         | In some cases, running a model like SAM 2 on a loose bounding
         | box can help refine the results. I usually add about 10%
         | padding in each direction to the bounding box, just in case the
          | original was too tight. Then if you don't actually need the
          | mask, you just convert it back to a bounding box.
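          | 
          | A rough sketch of that refinement step, for anyone curious (the
          | `run_sam2` call is a stand-in for however you invoke SAM 2; the
          | padding and mask-to-box parts are the point):
          | 
          |     import numpy as np
          | 
          |     def pad_box(box, img_w, img_h, pad=0.10):
          |         # box is (xmin, ymin, xmax, ymax) in pixels
          |         xmin, ymin, xmax, ymax = box
          |         dx, dy = (xmax - xmin) * pad, (ymax - ymin) * pad
          |         return (max(0, xmin - dx), max(0, ymin - dy),
          |                 min(img_w, xmax + dx), min(img_h, ymax + dy))
          | 
          |     def mask_to_box(mask):
          |         # binary mask (H, W) -> tight (xmin, ymin, xmax, ymax)
          |         ys, xs = np.where(mask)
          |         return (xs.min(), ys.min(), xs.max(), ys.max())
          | 
          |     # loose = Gemini's box
          |     # mask = run_sam2(image, pad_box(loose, img_w, img_h))
          |     # tight = mask_to_box(mask)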
        
         | steinvakt2 wrote:
         | I actually did this in my paper:
         | https://scholar.google.com/scholar?cluster=14980420937479044...
        
       | serjester wrote:
       | I wrote a similar article a couple of months ago, but focusing
       | instead on PDF bounding boxes--specifically, drawing boxes around
       | content excerpts.
       | 
       | Gemini is really impressive at these kinds of object detection
       | tasks.
       | 
       | https://www.sergey.fyi/articles/using-gemini-for-precise-cit...
        
         | simedw wrote:
         | That's really interesting, thanks for sharing!
         | 
         | Are you using that approach in production for grounding when
         | PDFs don't include embedded text, like in the case of scanned
         | documents? I did some experiments for that use case, and it
         | wasn't really reaching the bar I was hoping for.
        
           | serjester wrote:
            | Yes, this was completely image-based. Not quite at the point
            | of using it in production, since I agree it can be flaky at
            | times. Although I do think there are viable workarounds, like
            | sending the same prompt multiple times and seeing if the
            | returned results overlap.
           | 
           | It really feels like we're maybe half a model generation away
           | from this being a solved problem.
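            | 
            | The overlap check I have in mind is just IoU over repeated
            | runs; a minimal sketch (the `ask_gemini_for_box` call is a
            | placeholder for the actual API request):
            | 
            |     def iou(a, b):
            |         # a, b are (xmin, ymin, xmax, ymax)
            |         ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
            |         ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
            |         inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
            |         area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            |         return inter / (area(a) + area(b) - inter + 1e-9)
            | 
            |     # boxes = [ask_gemini_for_box(page_image, prompt)
            |     #          for _ in range(3)]
            |     # keep the answer only if every pair of runs agrees
            |     # agree = all(iou(boxes[i], boxes[j]) > 0.8
            |     #             for i in range(3)
            |     #             for j in range(i + 1, 3))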
        
         | svat wrote:
         | Thanks for this post -- I'm doing something similar for a
         | personal/hobby project (just trying to work with very old
         | scanned PDFs in Sanskrit etc), and the bounding box next to
         | "Sub-TOI" in your screenshot
         | (https://www.sergey.fyi/images/bboxes/annotated-filing.webp) is
         | like something I'm encountering too: it clearly "knows" that
         | there is a box of a certain width and height, but somehow the
         | box is offset from its actual location. Do you have any
         | insights into that kind of thing, and did anything you try fix
         | that?
        
           | serjester wrote:
           | I suspect this is a remnant of how images get tokenized -
           | simplest solution is probably to increase the buffer.
        
             | svat wrote:
             | What buffer are you referring to / how do you increase it?
             | And did that solution work for you (if you happened to
             | try)?
        
       | thegeomaster wrote:
       | A detail that is not mentioned is that Google models >= Gemini
       | 2.0 are all explicitly post-trained for this task of bounding box
       | detection: https://ai.google.dev/gemini-api/docs/image-
       | understanding
       | 
       | Given that the author is using the specific `box_2d` format, it
       | suggests that he is taking advantage of this feature, so I wanted
       | to highlight it. My intuition is that a base multimodal LLM
       | without this type of post-training would have much worse
       | performance.
        
         | simedw wrote:
          | That's true; it's also why I didn't benchmark against any other
         | model provider.
         | 
         | It has been tuned so heavily on this specific format that even
         | a tiny change, like switching the order in the `box_2d` format
          | from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`,
         | causes performance to tank.
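          | 
          | For anyone reproducing this, the parsing side is tiny as long
          | as you keep the documented order; roughly (the `response_text`
          | and image size variables are whatever your client gives you):
          | 
          |     import json
          | 
          |     def box_2d_to_pixels(box_2d, img_w, img_h):
          |         # Gemini's order: [ymin, xmin, ymax, xmax], 0-1000
          |         ymin, xmin, ymax, xmax = box_2d
          |         return (xmin / 1000 * img_w, ymin / 1000 * img_h,
          |                 xmax / 1000 * img_w, ymax / 1000 * img_h)
          | 
          |     # response_text is a JSON list like:
          |     # [{"box_2d": [250, 120, 400, 380], "label": "dog"}, ...]
          |     # dets = json.loads(response_text)
          |     # boxes = [box_2d_to_pixels(d["box_2d"], img_w, img_h)
          |     #          for d in dets]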
        
           | pbhjpbhj wrote:
           | That's interesting because it suggests the meaning and
           | representation are very tightly linked; I would expect it to
           | be less tightly coupled given Gemini is multimodal.
        
         | demirbey05 wrote:
          | I was really shocked when I first saw this, but yes, it's in
          | the training data. Not a thinking feature.
        
         | xnx wrote:
         | It's really impressive what Gemini models can do. Segmentation
         | too! https://ai.google.dev/gemini-api/docs/image-
         | understanding#se...
        
           | sergiotapia wrote:
           | this is very cool!
        
         | IncreasePosts wrote:
         | Why do they do post training instead of just delegating
         | segmentation to a smaller/purpose-built model?
        
           | thegeomaster wrote:
           | Post-training allows leveraging the considerable world and
           | language understanding of the underlying pretrained model.
           | Intuition is that this would be a boost to performance.
        
       | svat wrote:
       | Thanks for this post; it's inspiring -- for a personal project
       | I'm trying just to get bounding boxes from scanned PDF pages
       | (around paragraphs/verses/headings etc), and so far did not get
       | great results. (It seems to recognize the areas but then the
       | boxes are offset/translated by some amount.) I only just got
       | started and haven't looked closely yet (I'm sure the results can
       | be improved, looking at this post), but I can already see that
       | there are a bunch of things to explore:
       | 
       | - Do you ask the multimodal LLM to return the image with boxes
       | drawn on it (and then somehow extract coordinates), or simply ask
       | it to return the coordinates? (Is the former even possible?)
       | 
        | - Does it do better or worse when you ask it for [xmin, xmax,
        | ymin, ymax] or [x, y, width, height] (or various permutations
        | thereof)?
       | 
       | - Do you ask for these coordinates as integer pixels (whose
       | meaning can vary with dimensions of the original image), or
       | normalized between 0.0 and 1.0 (or 0-1000 as in this post)?
       | 
       | - Is it worth doing it in two rounds: send it back its initial
       | response with the boxes drawn on it, to give it another
       | opportunity to "see" its previous answer and adjust its
       | coordinates?
       | 
        | I ought to look at these things, but I'm wondering: as you (or
       | others) work on something like this, how do you keep track of
       | which prompts seem to be working better? Do you log all requests
       | and responses / scores as you go? I didn't do that for my initial
       | attempts, and it feels a bit like shooting in the dark / trying
       | random things until something works.
        
         | pkilgore wrote:
          | The model seems to be trained to pick up on the existence of
          | the words "bounding box" or "segmentation mask", and if so
          | returns Array<{ box_2d: [number, number, number, number],
          | label: string, mask: "base64/png" }>, where box_2d is
          | [y0,x0,y1,x1] for bounding boxes, if you ask it for JSON too.
         | 
         | Recommend the Gemini docs here, they are implicit on some of
         | these points.
         | 
         | Prompts matter too, less is more.
         | 
          | And you need to submit images to get good bounding boxes. You
          | can somewhat infer this from the token counts, but the Gemini
          | APIs do something to PDFs (OCR, I assume) that causes them to
          | completely lose location context on the page. If you send the
          | page in as an image, that context isn't lost and the boxes are
          | great.
         | 
          | As an example of this, you can send a PDF page with text in
          | only the top half, the bottom half empty. If you ask it to draw
          | a bounding box around the last paragraph, it tends to return a
          | result with a much higher number on the normalized scale
          | (further down the page) than it should be. In one experiment I
          | did, it thought a footer text that was actually about 2/3 down
          | the page was all the way at the end. When I sent it as an
          | image, it put it at around the 660 mark on the normalized 1000
          | scale, exactly where you would expect it.
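          | 
          | For anyone hitting the same thing, the fix is just rendering
          | the page before sending it; a sketch using pdf2image (any
          | rasterizer works; purely illustrative):
          | 
          |     from pdf2image import convert_from_path  # needs poppler
          | 
          |     # render each page so the model sees layout, not just
          |     # extracted text
          |     pages = convert_from_path("filing.pdf", dpi=200)
          |     pages[0].save("page-1.png")
          |     # then send page-1.png (not the PDF) with the
          |     # bounding-box prompt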
        
           | mdda wrote:
            | You've got to be careful with PDFs: we can't see how they are
            | rendered internally for the LLM, so there may be differences
            | in how it treats the margins/gutters/bleeds that we should
            | account for (and cannot).
        
       | Alifatisk wrote:
        | I might be completely off here, but it kinda feels like
        | multimodal LLMs are our silver bullet for all sorts of
        | technological problems? From text analysis to video generation to
        | bounding boxes, it's kinda incredible!
        | 
        | And hopefully with diffusion-based LLMs, we might even see real-
        | time applications?
        
       | bee_rider wrote:
       | I wonder how the power consumption compares. I'd expect the
       | classic CNN to be cheaper just because it is more specialized.
       | 
       | > The allure of skipping dataset collection, annotation, and
       | training is too enticing not to waste a few evenings testing.
       | 
        | How does annotation work? Do you actually have to mark every
        | pixel of "the thing," or does the training process just accept
        | images with "a thing" inside them, and learn to ignore all the
        | "not the thing" stuff that tends to show up? If it is the latter,
        | maybe Gemini with its mediocre bounding boxes could be used as an
        | infinitely un-bore-able annotator instead.
        
         | joelthelion wrote:
         | If it works, you could use the llms for the first few thousand
         | cases, then use these annotations to train an efficient
         | supervised model, and switch to that.
         | 
         | That way it would be both efficient and cost-effective.
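          | 
          | e.g. dump the LLM's boxes straight into YOLO-style label files
          | and train a small detector on those; a rough sketch (the
          | detections list is whatever you parsed out of the LLM):
          | 
          |     from pathlib import Path
          | 
          |     def write_yolo_label(path, detections, img_w, img_h):
          |         # detections: [(class_id, (x0, y0, x1, y1)), ...]
          |         # YOLO wants "class cx cy w h", normalized to 0-1
          |         lines = []
          |         for cls, (x0, y0, x1, y1) in detections:
          |             cx = (x0 + x1) / 2 / img_w
          |             cy = (y0 + y1) / 2 / img_h
          |             w, h = (x1 - x0) / img_w, (y1 - y0) / img_h
          |             line = f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
          |             lines.append(line)
          |         Path(path).write_text("\n".join(lines))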
        
           | bee_rider wrote:
            | It's always fraught to make analogies between human brains
            | and these learning models, but this sounds a bit like muscle
            | memory or something.
        
       | nolok wrote:
        | Not directly related, but still kind of: I've more or less
        | settled on Gemini lately and often use it "for fun", not to do
        | the task but to see if it could do it better than me, or in a
        | novel or more efficient way. NotebookLM and Canvas work nicely
        | and it felt easy to use.
       | 
        | I've been absurdly surprised at how good it is at some things,
        | and how bad it is at others, and notably that the things it seems
        | worst at are the easy-picking parts.
       | 
        | Let me give an example: I was checking my employees' payslips for
        | the last few months with it, the various wires related to their
        | salaries and taxes, and my social declaration papers for labor
        | taxes (which in France are very numerous and complex to follow).
        | I had found a discrepancy in a couple of declarations that
        | ultimately led to a few dozen euros lost over some back and
        | forth. Figuring it out by myself took me a while, and was not
        | fun; I had the right accounting total and almost everything was
        | okay, and ultimately it was a case of a credit being applied
        | while an unrelated malus was also applied, both to some employees
        | but not others, and the collision made it a pain to find.
       | 
       | Providing all the papers to gemini and asking it to check if
       | everything was fine, it found me a bazillion "weird things", all
       | mostly correct but worth checking, but not the real problem.
       | 
        | Giving it the same papers, telling it the problem I had and where
        | to look without being sure, it found it for me with decent
        | detail, making me confident that next time I can use it not to
        | solve the problem, but to be put on the right track much, much
        | faster than without Gemini.
       | 
        | Giving it the same papers, the problem, and also the solution I
        | had, but asking it to give me more details, again provided great
        | results and actually helped me clarify which lines collided in
        | which order; again not a replacement but a great add-on.
        | Definitely felt like the price I'm paying for it is worth it.
       | 
        | But here is the funny part: in all of those great analyses, it
        | kept trying to tally up totals for me, and there was always one
        | wrong. We're not talking impressive stuff here, but a quite
        | literal case of: here is a 2-column, 5-row table of data, and
        | here is the total, and the total is wrong, and I needed to ask it
        | like 3 or 4 times in a row to fix its total until it agreed /
        | found its issue (which it literally was).
       | 
        | Despite being a bit amused (and intrigued) at the "show thinking"
        | detail of that, where I saw it do the same calculation in half a
        | dozen different ways to try and find how I came up with my
        | number, it really showed me how weirdly differently from us those
        | things work (or "think", some would say).
       | 
        | If it's not thinking but just emergent behavior from text
        | assimilation, which it's supposed to be, then it figuring out
        | something like that in such detail and clarity was impressive in
        | a way I can't quite grasp. But if it's not that, but a genuine
        | thought process of some sort, how could it miss the simplest
        | thing so many times despite being told?
       | 
        | I don't really have a point here, other than that I used to know
        | where I sat on "are the models thinking or not", and the waters
        | have really been muddied for me lately.
       | 
        | There has been lots of talk about whether these things will
        | replace employees or not, and I don't see how they could, but I
        | also don't see how an employee without one could compete with one
        | helped by one as an assistant: "throw ideas at me" or "here is
        | the result I already know, but help me figure out why". That's
        | where they shine very brightly for me.
        
       | chrismorgan wrote:
       | > _Hover or tap to switch between ground truth (green) and Gemini
       | predictions (blue) bounding boxes_
       | 
       | > _Sometimes Gemini is better than the ground truth_
       | 
        | That ain't _ground truth_, that's just what MS-COCO has.
       | 
       | See also https://en.wikipedia.org/wiki/Ground_truth.
        
         | Cubre wrote:
         | Are you implying it "ain't" ground truth because it's not
         | perfect? Ground truth is simply a term used in machine learning
         | to denote a dataset's labels. A quote extracted from the link
         | that you sent acknowledges that ground truth may not be
         | perfect: "inaccuracies in the ground truth will correlate to
         | inaccuracies in the resulting spam/non-spam verdicts".
        
           | chrismorgan wrote:
           | Tell me with a straight face that the car labeling is okay.
           | It's clearly been made by a dodgy automated system, with no
           | human confirmation of correctness. That ain't ground truth.
        
             | ajcp wrote:
             | You're conflating "truthiness" with "correctness". I
             | realize this sounds like an oxymoron when talking about
             | something called ground "truth", but when we're building
             | ground truth to measure how good our model outputs are, it
             | does not matter what is "true", rather what is "correct".
             | 
              | Our ground truth should reflect the "correct" output
              | expected of the model with regard to its training. So while
              | in many cases "truth" and "correct" should align, there are
              | many, many cases where "truth" is subjective, and so we
              | must settle for "correct".
             | 
             | Case in point: we've trained a model to parse out addresses
             | from a wide-array of forms. Here is an example address as
             | it would appear on the form.
             | 
             | Address: J Smith 123 Example St
             | 
             | City: LA State: CA Zip: 85001
             | 
             | Our ground truth says it should be rendered as such:
             | 
             | Address Line 1: J Smith
             | 
             | Address Line 2: 123 Example St
             | 
             | City: LA
             | 
             | State: CA
             | 
             | ZipCode: 85001
             | 
             | However our model outputs it thusly:
             | 
             | Address Line 1: J Smith 123 Example St
             | 
             | Address Line 2:
             | 
             | City: LA
             | 
             | State: CA
             | 
             | ZipCode: 85001
             | 
              | That may be _true_, as there is only 1 address line and we
             | have a field for "Address Line 1", but it is not _correct_.
             | Sure, there may be a problem with our taxonomy, training
             | data, or any other number of other things, but as far as
             | ground truth goes it is not correct.
        
               | chrismorgan wrote:
               | I fail to see how your example is applicable.
               | 
               | Are you trying to tell me that the COCO labelling of the
               | cars is what you call _correct_?
        
               | ajcp wrote:
               | I'm trying to help you understand what "ground truth"
               | means.
               | 
               | If, as it seems in the article, they are using COCO to
               | establish ground truth, i.e. what COCO says is correct,
                | then whatever COCO comes up with is, by definition,
               | "correct". It is, in effect, the answer, the measuring
               | stick, the scoring card. Now what you're hinting at is
               | that, in this instance, that's a really bad way to
               | establish ground truth. I agree. But that doesn't change
               | what is and how we use ground truth.
               | 
               | Think of it another way:
               | 
               | - Your job is to pass a test.
               | 
               | - To pass a test you must answer a question correctly.
               | 
               | - The answer to that question has already been written
               | down somewhere.
               | 
               | To pass the test does your answer need to be true, or
               | does it need to match what is already written down?
               | 
               | When we do model evaluation the answer needs to match
               | what is already written down.
        
               | ghurtado wrote:
               | You're trying so hard not to learn something new in this
               | thread, that it's almost impressive.
        
       | smus wrote:
       | We benchmarked Gemini 2.5 on 100 open source object detection
       | datasets in our paper: https://arxiv.org/abs/2505.20612 (see
       | table 2)
       | 
        | Notably, performance on out-of-distribution data like that in
        | RF100VL is super degraded.
       | 
        | It worked really well zero-shot (compared to the rest of the
        | foundation model field), achieving 13.3 average mAP, but
        | counterintuitively performance degraded when it was provided
        | visual examples to ground its detections, and when it was
        | provided textual instructions on how to find objects as
        | additional context. So it seems it has some amount of zero-shot
        | object detection training, probably on a few standard datasets,
        | but isn't smart enough to incorporate additional context or its
        | general world knowledge into those detection abilities.
        
       | pkilgore wrote:
       | I wish temperature was a dimension. I believe the Gemini docs
       | even recommend avoiding t=0 to avoid the kinds of spirals the
       | author was talking about with masks.
        
       | sly010 wrote:
       | Genuine question: How does this work? How does an LLM do object
       | detection? Or more generally, how does an LLM do anything that is
        | not text? I always thought tasks like this are usually just
        | handed to another (i.e. vision) model, but the post talks about
        | it as if it's the _same_ model doing both text generation and
        | vision. It doesn't make sense to me why Gemini 2 and 2.5 would
        | have different vision capabilities; shouldn't they both have
        | access to the same purpose-trained, state-of-the-art vision
        | model?
        
         | Legend2440 wrote:
         | It used to be done that way, but newer multimodal LLMs train on
         | a mix of image and text tokens, so they don't need a separate
         | image encoder. There is just one model that handles everything.
        
         | sashank_1509 wrote:
          | You tokenize the image and then pass it through a vision
          | encoder that is generally trained separately from large-scale
          | pretraining (using, say, contrastive captioning) and then added
          | to the model during RLHF. I wouldn't be surprised if the vision
          | encoder is used in pretraining now too; this would be a
          | different objective than next-token prediction, of course
          | (unless they use something like next-token prediction for
          | images, which I don't think is the case).
         | 
         | Different models have different encoders, they are not shared
         | as the datasets across models and even model sizes vary. So
         | performance between models will vary.
         | 
          | What you seem to be thinking is that text models are simply
          | calling an API to a vision model, similar to tool use. That is
          | not what's happening; it is much more built in. The forward
          | pass goes through the vision architecture into the language
          | architecture. Robotics research has been doing this for a
          | while.
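          | 
          | Very roughly, the shape of it looks like this (not any
          | particular model, just the idea that image tokens get projected
          | into the same embedding space the transformer attends over):
          | 
          |     import torch, torch.nn as nn
          | 
          |     d_vis, d_llm, vocab = 256, 512, 32000  # made-up sizes
          | 
          |     # stand-in for a ViT + projector into LLM embedding space
          |     vision_encoder = nn.Linear(16 * 16 * 3, d_vis)
          |     projector = nn.Linear(d_vis, d_llm)
          |     embed = nn.Embedding(vocab, d_llm)
          | 
          |     patches = torch.randn(1, 64, 16 * 16 * 3)  # 64 patches
          |     text_ids = torch.randint(0, vocab, (1, 12))
          | 
          |     # image patches become tokens in the LLM's embedding space
          |     img_tok = projector(vision_encoder(patches))
          |     txt_tok = embed(text_ids)
          |     # one sequence; the transformer attends over both
          |     seq = torch.cat([img_tok, txt_tok], dim=1)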
        
         | Cheer2171 wrote:
         | tokens are tokens
        
         | simonw wrote:
         | > I always thought tasks like this are usually just handed to
          | another (i.e. vision) model, but the post talks about it as if
         | it's the _same_ model doing both text generation and vision.
         | 
         | Most vision LLMs don't actually use a separate vision model.
         | https://huggingface.co/blog/vlms is a decent explanation of
         | what's going on.
         | 
         | Most of the big LLMs these days are vision LLMs - the Claude
         | models, the OpenAI models, Grok and most of the Gemini models
         | all accept images in addition to text. To my knowledge none of
         | them are using tool calling to a separate vision model for
         | this.
         | 
         | Some of the local models can do this too - Mistral Small and
         | Gemma 3 are two examples. You can tell they're not tool calling
         | to anything because they run directly out of a single model
         | weights file.
        
           | gylterud wrote:
           | Not a contradiction to anything you said, but O3 will
           | sometimes whip up a python script to analyse the pictures I
           | give it.
           | 
           | For instance, I asked it to compute the symmetry group of a
           | pattern I found on a wallpaper in a Lebanese restaurant this
           | weekend. It realised it was unsure of the symmetries and used
           | a python script to rotate and mirror the pattern and compare
           | to the original to check the symmetries it suspected. Pretty
           | awesome!
        
       | aae42 wrote:
       | i find these discussions comparing the "vision language models"
       | to the old computer vision tech pretty interesting
       | 
       | since there are still strengths the computer vision has, i wonder
       | why someone hasn't made an "uber vision language service" that
        | just exposes the old CV APIs as MCP or something, and has both
       | systems work in conjunction to increase accuracy and
       | understanding
        
       | xrendan wrote:
        | One thing that has surprised me (and I should've known it
        | wouldn't be great at this): it is terrible at creating bounding
        | boxes around things it's not trained on (like bounding parts on a
        | PCB schematic).
        
         | amelius wrote:
         | So this tells us that it does not _understand_ what it is
         | doing, really. No real intelligence here. Might as well use an
         | old-school YOLO network for the task.
        
           | ta8645 wrote:
           | It's just behaving like a child. A child could draw a
           | bounding box around a dog and a cat, but would fail if you
           | told them to draw a box around the transistors of a PCB. They
           | have no idea what a transistor is, or what it looks like.
           | They lack the knowledge and maturity. But you would never
           | claim the child doesn't _understand_ what they're doing, at
           | least not to imply that they're forever incapable of the
           | task.
        
             | amelius wrote:
             | Yeah, but a child does one-shot learning much better. Just
             | tell it to find the black rectangles and it will draw boxes
             | around the transistors of a PCB, no extra training
             | required.
        
               | ta8645 wrote:
               | Perhaps. But I think you'll find there are a lot of black
               | rectangles on a PCB that aren't actually transistors.
               | You'll end up having to teach the child a lot more if you
               | want accurate results. And that's the same kind of
               | training you'll have to give to an LLM.
               | 
               | In either case, your assertion that one _understands_,
               | and the other doesn't, seems like motivated reasoning,
               | rather than identifying something fundamental about the
               | situation.
        
               | amelius wrote:
               | I mean, problem solving with loose specs is always going
               | to be messy.
               | 
               | But at least with a child I can quickly teach it to
               | follow simple orders, while this AI requires hours of
               | annotating + training, even for simple changes in
               | instructions.
        
               | ta8645 wrote:
               | Humans are the beneficiaries of millions of years of
               | evolution, and are born with innate pattern matching
               | abilities that we don't need "training" for; essentially
               | our pre-training. Of course, it is superior to the
               | current generation of LLMs, but is it fundamentally
               | different? I don't know one way or the other to be
               | honest, but judging from how amazing LLMs are given all
               | their limitations and paucity of evolution, I wouldn't
               | bet against it.
               | 
               | The other problem with LLMs today, is that they don't
               | persist any learning they do from their everyday
               | inference and interaction with users; at least not in
               | real-time. So it makes them harder to instruct in a
               | useful way.
               | 
               | But it seems inevitable that both their pre-training, and
               | ability to seamlessly continue to learn afterward, should
               | improve over the coming years.
        
               | graemep wrote:
                | Then you explain that transistors have three wires coming
                | off them.
        
       | mkagenius wrote:
        | Oh yes, it's been good for a while. When we created our Android-
        | use[1] (like computer use) tool, it was the cheapest and the best
        | option among OpenAI, Claude, Llama, etc.
       | 
       | We have a planner phase followed by a "finder" phase where vision
       | models are used. Following is the summary of our findings for
       | planner and finder. Some of them are "work in progress" as they
       | do not support tool calling (or are extremely bad at tool
       | calling).
       | +------------------------+------------------+------------------+
       | | Models                 | Planner          | Finder           |
       | +------------------------+------------------+------------------+
       | | Gemini 1.5 Pro         | recommended      | recommended      |
       | | Gemini 1.5 Flash       | can use          | recommended      |
       | | Openai GPT 4o          | recommended      | work in progress |
       | | Openai GPT 4o mini     | recommended      | work in progress |
       | | llama 3.2 latest       | work in progress | work in progress |
       | | llama 3.2 vision       | work in progress | work in progress |
       | | Molmo 7B-D-4bit        | work in progress | recommended      |
       | +------------------------+------------------+------------------+
       | 
       | 1. https://github.com/BandarLabs/clickclickclick
        
       | fzysingularity wrote:
        | This isn't surprising at all - most VLMs today are quite poor at
        | localization even though they've been explicitly post-trained on
        | object detection tasks.
        | 
        | One insight the author calls out is the inconsistency in
        | coordinate systems used in post-training these - you can't just
        | swap models and get similar results. Gemini uses (ymin, xmin,
        | ymax, xmax) integers b/w 0-1000. Qwen uses (xmin, ymin, xmax,
        | ymax) floats b/w 0-1. We've been evaluating most of the frontier
        | models for bounding boxes / segmentation masks, and this is quite
        | a footgun for new users.
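        | 
        | To make that concrete, normalizing both into one convention is
        | only a few lines, but it's easy to get the axis order silently
        | wrong; a sketch of the two formats as described above:
        | 
        |     def gemini_to_xyxy(box_2d):
        |         # Gemini: (ymin, xmin, ymax, xmax), ints in [0, 1000]
        |         ymin, xmin, ymax, xmax = box_2d
        |         return (xmin / 1000, ymin / 1000,
        |                 xmax / 1000, ymax / 1000)
        | 
        |     def qwen_to_xyxy(box):
        |         # Qwen: (xmin, ymin, xmax, ymax), floats in [0, 1]
        |         return tuple(box)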
       | 
        | One of the reasons we chose to delegate object detection to
        | specialized tools is essentially this poor performance (~0.34 mAP
        | w/ Gemini vs ~0.6 mAP w/ DETR-like architectures). Check out this
        | cookbook [1] we recently released: we use any LLM to delegate
        | tasks like object detection, face detection and other classical
        | CV tasks to a specialized model while still giving the user the
        | dev-ex of a VLM.
       | 
       | [1] https://colab.research.google.com/github/vlm-run/vlmrun-
       | cook...
        
       | muxamilian wrote:
       | I'm rather puzzled by how bad the COCO ground truth is. This is
       | the benchmark dataset for object detection? Wow. I would say
       | Gemini's output is better than the ground truth in most of the
       | example images.
        
       | mehulashah wrote:
       | Cool post. We did a similar evaluation for document segmentation
       | using the DocLayNet benchmark from IBM:
       | https://ds4sd.github.io/icdar23-doclaynet/task/ but on modern
       | document OCR models like Mistral, OpenAI, and Gemini. And what do
       | you know, we found similar performance -- DETR-based segmentation
       | models are about 2x better.
       | 
       | Disclosure: I work for https://aryn.ai/
        
       ___________________________________________________________________
       (page generated 2025-07-10 23:00 UTC)