[HN Gopher] Is Gemini 2.5 good at bounding boxes?
___________________________________________________________________
Is Gemini 2.5 good at bounding boxes?
Author : simedw
Score : 248 points
Date : 2025-07-10 12:35 UTC (10 hours ago)
(HTM) web link (simedw.com)
(TXT) w3m dump (simedw.com)
| EconomistFar wrote:
| Really interesting piece, the bit about tight vs loose bounding
| boxes got me thinking. Small inaccuracies can add up fast,
| especially in edge cases or when training on limited data.
|
| Has anyone here found good ways to handle bounding box quality in
| noisy datasets? Do you rely more on human annotation or clever
| augmentation?
| simedw wrote:
| Thank you! Better training data is often the key to solving
| these issues, though it can be a costly solution.
|
| In some cases, running a model like SAM 2 on a loose bounding
| box can help refine the results. I usually add about 10%
| padding in each direction to the bounding box, just in case the
| original was too tight. Then if you don't actually need the mask,
| you just convert it back to a bounding box.
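|
| To make that concrete, here is a minimal sketch of the refine step
| (pad_box/mask_to_box are just illustrative names; the SAM 2 call
| itself is left out):
|
|     import numpy as np
|
|     def pad_box(box, pad=0.10, img_w=None, img_h=None):
|         # box is (xmin, ymin, xmax, ymax) in pixels; grow it by
|         # `pad` of its own size on each side, clamped to the image.
|         xmin, ymin, xmax, ymax = box
|         dx, dy = (xmax - xmin) * pad, (ymax - ymin) * pad
|         xmin, ymin = xmin - dx, ymin - dy
|         xmax, ymax = xmax + dx, ymax + dy
|         if img_w is not None:
|             xmin, xmax = max(0, xmin), min(img_w, xmax)
|         if img_h is not None:
|             ymin, ymax = max(0, ymin), min(img_h, ymax)
|         return xmin, ymin, xmax, ymax
|
|     def mask_to_box(mask):
|         # Collapse a boolean HxW mask (e.g. from SAM 2) back into
|         # a tight (xmin, ymin, xmax, ymax) bounding box.
|         ys, xs = np.where(mask)
|         return xs.min(), ys.min(), xs.max(), ys.max()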
| steinvakt2 wrote:
| I actually did this in my paper:
| https://scholar.google.com/scholar?cluster=14980420937479044...
| serjester wrote:
| I wrote a similar article a couple of months ago, but focusing
| instead on PDF bounding boxes--specifically, drawing boxes around
| content excerpts.
|
| Gemini is really impressive at these kinds of object detection
| tasks.
|
| https://www.sergey.fyi/articles/using-gemini-for-precise-cit...
| simedw wrote:
| That's really interesting, thanks for sharing!
|
| Are you using that approach in production for grounding when
| PDFs don't include embedded text, like in the case of scanned
| documents? I did some experiments for that use case, and it
| wasn't really reaching the bar I was hoping for.
| serjester wrote:
| Yes, this was completely image-based. Not quite at the point of
| using it in production since I agree it can be flaky at
| times. Although I do think there are viable workarounds, like
| sending the same prompt multiple times and seeing if the
| returned results overlap.
|
| It really feels like we're maybe half a model generation away
| from this being a solved problem.
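|
| A rough sketch of that overlap check, assuming each run returns a
| list of (xmin, ymin, xmax, ymax) boxes (iou/consensus are just
| illustrative names, not anything from the article):
|
|     def iou(a, b):
|         # Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes.
|         ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
|         ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
|         inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
|         area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
|         return inter / (area(a) + area(b) - inter + 1e-9)
|
|     def consensus(runs, thresh=0.8):
|         # Keep a box from the first run only if every other run
|         # produced a box overlapping it by at least `thresh` IoU.
|         first, rest = runs[0], runs[1:]
|         return [b for b in first
|                 if all(any(iou(b, o) >= thresh for o in run)
|                        for run in rest)]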
| svat wrote:
| Thanks for this post -- I'm doing something similar for a
| personal/hobby project (just trying to work with very old
| scanned PDFs in Sanskrit etc), and the bounding box next to
| "Sub-TOI" in your screenshot
| (https://www.sergey.fyi/images/bboxes/annotated-filing.webp) is
| like something I'm encountering too: it clearly "knows" that
| there is a box of a certain width and height, but somehow the
| box is offset from its actual location. Do you have any
| insights into that kind of thing, and did anything you try fix
| that?
| serjester wrote:
| I suspect this is a remnant of how images get tokenized -
| simplest solution is probably to increase the buffer.
| svat wrote:
| What buffer are you referring to / how do you increase it?
| And did that solution work for you (if you happened to
| try)?
| thegeomaster wrote:
| A detail that is not mentioned is that Google models >= Gemini
| 2.0 are all explicitly post-trained for this task of bounding box
| detection: https://ai.google.dev/gemini-api/docs/image-understanding
|
| Given that the author is using the specific `box_2d` format, it
| suggests that he is taking advantage of this feature, so I wanted
| to highlight it. My intuition is that a base multimodal LLM
| without this type of post-training would have much worse
| performance.
| simedw wrote:
| That's true, it's also why I didn't benchmark against any other
| model provider.
|
| It has been tuned so heavily on this specific format that even
| a tiny change, like switching the order in the `box_2d` format
| from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`
| causes performance to tank.
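|
| For reference, the documented box_2d convention is [ymin, xmin,
| ymax, xmax] on a 0-1000 scale, so decoding to pixel coordinates
| looks roughly like this (a sketch, not code from the post):
|
|     def box_2d_to_pixels(box_2d, img_w, img_h):
|         # Gemini returns [ymin, xmin, ymax, xmax] normalized to 0-1000;
|         # rescale to (xmin, ymin, xmax, ymax) in pixels.
|         ymin, xmin, ymax, xmax = box_2d
|         return (xmin / 1000 * img_w, ymin / 1000 * img_h,
|                 xmax / 1000 * img_w, ymax / 1000 * img_h)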
| pbhjpbhj wrote:
| That's interesting because it suggests the meaning and
| representation are very tightly linked; I would expect it to
| be less tightly coupled given Gemini is multimodal.
| demirbey05 wrote:
| I was really shocked when I first saw this, but yes, it's in the
| training data. It's not a thinking feature.
| xnx wrote:
| It's really impressive what Gemini models can do. Segmentation
| too! https://ai.google.dev/gemini-api/docs/image-understanding#se...
| sergiotapia wrote:
| this is very cool!
| IncreasePosts wrote:
| Why do they do post training instead of just delegating
| segmentation to a smaller/purpose-built model?
| thegeomaster wrote:
| Post-training allows leveraging the considerable world and
| language understanding of the underlying pretrained model.
| The intuition is that this would be a boost to performance.
| svat wrote:
| Thanks for this post; it's inspiring -- for a personal project
| I'm trying just to get bounding boxes from scanned PDF pages
| (around paragraphs/verses/headings etc), and so far did not get
| great results. (It seems to recognize the areas but then the
| boxes are offset/translated by some amount.) I only just got
| started and haven't looked closely yet (I'm sure the results can
| be improved, looking at this post), but I can already see that
| there are a bunch of things to explore:
|
| - Do you ask the multimodal LLM to return the image with boxes
| drawn on it (and then somehow extract coordinates), or simply ask
| it to return the coordinates? (Is the former even possible?)
|
| - Does it do better or worse when you ask it for [xmin, xmax, ymin,
| ymax] or [x, y, width, height] (or various permutations thereof)?
|
| - Do you ask for these coordinates as integer pixels (whose
| meaning can vary with dimensions of the original image), or
| normalized between 0.0 and 1.0 (or 0-1000 as in this post)?
|
| - Is it worth doing it in two rounds: send it back its initial
| response with the boxes drawn on it, to give it another
| opportunity to "see" its previous answer and adjust its
| coordinates?
|
| I ought to look at these things, but I'm wondering: as you (or
| others) work on something like this, how do you keep track of
| which prompts seem to be working better? Do you log all requests
| and responses / scores as you go? I didn't do that for my initial
| attempts, and it feels a bit like shooting in the dark / trying
| random things until something works.
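|
| (Even something as simple as appending every attempt to a JSONL
| file would probably help here; a sketch, with made-up field
| names:)
|
|     import json, time
|
|     def log_run(path, prompt, response, score):
|         # One experiment record per line, easy to grep and diff later.
|         record = {"ts": time.time(), "prompt": prompt,
|                   "response": response, "score": score}
|         with open(path, "a") as f:
|             f.write(json.dumps(record) + "\n")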
| pkilgore wrote:
| The model seems to be trained to pick up on the words "bounding
| box" or "segmentation mask", and if it sees them it returns
| Array<{ box_2d: [number, number, number, number], label: string,
| mask: "base64/png" }>, where box_2d is [y0, x0, y1, x1], if you
| ask it for JSON too.
|
| Recommend the Gemini docs here, they are implicit on some of
| these points.
|
| Prompts matter too, less is more.
|
| And you need to submit images to get good bounding boxes. You
| can somewhat infer this from the token counts, but Gemini APIs
| do something to PDFs (OCR, I assume) that causes them to lose
| location context on the page entirely. If you send the page in
| as an image, that context isn't lost and the boxes are great.
|
| As an example of this, you can send a PDF page with text on only
| the top half, the bottom half empty. If you ask it to draw a
| bounding box around the last paragraph, it tends to return a
| result that is a much higher number on the normalized scale
| (lower on the page) than it should be. In one experiment I
| did, it would think a footer text that was actually about 2/3
| down the page was all the way at the end. When I sent it as an
| image, it had it around the 660 mark on the normalized 1000
| scale, exactly where you would expect it.
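|
| If you want to try that yourself, one way to do the PDF-to-image
| step is to rasterize each page first (this sketch assumes PyMuPDF
| is installed and a hypothetical filing.pdf):
|
|     import fitz  # PyMuPDF
|
|     doc = fitz.open("filing.pdf")
|     for i, page in enumerate(doc):
|         pix = page.get_pixmap(dpi=200)   # rasterize so layout survives
|         pix.save(f"page_{i}.png")        # send these PNGs, not the PDF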
| mdda wrote:
| You've got to be careful with PDFs: we can't see how they
| are rendered internally for the LLM, so there may be
| differences in how it's treating the margin/gutters/bleeds
| that we should account for (and cannot).
| Alifatisk wrote:
| I might be completely off here, but it kinda feels like multimodal
| LLMs are our silver bullet for all kinds of technological
| problems? From text analysis to video generation to bounding
| boxes, it's kinda incredible!
|
| And hopefully with diffusion-based LLMs, we might even see real-
| time applications?
| bee_rider wrote:
| I wonder how the power consumption compares. I'd expect the
| classic CNN to be cheaper just because it is more specialized.
|
| > The allure of skipping dataset collection, annotation, and
| training is too enticing not to waste a few evenings testing.
|
| How's annotation work? Do you actually have to mark every pixel
| of "the thing," or does the training process just accept images
| with "a thing" inside it, and learn to ignore all the "not the
| thing" stuff that tends to show up. If it is the latter, maybe
| Gemini with it's mediocre bounding boxes could be used as an
| infinitely un-bore-able annotater instead.
| joelthelion wrote:
| If it works, you could use the LLM for the first few thousand
| cases, then use these annotations to train an efficient
| supervised model, and switch to that.
|
| That way it would be both efficient and cost-effective.
| bee_rider wrote:
| It's always fraught to make analogies between human brains
| and these learning models, but this sounds a bit like muscle
| memory or something.
| nolok wrote:
| Not directly related, but still kind of: I've more or less settled
| on Gemini lately and often use it "for fun", not to do the task
| but to see if it could do it better than me and/or in a novel or
| more efficient way. NotebookLM and Canvas work nicely and feel
| easy to use.
|
| I've been absurdly surprised at how good it is at some things, and
| how bad it is at others, and notably that the things it seems
| worst at are the easy pickings.
|
| Let me give an example: I was checking with it my employees'
| payslips for the last few months, various wire transfers related
| to their salaries and the various taxes, and my social
| declaration papers for labor taxes (which in France are very
| numerous and complex to follow). I had found a discrepancy in a
| couple of declarations that ultimately led to a few dozen euros
| lost over some back and forth. Figuring it out by myself took me
| a while, and was not fun; I had the right accounting total and
| almost everything was okay, and ultimately it was a case of a
| credit being applied while an unrelated malus was also applied,
| both to some employees but not others, and the collision made it
| a pain to find.
|
| Providing all the papers to Gemini and asking it to check if
| everything was fine, it found me a bazillion "weird things", all
| mostly correct but worth checking, but not the real problem.
|
| Giving it the same papers, telling it the problem I had and
| where to look without being sure, it found it for me with decent
| detail, making me confident that next time I can use it not to
| solve it, but to be put on the right track much, much faster than
| without Gemini.
|
| Giving it the same papers, the problem, and also the solution I
| had, but asking it to give me more details, again provided great
| results and actually helped me clarify which lines collided in
| which order; again not a replacement, but a great add-on.
| Definitely felt like the price I'm paying for it is worth it.
|
| But here is the funny part: in all of those great analyses, it
| kept trying to tally totals for me, and there was always one
| wrong. We're not talking impressive stuff here, but the quite
| literal case of "here is a 2-column, 5-row table of data, and
| here is the total", and the total is wrong, and I needed to ask
| it like 3 or 4 times in a row to fix its total until it agreed /
| found its issue (which it did, literally).
|
| Despite being a bit amused (and intrigued) by the "show thinking"
| detail of that, where I saw it do the same calculation in half a
| dozen different ways to try and find how I came up with my
| number, it really showed me how weirdly differently from us these
| things work (or "think", some would say).
|
| If it's not thinking but just emergent behavior from text
| assimilation, which it's supposed to be, then it figuring out
| something like that with such detail and clarity was impressive
| in a way I can't quite grasp. But if it's not that, but a genuine
| thought process of some sort, how could it miss the simplest
| thing so many times despite being told?
|
| I don't really have a point here, other than I used to know where
| I sat on "are the models thinking or not" and the waters have
| really been muddied for me lately.
|
| There has been lots of talk about these things replacing
| employees or not, and I don't see how they could, but I also
| don't see how an employee without one could compete with an
| employee helped by one as an assistant: "throw ideas at me" or
| "here is the result I already know but help me figure out why".
| That's where they shine very brightly for me.
| chrismorgan wrote:
| > _Hover or tap to switch between ground truth (green) and Gemini
| predictions (blue) bounding boxes_
|
| > _Sometimes Gemini is better than the ground truth_
|
| That ain't _ground truth_, that's just what MS-COCO has.
|
| See also https://en.wikipedia.org/wiki/Ground_truth.
| Cubre wrote:
| Are you implying it "ain't" ground truth because it's not
| perfect? Ground truth is simply a term used in machine learning
| to denote a dataset's labels. A quote extracted from the link
| that you sent acknowledges that ground truth may not be
| perfect: "inaccuracies in the ground truth will correlate to
| inaccuracies in the resulting spam/non-spam verdicts".
| chrismorgan wrote:
| Tell me with a straight face that the car labeling is okay.
| It's clearly been made by a dodgy automated system, with no
| human confirmation of correctness. That ain't ground truth.
| ajcp wrote:
| You're conflating "truthiness" with "correctness". I
| realize this sounds like an oxymoron when talking about
| something called ground "truth", but when we're building
| ground truth to measure how good our model outputs are, it
| does not matter what is "true", rather what is "correct".
|
| Our ground truth should reflect the "correct" output
| expected of the model with regard to its training. So while
| in many cases "truth" and "correct" should align, there are
| many, many cases where "truth" is subjective, and so we must
| settle for "correct".
|
| Case in point: we've trained a model to parse out addresses
| from a wide array of forms. Here is an example address as
| it would appear on the form:
|
|     Address: J Smith 123 Example St
|     City: LA   State: CA   Zip: 85001
|
| Our ground truth says it should be rendered as such:
|
|     Address Line 1: J Smith
|     Address Line 2: 123 Example St
|     City: LA
|     State: CA
|     ZipCode: 85001
|
| However our model outputs it thusly:
|
|     Address Line 1: J Smith 123 Example St
|     Address Line 2:
|     City: LA
|     State: CA
|     ZipCode: 85001
|
| That may be _true_, as there is only 1 address line and we
| have a field for "Address Line 1", but it is not _correct_.
| Sure, there may be a problem with our taxonomy, training
| data, or any number of other things, but as far as
| ground truth goes it is not correct.
| chrismorgan wrote:
| I fail to see how your example is applicable.
|
| Are you trying to tell me that the COCO labelling of the
| cars is what you call _correct_?
| ajcp wrote:
| I'm trying to help you understand what "ground truth"
| means.
|
| If, as it seems in the article, they are using COCO to
| establish ground truth, i.e. what COCO says is correct,
| then whatever COCO comes up with is, by definition
| "correct". It is, in effect, the answer, the measuring
| stick, the scoring card. Now what you're hinting at is
| that, in this instance, that's a really bad way to
| establish ground truth. I agree. But that doesn't change
| what is and how we use ground truth.
|
| Think of it another way:
|
| - Your job is to pass a test.
|
| - To pass a test you must answer a question correctly.
|
| - The answer to that question has already been written
| down somewhere.
|
| To pass the test does your answer need to be true, or
| does it need to match what is already written down?
|
| When we do model evaluation the answer needs to match
| what is already written down.
| ghurtado wrote:
| You're trying so hard not to learn something new in this
| thread, that it's almost impressive.
| smus wrote:
| We benchmarked Gemini 2.5 on 100 open source object detection
| datasets in our paper: https://arxiv.org/abs/2505.20612 (see
| table 2)
|
| Notably, performance on out-of-distribution data like that in
| RF100VL is severely degraded.
|
| It worked really well zero-shot (relative to the rest of the
| foundation model field), achieving 13.3 average mAP, but
| counterintuitively performance degraded when it was provided
| visual examples to ground its detections, and when it was
| provided textual instructions on how to find objects as
| additional context. So it seems it has some amount of zero-shot
| object detection training, probably on a few standard datasets,
| but isn't smart enough to incorporate additional context or its
| general world knowledge into those detection abilities.
| pkilgore wrote:
| I wish temperature were a dimension. I believe the Gemini docs
| even recommend avoiding t=0 to avoid the kinds of spirals the
| author was talking about with masks.
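|
| (With the current google-genai SDK, setting it looks roughly like
| this, if I'm remembering the config name right:)
|
|     from google import genai
|     from google.genai import types
|
|     client = genai.Client()  # expects GEMINI_API_KEY in the env
|     resp = client.models.generate_content(
|         model="gemini-2.5-flash",
|         contents="Detect the dogs and return box_2d JSON.",
|         config=types.GenerateContentConfig(temperature=0.5),
|     )
|     print(resp.text)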
| sly010 wrote:
| Genuine question: How does this work? How does an LLM do object
| detection? Or more generally, how does an LLM do anything that is
| not text? I always thought tasks like this are usually just
| handed to another (i.e. vision) model, but the post talks about
| it as if it's the _same_ model doing both text generation and
| vision. It doesn't make sense to me why Gemini 2 and 2.5 would
| have different vision capabilities; shouldn't they both have
| access to the same purpose-trained, state-of-the-art vision
| model?
| Legend2440 wrote:
| It used to be done that way, but newer multimodal LLMs train on
| a mix of image and text tokens, so they don't need a separate
| image encoder. There is just one model that handles everything.
| sashank_1509 wrote:
| You tokenize the image and then pass it through a vision
| encoder that is generally trained separately from large-scale
| pretraining (using, say, contrastive captioning) and then added
| to the model during RLHF. I wouldn't be surprised if the vision
| encoder is used in pretraining now too; this will be a
| different objective than next token prediction of course
| (unless they use something like next token prediction for
| images which I don't think is the case).
|
| Different models have different encoders; they are not shared,
| as the datasets across models and even model sizes vary. So
| performance between models will vary.
|
| What you seem to be thinking is that text models were simply
| calling an API to a vision model, similar to tool-use. That is
| not what's happening; it is much more built in: the forward pass
| is going through the vision architecture to the language
| architecture. Robotics research has been doing this for a
| while.
| Cheer2171 wrote:
| tokens are tokens
| simonw wrote:
| > I always thought tasks like this are usually just handed to
| another (i.e. vision) model, but the post talks about it as if
| it's the _same_ model doing both text generation and vision.
|
| Most vision LLMs don't actually use a separate vision model.
| https://huggingface.co/blog/vlms is a decent explanation of
| what's going on.
|
| Most of the big LLMs these days are vision LLMs - the Claude
| models, the OpenAI models, Grok and most of the Gemini models
| all accept images in addition to text. To my knowledge none of
| them are using tool calling to a separate vision model for
| this.
|
| Some of the local models can do this too - Mistral Small and
| Gemma 3 are two examples. You can tell they're not tool calling
| to anything because they run directly out of a single model
| weights file.
| gylterud wrote:
| Not a contradiction to anything you said, but O3 will
| sometimes whip up a python script to analyse the pictures I
| give it.
|
| For instance, I asked it to compute the symmetry group of a
| pattern I found on a wallpaper in a Lebanese restaurant this
| weekend. It realised it was unsure of the symmetries and used
| a python script to rotate and mirror the pattern and compare
| to the original to check the symmetries it suspected. Pretty
| awesome!
| aae42 wrote:
| I find these discussions comparing "vision language models"
| to the old computer vision tech pretty interesting.
|
| Since there are still strengths the classic computer vision
| stack has, I wonder why someone hasn't made an "uber vision
| language service" that just exposes the old CV APIs as MCP or
| something, and has both systems work in conjunction to increase
| accuracy and understanding.
| xrendan wrote:
| One thing that has surprised me (though I should've known it
| wouldn't be great at it): it is terrible at creating bounding
| boxes around things it's not trained on (like bounding parts on a
| PCB schematic).
| amelius wrote:
| So this tells us that it does not _understand_ what it is
| doing, really. No real intelligence here. Might as well use an
| old-school YOLO network for the task.
| ta8645 wrote:
| It's just behaving like a child. A child could draw a
| bounding box around a dog and a cat, but would fail if you
| told them to draw a box around the transistors of a PCB. They
| have no idea what a transistor is, or what it looks like.
| They lack the knowledge and maturity. But you would never
| claim the child doesn't _understand_ what they're doing, at
| least not to imply that they're forever incapable of the
| task.
| amelius wrote:
| Yeah, but a child does one-shot learning much better. Just
| tell it to find the black rectangles and it will draw boxes
| around the transistors of a PCB, no extra training
| required.
| ta8645 wrote:
| Perhaps. But I think you'll find there are a lot of black
| rectangles on a PCB that aren't actually transistors.
| You'll end up having to teach the child a lot more if you
| want accurate results. And that's the same kind of
| training you'll have to give to an LLM.
|
| In either case, your assertion that one _understands_,
| and the other doesn't, seems like motivated reasoning,
| rather than identifying something fundamental about the
| situation.
| amelius wrote:
| I mean, problem solving with loose specs is always going
| to be messy.
|
| But at least with a child I can quickly teach it to
| follow simple orders, while this AI requires hours of
| annotating + training, even for simple changes in
| instructions.
| ta8645 wrote:
| Humans are the beneficiaries of millions of years of
| evolution, and are born with innate pattern matching
| abilities that we don't need "training" for; essentially
| our pre-training. Of course, it is superior to the
| current generation of LLMs, but is it fundamentally
| different? I don't know one way or the other to be
| honest, but judging from how amazing LLMs are given all
| their limitations and paucity of evolution, I wouldn't
| bet against it.
|
| The other problem with LLMs today is that they don't
| persist any learning they do from their everyday
| inference and interaction with users; at least not in
| real-time. So it makes them harder to instruct in a
| useful way.
|
| But it seems inevitable that both their pre-training, and
| ability to seamlessly continue to learn afterward, should
| improve over the coming years.
| graemep wrote:
| Then you explain that transistors have three wires coming off
| them.
| mkagenius wrote:
| Oh yes, it's been good for a while. When we created our Android-
| use[1] (like computer use) tool, it was the cheapest and the best
| option among OpenAI, Claude, Llama, etc.
|
| We have a planner phase followed by a "finder" phase where vision
| models are used. Following is the summary of our findings for
| planner and finder. Some of them are "work in progress" as they
| do not support tool calling (or are extremely bad at tool
| calling).
| +------------------------+------------------+------------------+
| | Models | Planner | Finder |
| +------------------------+------------------+------------------+
| | Gemini 1.5 Pro | recommended | recommended |
| | Gemini 1.5 Flash | can use | recommended |
| | Openai GPT 4o | recommended | work in progress |
| | Openai GPT 4o mini | recommended | work in progress |
| | llama 3.2 latest | work in progress | work in progress |
| | llama 3.2 vision | work in progress | work in progress |
| | Molmo 7B-D-4bit | work in progress | recommended |
| +------------------------+------------------+------------------+
|
| 1. https://github.com/BandarLabs/clickclickclick
| fzysingularity wrote:
| This isn't surprising at all - most VLMs today are quite poor on
| localization even though they've been explicitly post-trained on
| object detection tasks.
|
| One insight the author calls out is the inconsistency in
| coordinate systems used in post-training these - you can't just
| swap models and get similar results. Gemini uses (ymin, xmin,
| ymax, xmax) integers b/w 0-1000. Qwen uses (xmin, ymin, xmax,
| ymax) floats b/w 0-1. We've been evaluating most of the frontier
| models for bounding boxes / segmentation masks, and this is quite
| a footgun for new users.
|
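| A tiny normalization shim helps when juggling providers; a rough
| sketch assuming the two conventions described above (helper names
| are just illustrative):
|
|     def gemini_to_xyxy(box, w, h):
|         # Gemini: [ymin, xmin, ymax, xmax], integers on a 0-1000 scale.
|         ymin, xmin, ymax, xmax = box
|         return (xmin / 1000 * w, ymin / 1000 * h,
|                 xmax / 1000 * w, ymax / 1000 * h)
|
|     def qwen_to_xyxy(box, w, h):
|         # Qwen: [xmin, ymin, xmax, ymax], floats between 0 and 1.
|         xmin, ymin, xmax, ymax = box
|         return (xmin * w, ymin * h, xmax * w, ymax * h)
|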
| One of the reasons we chose to delegate object detection to
| specialized tools is essentially the poor performance
| (~0.34 mAP w/ Gemini vs. ~0.6 mAP w/ DETR-like architectures).
| Check out this cookbook [1] we recently released; we use any LLM
| to delegate tasks like object-detection, face-detection and other
| classical CV tasks to a specialized model while still giving the
| user the dev-ex of a VLM.
|
| [1] https://colab.research.google.com/github/vlm-run/vlmrun-cook...
| muxamilian wrote:
| I'm rather puzzled by how bad the COCO ground truth is. This is
| the benchmark dataset for object detection? Wow. I would say
| Gemini's output is better than the ground truth in most of the
| example images.
| mehulashah wrote:
| Cool post. We did a similar evaluation for document segmentation
| using the DocLayNet benchmark from IBM:
| https://ds4sd.github.io/icdar23-doclaynet/task/ but on modern
| document OCR models like Mistral, OpenAI, and Gemini. And what do
| you know, we found similar performance -- DETR-based segmentation
| models are about 2x better.
|
| Disclosure: I work for https://aryn.ai/
___________________________________________________________________
(page generated 2025-07-10 23:00 UTC)