[HN Gopher] O3 beats a master-level GeoGuessr player, even with ...
___________________________________________________________________
O3 beats a master-level GeoGuessr player, even with fake EXIF data
Author : bko
Score : 351 points
Date : 2025-04-29 16:48 UTC (6 hours ago)
(HTM) web link (sampatt.com)
(TXT) w3m dump (sampatt.com)
| orangecat wrote:
| Amazing. I'm relatively bullish on AI and still I would have bet
| on the human here. Looking forward to the inevitable goalpost-
| moving of "that's not real reasoning".
| short_sells_poo wrote:
| Can you please explain to me how this is evidence for
| reasoning?
| z7 wrote:
| Quoting Chollet:
|
| >I have repeatedly said that "can LLM reason?" was the wrong
| question to ask. Instead the right question is, "can they
| adapt to novelty?".
|
| https://x.com/fchollet/status/1866348355204595826
| kelseyfrog wrote:
| Because the output contains evidence of thought processes
| that have been established as leading to valid solutions to
| problems.
|
| I have a simple question: Is text a sufficient medium to
| render a conclusion of reasoning? It can't be sufficient for
| humans and insufficient for computers - such a position is
| indefensible.
| empath75 wrote:
| I would say that almost all of what humans do is not the
| result of reasoning, and that reasoning is an unnatural and
| learned skill for humans, and most humans aren't good at
| even very basic reasoning.
| kelseyfrog wrote:
| Usually we move the goalposts for AI. It takes more guts
| to move the goalposts for humans. I applaud it.
|
| Do you suppose we can deduce reasoning through the medium
| of text?
| zahlman wrote:
| > Because the output contains evidence of thought processes
| that have been established as leading to valid solutions to
| problems.
|
| This sort of claim always just reminds me of Lucky's
| monologue in Waiting for Godot.
| kelseyfrog wrote:
| You're not wrong. It's an artifact of rewriting the
| definition of reason into a sentence that begins with
| "Because the output ..."
| zahlman wrote:
| I didn't mean the wording itself; I meant that the claim
| isn't convincing to me for the same reason that Lucky's
| speech doesn't demonstrate an intelligent speaker.
| s17n wrote:
 | Geoguessing isn't much of a reasoning task; it's more about
 | memorizing a bunch of knowledge. Since LLMs contain essentially
 | all knowledge, it's not surprising that they would be good at
 | this.
|
| As far as goalpost-moving goes, it's wild to me that nobody is
| talking about the turing test these days.
| distortionfield wrote:
| Because the Chinese Room is a much better analogy for what
| LLMs are doing inside than the Turing test is.
| jibal wrote:
| That's a non sequitur that mixes apples and giraffes, and
| is completely wrong about what happens in the Chinese Room
| and what happens in LLMs. Ex hypothesi, the "rule book"
| that the Searle homunculus in the Chinese Room uses is "the
| right sort of program" to implement "Strong AI". The LLM
| algorithm is very much not that sort of program, it's a
| statistical pattern matcher. Strong AI does symbolic
| reasoning, LLMs do not.
|
| But worse, the Turing Test is not remotely intended to be
| an "analogy for what LLMs are doing inside" so your
| comparison makes no sense whatsoever, and completely fails
| to address the actual point--which is that, for ages the
| Turing Test was held out as the criterion for determining
| whether a system was "thinking", but that has been
| abandoned in the face of LLMs, which have near perfect
| language models and are able to closely model modes of
| human interaction _regardless_ of whether they are
| "thinking" (and they aren't, so the TT is clearly an
| inadequate test, which some argued for decades before LLMs
| became a reality).
| semi-extrinsic wrote:
| > the TT is clearly an inadequate test, which some argued
| for decades before LLMs became a reality
|
| To be specific, in a curious quirk of fate, LLMs seem to
| be proving right much of what Chomsky was saying about
| language.
|
 | E.g. in 1996 he described the Turing test as _"although
 | highly influential, it seems to me not only foreign to
 | the sciences but also close to senseless"_.
|
| (Curious in that VC backed businesses are experimentally
| verifying the views of a prominent anti-capitalist
| socialist.)
| CamperBob2 wrote:
| What happens if we give the operator of the Chinese Room a
| nontrivial math problem, one that can't simply be answered
| with a symbolic lookup but requires the operator to proceed
| step-by-step on a path of inquiry that he doesn't even know
| he's taking?
|
| The analogy I used in another thread is a third grader who
| finds a high school algebra book. She can read the book
| easily, but without access to teachers or background
| material that she can engage with -- consciously,
| literately, and interactively, unlike the Chinese Room
| operator -- she will not be able to answer the exercises in
| the book correctly, the way an LLM can.
| TimorousBestie wrote:
| A lot happens in seventy-five years.
| jibal wrote:
| People were talking about the Turing Test as the criterion
| for whether a system was "thinking" up until the advent of
| LLMs, which was far less than 75 years ago.
| debugnik wrote:
| The whole point of Turing's paper was to show that the
| Test doesn't answer whether a computer thinks, because
| it's a meaningless metric, but instead shows what the
 | computer _can do_, which is much more meaningful.
| zahlman wrote:
| I see this claim asserted frequently, but never with
| evidence. It doesn't match my personal perception.
| bluefirebrand wrote:
| > As far as goalpost-moving goes, it's wild to me that nobody
| is talking about the turing test these days
|
| To be honest I am still not entirely convinced that current
| LLMs pass the turing test consistently, at least not with any
| reasonably skeptical tester
|
| "Reasonably Skeptical Tester" is a bit of goalpost shifting,
| but... Let's be real here.
|
| Most of these LLMs have way too much of a "customer service
| voice", it's not very conversational and I think it is fairly
| easy to identify, especially if you suspect they are an LLM
| and start to probe their behavior
|
| Frankly, if the bar for passing the Turing Test is "it must
| fool some number of low intelligence gullible people" then
| we've had AI for decades, since people have been falling for
| scammy porno bots for a long time
| jibal wrote:
| One needs to be more than "reasonably skeptical" and merely
| not "low intelligence gullible" to be a competent TT judge
| --it requires skill, experience, and understanding an LLM's
| weak spots.
|
| And the "customer service voice" you see is one that is
| intentionally programmed in by the vendors via baseline
| rules. They can be programmed differently--or overridden by
| appropriate prompts--to have a very different tone.
|
| LLMs trained on trillions of human-generated text fragments
| available from the internet have shown that the TT is
| simply not an adequate test for identifying whether a
| machine is "thinking"--which was Turing's original intent
| in his 1950 paper "Computing Machinery and Intelligence" in
| which he introduced the test (which he called "the
| imitation game").
| bluefirebrand wrote:
| It's actually trivial, even with the best LLMs on the
| market:
|
| Try to rapidly change the conversation to a wildly
| different subject
|
| Humans will resist this, or say some final "closing
| comments"
|
 | Even the absolute best LLMs will happily go wherever they
 | are led, without so much as commenting on the topic shift
|
| Try it out
|
| Edit: This isn't even a terribly contrived example by the
| way. It is an example of how some people with ADHD
| navigate normal conversations sometimes
| shawabawa3 wrote:
| Gemini is pretty good at resisting this
|
| https://aistudio.google.com/app/prompts/1dxV3NoYHo6Mv36uP
| Rjk...
|
 | It was doing so well until the last question :rip: But it's
 | normal that you can jailbreak a user prompt with another user
 | prompt; I think with system prompts it would be a lot harder
| darkwater wrote:
| > As far as goalpost-moving goes, it's wild to me that nobody
| is talking about the turing test these days.
|
 | Well, in this case humans have to be trained as well, but now
 | there are humans pretty good at detecting LLM slop. (I'm
 | half-joking and half-serious)
| sundarurfriend wrote:
| > As far as goalpost-moving goes, it's wild to me that nobody
| is talking about the turing test these days.
|
| UCSD: Large Language Models Pass the Turing Test
| https://news.ycombinator.com/item?id=43555248
|
| From just a month ago.
| s17n wrote:
| Exactly - maybe the most significant long-term goal in
| computer science history has been achieved and it's barely
| discussed.
| Macha wrote:
 | Obviously when the Turing Test was designed, the thought was
 | that anything that could pass it would be so clearly human-like
 | that passing it would be an unambiguous signal.
|
| LLMs really made it clear that it's not so clear cut. And so
| the relevance of the test fell.
| zahlman wrote:
| Look at contemporary accounts of what people thought a
| conversation with a Turing-test-passing machine would look
| like. It's clear they had something very different in mind.
|
 | Realizing problems with previous hypotheses about _what might
 | make_ a good test is not the same thing as choosing a
 | standard and then revising it when it's met.
| s17n wrote:
| I think any time a 50+ year old problem is solved, it
| should be considered a Big Deal, regardless of how the
| solution changes our understanding of the original problem.
| TimorousBestie wrote:
| I don't think any goalposts need to be redecorated. The "inner
| monologue" isn't a reliable witness to o3's model, it's at best
| a post-hoc estimation of what a human inner monologue might be
| in this circumstance. So its "testimony" about what it is doing
| is unreliable, and therefore it doesn't move the needle on
| whether or not this is "real reasoning" for some value of that
| phrase.
|
| In short, it's still anthropomorphism and apophenia locked in a
| feedback loop.
| katmannthree wrote:
| Devil's advocate, as with most LLM issues this applies to the
| meatbags that generated the source material as well. Quick
| example is asking someone to describe their favorite music
| and why they like it, and note the probable lack of reasoning
| on the `this is what I listened to as a teenager` axis.
| hombre_fatal wrote:
| Good point. When we try to explain why we're attracted to
| something or someone, what we do seems closer to modeling
 | what we like to think about ourselves. At the extreme, we're
| just story-telling about an estimation we like to think is
| true.
| TimorousBestie wrote:
| I largely agree! Humans are notoriously bad at doing what
| we call reasoning.
|
| I also agree with the cousin comment that (paraphrased)
| "reasoning is the wrong question, we should be asking about
| how it adapts to novelty." But most cybernetic systems meet
| that bar.
| ewoodrich wrote:
| Something as inherently subjective as personal preference
| doesn't seem like an ideal example to make that point. How
| could you expect to objectively evaluate something like "I
| enjoy songs in a minor scale" or "I hate country"?
| katmannthree wrote:
| The point is to illustrate the disconnect between stated
| reasoning and proximate cause.
|
| Consider your typical country music enjoyer. Their
| fondness of the art, as it were, is far more a function
| of cultural coding during their formative years than a
| deliberate personal choice to savor the melodic twangs of
| a corncob banjo. The same goes for people who like
| classic rock, rap, etc. The people who `hate' country are
| likewise far more likely to do so out of oppositional
| cultural contempt, same as people who hate rap or those
| in the not so distant past who couldn't stand rock &
| roll.
|
| This of course fails to account for higher-agency
| individuals who have developed their musical tastes, but
| that's a relatively small subset of the population at
| large.
| empath75 wrote:
| I don't think the inner monologue is evidence of reasoning at
| all, but doing a task which can only be accomplished by
| reasoning is.
| TimorousBestie wrote:
 | Geoguessr is not a task that can only be accomplished by
 | reasoning. Famously, it took less than a day of compute
 | time in 2011 to SLAM together a bunch of pictures of Rome
 | (https://grail.cs.washington.edu/rome/).
| jibal wrote:
| Such as? geoguessing certainly isn't that.
| red75prime wrote:
| > it's at best a post-hoc estimation of what a human inner
| monologue might be in this circumstance
|
| Nope. It's not autoregressive training on examples of human
| inner monologue. It's reinforcement learning on the results
| of generated chains of thoughts.
| jibal wrote:
| "It's reinforcement learning on the results of generated
| chains of thoughts."
|
| No, that's not how LLMs work.
| red75prime wrote:
| Base models are trained using autoregressive learning.
| "Reasoning models" are base models (maybe with some
| modifications) that were additionally trained using
| reinforcement learning.
| Philpax wrote:
| That is how reasoning models work:
| https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1
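To make the training distinction above concrete, here is a toy, self-contained sketch of outcome-based reinforcement learning in the REINFORCE style. It is nothing like a real reasoning-model pipeline (no transformer, no generated text; the two "chains of thought", the reward baseline, and all constants are invented for illustration): only the final answer is rewarded, and the policy shifts toward whichever strategy produces correct answers.

```python
import math
import random

random.seed(0)

# Two hard-coded "chain-of-thought" strategies for answering 2 + 2.
COTS = {
    "add": "2 + 2: count up from 2 twice -> 4",
    "concat": "2 + 2: glue the digits together -> 22",
}
ANSWERS = {"add": 4, "concat": 22}

logits = {"add": 0.0, "concat": 0.0}  # tabular "policy"

def sample(logits):
    """Sample a strategy from the softmax over logits."""
    z = sum(math.exp(v) for v in logits.values())
    r, acc = random.random(), 0.0
    for k, v in logits.items():
        acc += math.exp(v) / z
        if r <= acc:
            return k
    return k

for step in range(500):
    choice = sample(logits)
    reward = 1.0 if ANSWERS[choice] == 4 else 0.0  # outcome-only reward
    # REINFORCE update with a 0.5 baseline: correct strategies are
    # pushed up, incorrect ones pushed down.
    z = sum(math.exp(v) for v in logits.values())
    for k in logits:
        p = math.exp(logits[k]) / z
        grad = (1.0 if k == choice else 0.0) - p
        logits[k] += 0.1 * (reward - 0.5) * grad

print(max(logits, key=logits.get))  # prints "add"
```

The point of the sketch is the shape of the signal: nothing grades the intermediate "reasoning" text itself, only the outcome, which is the sense in which reasoning models differ from plain autoregressive training on human text.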
| InkCanon wrote:
 | I think if your assumption is that AI is deducing where it is
 | with rational thought, you would be mistaken. In truth, what
 | probably happened is that the significant majority of digital
 | images of the world have been scraped, labeled, and used as
 | training data.
| oncallthrow wrote:
| How do you explain
| https://simonwillison.net/2025/Apr/26/o3-photo-locations/?
| Rumudiez wrote:
| they only posted one photo in the post, but going off of
| that it's still an easy match based on streetview imagery.
| furthermore, the AI just identified the license plate and
| got lucky that photographer lives in a populous area,
| making it more prominent in the training data and therefore
| more likely to be found (even though it was off by 200
| miles on its first guess)
| simonw wrote:
| I posted two more at the bottom, from Madagascar and
| Buenos Aires:
| https://simonwillison.net/2025/Apr/26/o3-photo-
| locations/#up...
| Philpax wrote:
| Try it with your own photos from around the world. I used my
| own photos from Stockholm, San Francisco, Tvarozna, Saas-Fee,
| London, Bergen, Adelaide, Melbourne, Paris, and Sicily, and
| can confirm that it was within acceptable range for almost
| all of them (without EXIF data), and it absolutely nailed
| some of the more obvious spots.
| SpaceManNabs wrote:
| > Looking forward to the inevitable goalpost-moving of "that's
| not real reasoning"
|
| It did a web lookup.
|
| It is not comparing humans and o3 with equal resources.
| SamPatt wrote:
| That's really not a fair assessment.
|
| It used search in 2 of 5 rounds, and it already knew the
| correct road in one of those rounds (just look at the search
| terms it used).
|
| If you read the chain of thought output, you cannot dismiss
| their capability that easily.
| SpaceManNabs wrote:
| Why is it not a fair assessment to say it is comparing two
| "clients" with different resources if one can do a web
| lookup and the other cannot?
|
| You note yourself that it was meaningful in another round.
|
| > Also, the web search was only meaningful in the Austria
| round. It did use it in the Ireland round too, but as you
| can see by the search terms it used, it already knew the
| road solely from image recognition.
| SamPatt wrote:
| I thought it might matter somewhat in that one Austria
| round. I was incorrect - I re-ran both rounds where the
| model used search, without search this time, and the
| results were nearly identical. I updated the post with
| the details.
|
| That's why I'm saying it's unfair to just claim it's
| doing a web lookup. No, it's way more capable than that.
| SirHumphrey wrote:
 | My objection is not "that is not real reasoning"; my objection
 | is that it's not that hard.
 |
 | I happen to do some geolocating from static images from time to
 | time, and at least most of the images provided as examples
 | contain a lot of clues - enough that I think a semi-experienced
 | person could figure out the location, although - in fairness -
 | in a few hours, not a few minutes.
|
| Second, the similar approaches were tried using CNNs and it
| worked (somewhat)[1].
|
| [1]: https://huggingface.co/geolocal/StreetCLIP
|
 | EDIT: I am not talking about GeoGuessr - I am talking about
 | geolocating an image with everything available (e.g. Google...)
| usaar333 wrote:
| Why? AI beat rainbolt 1.5 years ago:
| https://www.npr.org/2023/12/19/1219984002/artificial-intelli...
|
| AI tends to have superhuman pattern matching abilities with
| enough data
| karlding wrote:
| If you watch the video, (one of) the reasons why the AI was
| winning was because it was using "meta" information from the
| Street View camera images, and not necessarily because it's
| successfully identifying locations purely based on the
| landmarks in the image.
|
| _> I realized that the AI was using the smudges on the
| camera to help make an educated guess here._
|
| [0] https://youtu.be/ts5lPDV--cU?t=1412
| ApolloFortyNine wrote:
| Pro geoguessr players do the same thing. The vividness of
| the colors and weirdness in the sky are two examples I've
| seen Rainbolt use in the past (and he's not even the best).
| ZeWaka wrote:
| Meta is widely used by humans. One funny one is the
| different hiding-masks for the different streetview cars.
| 1970-01-01 wrote:
| Give it a photo from the surface of Mars and verify if it's
| actually capable of thinking outside the box or if it's relying
| on metadata and text.
|
| https://nssdc.gsfc.nasa.gov/planetary/image/mera_hills.jpg
| SamPatt wrote:
| Done. Here's o3's reply:
|
| >That's not Earth at all--this is the floor of Jezero Crater
| on Mars, the dusty plain and low ridge captured by NASA's
| Perseverance rover (the Mastcam-Z color cameras give away the
| muted tan-pink sky and the uniform basaltic rubble strewn
| across the regolith).
| zahlman wrote:
| > Looking forward to the inevitable goalpost-moving of "that's
| not real reasoning".
|
| How is that moving the goalposts? Where did you see them set
| before, and where did your critics agree to that?
| TimTheTinker wrote:
| > Looking forward to the inevitable goalpost-moving of "that's
| not real reasoning".
|
| It's less about the definition of "reasoning" and more about
| what's _interesting_.
|
| Maybe I'm wrong here ... but a chess bot that wins via a 100%
| game solution stored in exabytes of precomputed data might have
| an interesting internal design (at least the precomputing
| part), but playing against it wouldn't keep on being an
| interesting experience for most people because it always wins
| optimally and there's no real-time reasoning going on (that is,
| unless you're interested in the experience of playing against a
| perfect player). But for most people just interested in playing
| chess, I suspect it would get old quickly.
|
| Now ... if someone followed up with a tool that could explain
| insightfully why any given move (or series) the bot played is
| the best, or showed when two or more moves are equally optimal
| and why, _that_ would be really interesting.
| bongodongobob wrote:
| I tried this the other day with a picture of my dog in a field in
| a park by a river with EXIF stripped. It gave me a list of parks
 | on the correct river. There were really no features other than
 | the foliage and water. Seems like magic to me; I don't really
 | understand how it's possible to be that accurate.
| fwip wrote:
| Unless you were running the model locally in a way you control,
| it seems very possible that the model has additional hidden
| context regarding your location (perhaps based on IP, perhaps
| based on credit card billing information, etc).
| bongodongobob wrote:
 | That's kind of what I'm thinking; it seems too good to be true.
| buzzerbetrayed wrote:
| It still works if you upload an image that was taken on the
| opposite side of the globe from where you are.
| Workaccount2 wrote:
| The hidden context is probably many other photos taken in
| that area appearing in the training set.
| christianqchung wrote:
| ChatGPT doesn't let you share chats with images (why?), but
| here is what o3 said for an image with memory turned off:
|
| "Thought for 11 seconds
|
| It's hard to pin down an exact spot from this single photo,
| but a few visual clues narrow the possibilities:
|
| Temperate-zone trees: Most of the trees have lost their
| leaves, but scattered white-pines or spruces remain green.
| That mix--deciduous hardwoods plus a sprinkling of conifers--
| is typical of the northeastern United States and southeastern
| Canada.
|
| Broad, moderately fast river: The channel looks 75-100 m wide
| with swirling surface currents but no large rapids,
| suggesting a mid-sized river in its lower reaches--something
| like the Merrimack, Connecticut, Delaware, or Susquehanna
| rather than a small mountain stream.
|
| Late-winter / early-spring light: Bare branches and a clear
| blue sky with fairly high sun angle hint at March-April in
| that latitude range.
|
| So my best guess is that you're standing on the bank of a
| northeastern U.S. river--perhaps somewhere in New England or
| the Mid-Atlantic--during late winter or early spring. If you
| can share any nearby landmarks, town names, or additional
| photos, I might be able to zero in more precisely."
|
| With memory turned on:
|
| "Thought for 5 seconds
|
| Beautiful spot! From what I can see:
|
| a broad, slow-moving river with a sandy/rocky bank in the
| foreground
|
| bare deciduous trees mixed with some white pines on the far
| shore (so it's probably early spring or late autumn)
|
| no mountains on the horizon, just gently rolling terrain
| typical of low-lying floodplains in the U.S. Northeast
|
| That mix of hardwoods and pines, plus the flat, braided water
| surface, reminds me a lot of the Merrimack or Concord Rivers
| in Massachusetts or southern New Hampshire--both have public
| footpaths where you can stand right at the water's edge like
| this.
|
| Of course, many temperate rivers look similar. If you can
| give me a hint (e.g., the town or nearby landmark), I can
| narrow it down further--maybe even point you to trail maps or
| history of the exact stretch you photographed."
|
 | Southern NH is correct, and I am certain it is drawing from
 | memory/past chats. However, I can't replicate a specific
 | behavior I saw once: in a temporary chat (no past chat/memory
 | enabled), it said that it guessed where the photo was taken
 | based on my location.
| simonw wrote:
| "ChatGPT doesn't let you share chats with images (why?)"
|
| Probably because if you uploaded pornography (or illegal
| imagery) to ChatGPT and then shared a link with the world
| it would be embarrassing for OpenAI.
| christianqchung wrote:
| Wouldn't that apply to any website with image hosting
| abilities though? Why does that apply to OpenAI in
| particular?
|
| On an unrelated note, I like your blog.
| simonw wrote:
| My guess is that OpenAI are risk averse on this
| particular issue, because people could come up with some
| very creative awful scenarios with ChatGPT and image
| analysis. "Which of these people looks the most stupid"
| plus a jailbreak would be instant negative press.
| causality0 wrote:
| Could it have used data you inadvertently supplied it, like the
| location from which you uploaded the image?
| tough wrote:
 | ChatGPT has metadata about you.
 |
 | You can only properly test this on a fresh anonymous guest VPN
 | session.
| pugworthy wrote:
| Ask it to explain how it came to the conclusion. It does give
| you a pretty logical sequence of how it got there.
|
| Also, if it's wrong, tell it so and give it a hint to get it
| closer. It's kind of fun to guide it to the right answer.
| julianhuang wrote:
 | 1. The "master geoguesser" label is a bit misleading - as
 | mentioned in his blog post, there are players far better than
 | him, and he is certainly not the bar for human supremacy.
 | Probably analogous to a 1400-1800 Elo chess player.
 |
 | 2. o3 isn't the best model at playing GeoGuessr; Gemini 1.5 &
 | 2.5 solidly beat it out - for those interested, check out my
 | friend's benchmark (https://geobench.org/) and blog post
 | (https://ccmdi.com/blog/GeoBench) detailing interesting model
 | explanations.
 |
 | 3. In the post, he only tests on one game - o3's average score
 | over 100 locations (20 5-location games) was 19,290, far lower
 | than the 23,179 in the game.
 |
 | Model geolocation capabilities are really important to keep
 | track of, but the specific blog post in question isn't anything
 | out of the ordinary. LLMs are making geolocation abilities much
 | more accessible, but still fall short compared to 1. top
 | GeoGuessr players playing GeoGuessr (only Google Street View
 | coverage, without web search) and 2. professional geolocators,
 | who are proficient at using a wide variety of software/search.
 | I.e., if the CIA wanted to find someone using an image, LLMs
 | would not provide them any unique ability to do so as opposed
 | to someone like Rainbolt.
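For readers comparing the scores quoted above: GeoGuessr rounds are scored on distance from the true location, up to 5000 points per round (so a 5-location game maxes out at 25,000). The exact formula isn't published; the sketch below uses the community's reverse-engineered approximation, where points decay exponentially with great-circle distance and the world-map size constant is roughly 14,917 km. Both the decay shape and the constant are assumptions, not an official spec.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def round_score(guess, actual, map_size_km=14916.862):
    """Approximate GeoGuessr round score: 5000 for a perfect guess,
    decaying exponentially with distance (constants are assumed)."""
    d = haversine_km(*guess, *actual)
    return round(5000 * math.exp(-10 * d / map_size_km))

print(round_score((48.8566, 2.3522), (48.8566, 2.3522)))  # perfect guess: 5000
```

Under this formula, an average of 19,290 over 5-round games corresponds to guesses typically a few hundred kilometers off, which matches the "right country, wrong region" behavior described in the thread.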
| kenjackson wrote:
| From your linked article: "It's clear that large language
| models have an emergent capability to play GeoGuessr, and play
| it well. Their abilities to generalize are nascent, but
| present"
|
 | This is very accurate -- their abilities to generalize are
 | nascent, but still surprisingly capable. The world is about to
 | put its best and brightest math/CS minds to work over the next
 | decade (at least) increasing the capabilities of these AIs
 | (with the help of AI). I just don't understand the pessimism
 | about the technology.
| julianhuang wrote:
| I completely agree that this is an incredible advancement as
| someone who has watched the rise of LLMs' GeoGuessr
| abilities. I just wanted to qualify the claim made in the
| blog post "In Which I Try to Maintain Human Supremacy for a
| Bit Longer". I also think that models would need to become
| far more proficient at tool use (querying OpenStreetMap
| features, coverage-checking Google Maps, calculating sun
| direction, etc.) that current human geolocators have access
| to, and precise spatial reasoning. Additionally, there is a
| whole corpus of GeoGuessr-specific knowledge that probably
| wouldn't have a reason to be in model training data (Google
| Street View coverage quirks, the specific geographic ranges
| of vegetation and architecture, tiny visual idiosyncrasies in
| country infrastructure like the width of red on a sign,
| etc.). However, I think this could probably be solved with
| more data, and I don't think there is any fundamental
| barrier.
| fmbb wrote:
| How fast are they compared to human players?
| SamPatt wrote:
| I'm the author - I tried to be as upfront as possible about my
| skill level in the post.
|
| The human supremacy line is just a joke, there are already
| models specifically trained for Geoguessr which are already
| beating the best players in the world, so that ship has sailed.
|
| That geobench work is really cool, thanks for sharing it.
| amrrs wrote:
 | Its thinking process for guessing a place is fascinating. Even
 | o4-mini-high is quite good[1] and very fast.
 |
 | But unlike a GeoGuessr player, it uses web search[1]
 |
 | [1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl
| bredren wrote:
 | Neat to see the progress of this, from Simon's original post,
 | to his comment, to this.
| vunderba wrote:
| The original post was actually this one - two weeks prior.
|
| https://news.ycombinator.com/item?id=43723408
|
| https://flausch.social/@piegames/114352447253793517
| asdsadasdasd123 wrote:
| This is probably one of the less impressive LLM applications imo.
| Like it already knows what every plant, street sign, etc is. I
| would imagine a traditional neural net would do really well here
| as well if you can extract some crude features.
| EGreg wrote:
| Cant the same be said about " _unimpressive_ " behavior by
| coding LLMs that know every algorithm, language and library?
| asdsadasdasd123 wrote:
| Disagree because code has to be far more precise than, the
| location is in the jungles of brazil. This level of coding as
| never been achievable by traditional ML methods AFAIK
| exitb wrote:
| I tried a picture of Dublin and it pointed out the hotel I took
| it from. Obviously that's more data than any single person can
| keep in their head.
| OtherShrezzing wrote:
| It's my understanding that o3 was trained on multimodal data,
| including imagery. Is it unreasonable to assume its training data
| includes images of these exact locations and features? GeoGuesser
| uses Google Maps, and Google Maps purchases most of its imagery
| from third-parties these days. If those third parties aren't also
| selling to all the big AI companies, I'd be very surprised.
| Yenrabbit wrote:
| Try it with your own personal photos. It is scarily good!
| rafram wrote:
| That's true for heavily photographed urban areas. I've tried
| it on some desert photos (even at identifiable points with
| human structures) and it basically just guesses a random
| trailhead in Joshua Tree and makes up a BS explanation for
| why it matches.
| kube-system wrote:
| I have had surprisingly good luck with beach photos that
| don't have much beyond dunes and vegetation in them
| walthamstow wrote:
| Good luck meaning o3 guessed it right or wrong?
| kube-system wrote:
| o3 made very accurate guesses, and had plausible
| explanations for the features it analyzed
| GaggiX wrote:
| It does work well with images you have taken, not just
| Geoguessr: https://simonwillison.net/2025/Apr/26/o3-photo-
| locations/
| thi2 wrote:
| > I'm confident it didn't cheat and look at the EXIF data on
| the photograph, because if it had cheated it wouldn't have
| guessed Cambria first.
|
 | Hm, no way to be sure though. It would be nice to do another
 | run without EXIF information
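Re-running without EXIF is easy to do locally before uploading. A minimal sketch using the third-party Pillow library (the function name and file paths are illustrative): copying only the pixel data into a fresh image guarantees that no EXIF tags, GPS coordinates, or other metadata survive the round trip.

```python
from PIL import Image

def strip_exif(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path to dst_path with all metadata removed."""
    img = Image.open(src_path)
    clean = Image.new(img.mode, img.size)
    clean.putdata(list(img.getdata()))  # pixels only, no metadata
    clean.save(dst_path)
```

Re-encoding this way loses ICC profiles and slightly re-compresses JPEGs, which is fine for this experiment; the point is that the model provably cannot read location from the file's metadata.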
| pests wrote:
| > Google Maps purchases most of its imagery from third-parties
| these days
|
| Maps maybe, but Streetview? Rainbolt just did a video with two
| Maps PMs recently and it sounds like they still source all
| their street view themselves considering the special camera and
| car needed, etc.
| mikeocool wrote:
| My understanding is you're correct -- Google still captures a
| lot of their own street view imagery.
|
| Though there are other companies that capture the same sorts
| of imagery and license it. TomTom imagery is used on the Bing
| Maps street view clone.
| OtherShrezzing wrote:
| Maybe the end-user isn't Google Maps, but TomTom have a
| pretty comprehensive street-view-ish product for private
| buyers like car companies, Bing and Apple Maps called MoMa.
|
| I'd be surprised if this building[0] wasn't included in their
| dataset from every road-side angle possible, alongside every
| piece of locational metadata imaginable, and I'd be surprised
| if that dataset hasn't made it into OpenAI's training data -
| especially when TomTom's relationship to Microsoft, and
| Microsoft's relationship to OpenAI, is taken into account.
|
| [0] https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025
| -04...
| shpx wrote:
| You can upload your own panoramic images to Street View,
| people do this for hiking trails. But I'm sure 99% of
| streetview imagery is Google-sourced and Geoguessr might not
| even use user-submitted imagery.
|
| https://www.google.com/streetview/contribute/
| pests wrote:
 | I believe GeoGuessr categorizes their games on this facet.
 | Rainbolt plays only on official imagery.
| cpeterso wrote:
| Here's a link to that interview: https://youtu.be/2T6pIJWKMcg
| thrance wrote:
| A machine that's read every book ever written, seen every photo
 | ever taken, visited every street on Earth... That feels a
| little frightening.
| rafram wrote:
| From one of o3 outputs:
|
| > Rear window decal clearly reads "www.taxilinder.at". A quick
| lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.
|
| That's cheating. If it can use web search, it isn't playing fair.
| Obviously you can get a perfect score on any urban GeoGuessr
| round by looking up a couple businesses, but that isn't the
| point.
| artninja1988 wrote:
| Is it against the rules? I thought it was all fair game, but
| you are time-constrained
| rafram wrote:
| Yes: https://www.geoguessr.com/community-rules
| sltkr wrote:
| But if anything, those rules benefit ChatGPT: it can
| remember ~all of Wikipedia and translate ~every language on
| Earth, while a human would need access to online services
| for that.
|
| If anything, I'd think allowing looking stuff up would
| benefit human players over ChatGPT (though humans are
| probably much slower at it, so they probably lose on time).
| kbenson wrote:
| If it takes a model and database with a large chunk of
| the internet to compete and win, then that says
| something, as that's much more expensive and complex than
| just the model, because models have problems
| "remembering" correctly just like people.
|
| It's important to have fair and equivalent testing not
| because that allows people to win, but because it shows
| where the strengths and weaknesses of people and current
| AI actually are in a useful way.
| cocoto wrote:
| Connecting an LLM to the web or database is something
| cheap, not something expensive.
| kbenson wrote:
| I'm not sure how to make sense of this in the context of
| what we're discussing. Access to the web is exactly
| what's in question, and emulating the internet to a
| degree you don't actually need to access it to have the
| information is very expensive in resources because of how
| massive the dataset is, which is the point I was making.
| Gud wrote:
| Same with a human.
| twojacobtwo wrote:
| Why was this downvoted? It's a fair question and it wasn't
| stated as fact.
| 3np wrote:
| Because an accepted answer to that specific question is
| invariably a link/reference that the asker could have
| searched for (and posted if they think it's useful for the
| discussion) themselves directly, instead of putting that
| burden on the rest of us and amortizing everyone's
| attention. It's entitled and lazy.
|
| Alternative example: "I wondered what the rules actually
| say about web search and it is indeed not allowed: (link)"
| silveraxe93 wrote:
| Yeah, the author does note that in the article. He also points
| it out in the conclusion:
|
| > If it's using other information to arrive at the guess, then
| it's not metadata from the files, but instead web search. It
| seems likely that in the Austria round, the web search was
| meaningful, since it mentioned the website named the town
| itself. It appeared less meaningful in the Ireland round. It
| was still very capable in the rounds without search.
| rafram wrote:
| Seems like they should've just repeated the test. But without
| the huge point lead from the rounds where it cheated, it
| wouldn't have looked very impressive at all.
| silveraxe93 wrote:
| People found the original post so impressive they were
| saying that it had to be coming from cheating by looking at
| EXIF data. The point of this article was to show it
| doesn't. It got an unfair advantage in 1 (and say 0.5) out
| of 5. With the non-search rounds still doing great.
|
| If you think this is unimpressive, that's subjective so
| you're entitled to believe that. I think that's awesome.
| godelski wrote:
| Sorry, I think I misread you. I think you said:
|
| > People accused it of cheating by reading EXIF data. They
| > were wrong; it cheated by using web search. That makes the
| > people that accused it of cheating wrong, and this post
| > proves that.
|
| And is everyone forgetting that what OpenAI shows you
| during the CoT is not the full CoT? I don't think you can
| fully rely on that to make claims about when it did and
| didn't search
| SamPatt wrote:
| That's inaccurate. It beat me by 1,100 points, and given
| the chain of thought demonstrated that it knew the general
| region of both guesses before it employed search, it would
| likely have still beaten me in those rounds. Though
| probably by fewer points.
|
| I will try it again without web search and update the post
| though. Still, if you read the chain of thought, it
| demonstrates remarkable capabilities in all the rounds. It
| only used search in 2/5 rounds.
| godelski wrote:
| I'd be interested at capabilities without web search. The
| displayed CoT isn't the full CoT so it's hard to know if
| it really is searching or not. I mean it isn't always
| obvious when it does. Plus, the things are known to lie
| ¯\_(ツ)_/¯
| SamPatt wrote:
| I do understand the skepticism, and I'll run it again
| without search to see what happens.
|
| But a serious question for you: what would you need to
| see in order to be properly impressed? I ask because I
| made this post largely to push back on the idea that EXIF
| data matters and the models aren't that capable. Now the
| criticism moves to web search, even though it only
| mattered in one out of five rounds.
|
| What would impress you?
| mattmanser wrote:
| You're kinda being your own worst enemy though.
|
| "Technically cheating"? Why even add the "technically".
|
| It just gives the impression that you're not really
| objectively looking for any smoke and mirrors by the AI.
| SamPatt wrote:
| I hear you - but I had already read through the chain of
| thought which identified the right region before search,
| and had already seen the capabilities in many other
| rounds. It was self-evident to me that the search wasn't
| an essential part of the model's capabilities by that
| point.
|
| Which turned out to be true - I re-ran both of those
| rounds, without search this time, and the model's guesses
| were nearly identical. I updated the post with those
| details.
|
| I feel like I did enough to prove that o3's geolocation
| abilities aren't smoke and mirrors, and I tried to be
| very transparent about it all too. Do you disagree? What
| more could I do to show this objectively?
| godelski wrote:
| > What would impress you?
|
| I want to be clear that you tainted the capacity to
| impress me by the clickbait title. I don't think it was
| through malice, but I hope you realize the title is
| deceptive.[0] (Even though I use strong language, I do
| want to clarify I don't think it is malice)
|
| To paraphrase from my comment: if you oversell and under
| deliver, people feel cheated, even if the deliverable is
| revolutionary.
|
| So I think you might have the wrong framing to achieve
| this goal. I am actually a bit impressed by O3's
| capabilities. But at the same time you set the bar high
| and didn't go over or meet it. So that's going to really
| hinder the ability to impress. On the other hand, you set
| the bar low, it usually becomes easy to. It is like when
| you have low expectations for a movie and it's mediocre:
| you still feel good, right?
|
| [0] https://news.ycombinator.com/item?id=43836791
| SamPatt wrote:
| I did repeat the test without search, and updated the post.
| It made no difference. Details here:
|
| https://news.ycombinator.com/item?id=43837832
| clhodapp wrote:
| The question is not only how much it helped the AI model but
| rather how much it _would have_ helped the human.
|
| This is because the AI model _could_ have chosen to run a
| search whenever it wanted (e.g. perhaps if it knew how to
| leverage search better, it could have used it more).
|
| In order for the results to be meaningful, the competitors
| have to play by the same rules.
| Ukv wrote:
| The author did specifically point out that
|
| > Using Google during rounds is technically cheating - I'm
| unsure about visiting domains you find during the rounds
| though. It certainly violates the spirit of the game, but it
| also shows the models are smart enough to use whatever
| information they can to win.
|
| and had noted in the methodology that
|
| > Browsing/tools -- o3 had normal web access enabled.
|
| Still an interesting result - maybe more accurate to say
| O3+Search beats a human, but could also consider the search
| index/cache to just be a part of the system being tested.
| spookie wrote:
| A human can also use the same tools if it wasn't for the
| rules or fair play. They should've simply redone the test.
| ceph_ wrote:
| The AI should be forced to use the same rules as the human.
| Not the other way around. The AI shouldn't be using outside
| resources.
| bscphil wrote:
| I think that's part of the point they're making, hence
| "They should've simply redone the test."
| voxic11 wrote:
| Another rule bans "using third-party software or scripts
| in order to gain an unfair advantage over other players."
|
| So is it even possible for O3 to beat another player
| while complying with the rules?
| ben_w wrote:
| If a player uses such a model, the model is third-party
| and the player is cheating.
|
| But: when a specific model is itself under test, I would
| say that during the test it becomes "first" (or second?)
| party rather than "third".
| krferriter wrote:
| An AI being better than a human at doing a google search
| and then skimming a bunch of pages to find location-related
| terms isn't as interesting of a result.
| inerte wrote:
| How the heck is it not? Computers are looking into
| screenshots and searching the internet to support their
| "thinking", that's amazing! Have we become so used to AI
| that what was impossible 6 months ago is shruggable today?
|
| I've been doing this MIND-DASH diet lately and it's
| amazing: I can just take a picture of whatever
| (nutritional info / ingredients are perfect for that) and
| just ask if it fits my plan, and it tells me what bucket
| it falls into, with a detailed breakdown of macros in
| support of some additional goals I have (muscle building
| for powerlifting). It's amazing! And it does passively in
| 2 minutes what would take me 5-10 minutes of active
| searching.
| godelski wrote:
| In the same way a calculator performing arithmetic faster
| than humans isn't impressive. The same way running regex
| over a million lines and the computer beating a human in
| search isn't impressive
| ludwik wrote:
| Neither is impressive solely because we've gotten used to
| them. Both were mind-blowing back in the day.
|
| When it comes to AI - and LLMs in particular - there's a
| large cohort of people who seem determined to jump
| straight from "impossible and will never happen in our
| lifetime" to "obvious and not impressive", without
| leaving any time to actually be impressed by the
| technological achievement. I find that pretty baffling.
| godelski wrote:
| I agree, but without removing search you cannot decouple.
| Has it embedded a regex method and is just leveraging
| that? Or is it doing something more? Yes, even the regex
| is still impressive but it is less impressive that doing
| something more complicated and understanding context and
| more depth.
| rowanG077 wrote:
| I think both are very impressive, world shattering
| capabilities. Just because they have become normalized
| doesn't make it any less impressive in my view.
| ekidd wrote:
| I fully expect that someday the news will announce, "The
| AI appears to be dismantling the moons of Jupiter and
| turning them into dense, exotic computational devices
| which it is launching into low solar orbit. We're not
| sure _why_. The AI refused to comment. "
|
| And someone will post, "Yeah, but that's just computer-
| aided design and manufacturing. It's not _real_ AI."
|
| The first rule of AI is that the goalposts _always_ move.
| If a computer can do it, by definition, it isn't "real"
| AI. This will presumably continue to apply even as the
| Terminator kicks in the front door.
| eru wrote:
| Yes, but I choose to interpret that as a good thing. It
| is good that progress is so swift and steady that we can
| afford to keep moving the goalposts.
|
| Take cars as a random example: progress there isn't fast
| enough that we keep moving the goalposts for eg fuel
| economy. (At least not nearly as much.) A car with great
| fuel economy 20 years ago is today considered at least
| still good in terms of fuel economy.
| jug wrote:
| Yeah, it's a funny take because this is in fact a more
| advanced form of AI with autonomous tool use that is just
| now emerging in 2025. You might say "They could search
| the web in 2024 too", but that wasn't autonomous; it
| required being told to, or checking a box. This one is
| piecing ideas together like "Wait, I should Google for
| this", and that is specifically a new feature for OpenAI
| o3 that wasn't even in o1.
|
| While it isn't entirely in the spirit of GeoGuessr, it
| is a good test of the capabilities, where being great at
| GeoGuessr in fact becomes the lesser news here. It would
| still be impressive with this feature disabled.
| SamPatt wrote:
| That isn't what's happening though. I re-ran those two
| rounds, this time without search, and it changed nothing.
| I updated the post with details, you can verify it
| yourself.
|
| Claiming the AI is just using Google is false and
| dismissing a truly incredible capability.
| arandomhuman wrote:
| But then they couldn't make a click bait title for the
| article.
| _heimdall wrote:
| This seems like a great example of why some are so concerned
| with AI alignment.
|
| The game rules were ambiguous and the LLM did what it needed
| to (and was allowed to) to win. It probably is against the
| spirit of the game to look things up online at all but no one
| thought to define that rule beforehand.
| umanwizard wrote:
| No, the game rules aren't ambiguous. This is 100%
| unambiguously cheating. From the list of things that are
| definitely considered cheating in the rules:
|
| > using Google or other external sources of information as
| assistance during play.
|
| The contents of URLs found during play is clearly an
| external source of information.
| GaggiX wrote:
| I believed the rules were not explained to the model so
| it does use what it can.
| misnome wrote:
| Then you can 100% not claim it is "Playing" the game
| ben_w wrote:
| That right there _is the alignment problem_.
|
| If I task an AI with "peace on earth" and the solution
| the AI comes up with is ripped from The X-Files* and it
| kills everyone, it isn't good enough to say "that's
| cheating" or "that's not what I meant".
|
| * https://en.wikipedia.org/wiki/Je_Souhaite
| GaggiX wrote:
| It's playing a game in which the rules are a bit
| ambiguous if not explained.
| tshaddox wrote:
| o3 already is an external source of information. It's an
| online service backed by an enormous model generated from
| an even more enormous corpus of text via an enormous
| amount of computing power.
| godelski wrote:
| Pointing out that it is cheating doesn't excuse the lie in
| the headline. That just makes it bait and switch, a form of
| fraud. OP _knew_ they were doing a bait and switch.
|
| I remember when we were all pissed about clickbait headlines
| because they were deceptive. Did we just stop caring?
| bahmboo wrote:
| The headline said the AI beat him, it did not say it beat
| him in a GeoGuessr game. The article clearly states what he
| did and why.
| SecretDreams wrote:
| What's your suggestion for an alternative headline?
| godelski wrote:
| Can O3 Beat a Master-Level GeoGuessr Player?
| How Good is O3 at GeoGuessr?
| EXIF Does Not Explain O3's GeoGuessr Performance
| O3 Plays GeoGuessr (EXIF Removed)
|
| But honestly, OP had the foresight to remove EXIF data
| and memory from O3 to reduce contamination. The goal of
| the blog post was to show that O3 wasn't cheating. So by
| including search, they undermine the whole point of the
| post.
|
| The problem really stems from a lack of foresight: a
| failure to understand the critiques they sought to address
| in the first place. A good engineer understands that when
| their users/customers/<whatever> make a critique, what
| the gripe is about may not be properly expressed.
| You have to interpret your users complaints. Here, the
| complaint was "cheating", not "EXIF" per se. The EXIF
| complaints were just a guess at the mechanism in which it
| was cheating. But the complaint was still about cheating.
| SamPatt wrote:
| >The goal of the blog post was to show that O3 wasn't
| cheating.
|
| No, the goal of the post was to show that o3 has
| incredible geolocation abilities. It's through the lens
| of a Geoguessr player who has experience doing
| geolocation, and my perspective on whether the chain of
| thought is genuine or nonsense.
|
| In Simon's original post, people were claiming that o3
| doesn't have those capabilities, and we were fooled by a
| chain of thought that was just rationalizing the EXIF
| data. It only had the _appearance_ of capability.
|
| The ability to perform web search doesn't undermine the
| claim that o3 has incredible geolocation abilities,
| because it still needs to have an underlying capability
| in order to know what to search. That's not true for
| simply reading EXIF data.
|
| This is the best way I knew to show that the models are
| doing something really neat. Disagreements over the exact
| wording of my blog post title seem to be missing the
| point.
| sdenton4 wrote:
| The people pissed about clickbait headlines were often
| overstating things to drum up outrage and accumulate more
| hacker news upboats...
| godelski wrote:
| I'm not sure why you're defending clickbait. It is just
| fraud. I'm not sure why we pretend it is anything
| different.
|
| Sure, people made overblown claims about the effects, but
| that doesn't justify fraud. A little fraud is less bad
| than major fraud, but that doesn't mean it isn't bad.
| jasonlotito wrote:
| One of the rules bans the use of third-party software or
| scripts.
|
| Any LLM attempting to play will lose because of that rule.
| So, if you know the rules, and you strictly adhere to them
| (as you seem to be doing), then there's no need to click
| on the link. You already know it's not playing by
| GeoGuessr rules.
|
| That being said, if you are running a test, you are free to
| set the rules as you see fit and explain so, and under the
| conditions set by the person running the test, these are
| the results.
|
| > Did we just stop caring?
|
| We stopped caring about pedantry. Especially when the
| person being pedantic seems to cherry pick to make their
| point.
| kenjackson wrote:
| Technically the LLM is 3rd party software so the use of
| it is cheating. QED
| godelski wrote:
| This doesn't mean you shouldn't try to make things as fair
| as possible. Yes, it would still _technically_ violate the
| rules, but don't pretend like this is binary.
|
| > We stopped caring about pedantry
|
| Did we? You seem to be responding to my pedantic comment
| with a pedantic comment.
| 627467 wrote:
| Cheating implies there's a game. There isn't.
|
| > Titles and headlines grab attention, summarize content,
| and entice readers to engage with the material
|
| I'm sorry you felt defrauded instead. To me the title was
| very good at conveying to me the ability of o3 in
| geolocating photos.
| hatthew wrote:
| Title says o3 beat a [human] player. That implies there
| is some competition that has the capacity to be fair or
| unfair.
| sebzim4500 wrote:
| Presumably being an AI is technically against the
| GeoGuessr rules so I don't see how there can be an
| expectation that those rules were followed.
| tshaddox wrote:
| Sure, but o3 is itself already an online service backed
| by an enormous data set, so regardless of whether it also
| searched the web, it's clearly not literally "playing
| fair" against a human.
| godelski wrote:
| But it still bounds the competition. OP is skilled in the
| domain. I'm not, so if I wrote a post about how O3 beat
| me, you'd be saying how mundane a result it is. I mean,
| I suck at GeoGuessr. Beating me isn't impressive. That
| is also a bound.
| layman51 wrote:
| Using the decal as a clue is funny because what if there was a
| street scene where that happened to be misleading? For example,
| I had seen that a Sacramento County Sheriff car got to Europe
| and I guess it now belonged to a member of the public who is
| driving it with the original decals still attached. I wonder
| how the LLM would reason if it sees the car as "out of place".
| victorbjorklund wrote:
| Probabilities. That could happen with anything. Someone could
| build a classic Japanese house with a Japanese garden in
| Hawaii. But Japan is probably a better guess if you see a
| Japanese house with Japanese flora.
| yen223 wrote:
| Stands to reason a human might get fooled by this as well
| SamPatt wrote:
| Absolutely!
|
| It happens occasionally - the most common example I can
| think of is getting a license plate or other location from
| a tractor-trailer (semi) on the highway. Those are very
| unreliable.
|
| You also sometimes get flags in the wrong countries,
| immigrants showing their native pride or even embassies.
| SamPatt wrote:
| Author here - it's a fair criticism, and I point it out in the
| article. However, I kept it in for a few reasons.
|
| I'm trying to show the model's full capabilities for image
| location generally, not just playing geoguessr specifically.
| The ability to combine web search with image recognition,
| iteratively, is powerful.
|
| Also, the web search was only meaningful in the Austria round.
| It did use it in the Ireland round too, but as you can see by
| the search terms it used, it already knew the road solely from
| image recognition.
|
| It beat me in the Colombia round without search at all.
|
| It's worthwhile to do a proper apples and apples comparison -
| I'll run it again and update the post. But the point was to
| show how incredibly capable the model is generally, and the
| lack of search won't change that. Just read the chain of
| thought, it's incredible!
| k4rli wrote:
| It's still as much cheating as googling. Completely
| irrelevant. Even if it were to beat Blinky, it's not
| different from googlers/scripters.
| SamPatt wrote:
| I disagree. I ran those rounds again, without search this
| time, and the results were nearly identical:
|
| https://news.ycombinator.com/item?id=43837832
| IanCal wrote:
| I tried the image without search and it talked about Dornbirn
| anyway but ended up choosing Bezau which is really quite
| close.
|
| edit - the models are also at a disadvantage in a way too,
| they don't have a map to look at while the pick the location.
| SamPatt wrote:
| Yes, I re-ran those rounds and it made the same guesses
| without search, within 1km I believe.
|
| You're right about not having a map - I cannot imagine
| trying to line up the Ireland coast round without
| referencing the map.
| LeifCarrotson wrote:
| There's some level at which an AI 'player' goes from being
| competitive with a human player, matching better-trained
| human strategy against a more impressive memory, to just a
| cheaty computer with too much memorization. Finding that
| limit is the interesting thing about this analysis, IMO!
|
| It's not interesting playing chess against Stockfish 17, even
| for high-level GMs. It's alien and just crushes every human.
| Writing down an analysis to 20 move depth, following some
| lines to 30 or more, would be cheating for humans. It would
| take way too long (exceeding any time controls and more
| importantly exceeding the lifetime of the human), a powerful
| computer can just crunch it in seconds. Referencing a
| tablebase of endgames for 7 pieces would also be cheating,
| memorizing 7 terabytes of bitwise layouts is absurd but the
| computer just stores that on its hard drive.
|
| Human geoguessr players have impressive memories way above
| baseline with respect to regional infrastructure, geography,
| trees, road signs, written language, and other details.
| Likewise, human Jeopardy players know an awful lot of trivia.
| Once you get to something like Scrabble or chess, it's less
| and less about knowing words or knowing moves, but more about
| synthesizing that knowledge intelligently.
|
| One would expect a human to recognize some domain names like,
| I don't know, osu.edu: lots of people know that's Ohio State
| University, one of the biggest schools in the US, located in
| Columbus, Ohio. They don't have to cheat and go to an
| external resource. One would expect a human (a top human
| player, at least) to know that taxilinder.at is based in
| Austria. One would never expect any human to have every
| business or domain name memorized.
|
| With modern AI models trained on internet data, searching the
| internet is not that different from querying its own training
| data.
| mrlongroots wrote:
| To reframe your takeaway: you want to benchmark the
| "system" and see how capable it is. The boundaries of the
| system are somewhat arbitrary: is it "AI + web" or "only
| AI", and it is not about fairness as much as about "what do
| you, the evaluator, want to know".
| tshaddox wrote:
| > There's some level at which an AI 'player' goes from
| being competitive with a human player, matching better-
| trained human strategy against a more impressive memory, to
| just a cheaty computer with too much memorization. Finding
| that limit is the interesting thing about this analysis,
| IMO!
|
| And a lot of human competitions aren't designed in such a
| way that the competition even makes sense with "AI." A lot
| of video games make this pretty obvious. It's relatively
| simple to build an aimbot in a first-person shooter that
| can outperform the most skilled humans. Even in ostensibly
| strategic games like Starcraft, bots can micro in ways that
| are blatantly impossible for humans and which don't really
| feel like an impressive display of Starcraft skill.
|
| Another great example was IBM Watson playing Jeopardy! back
| in 2011. We were supposed to be impressed with Watson's
| natural language capabilities, but if you know anything
| about high-level Jeopardy! then you know that all you were
| really seeing is that robots have better reflexes than
| humans, which is hardly impressive.
| rowanG077 wrote:
| You seem indicate you want a computer to beat a human
| without ever using what a computer is actually good
| at(large memories, brute force compute etc). That seems a
| little ridiculous to me. How do you want it to engage?
| Disallowed use of native compute and must simulate a full
| human brain?
|
| Sure I do agree that the web search is too far, because
| it's literally cheating. But stockfish is super human at
| chess, it doesn't really matter that it can do this by
| leveraging the strengths of a computer.
| vlovich123 wrote:
| > It's not interesting playing chess against Magnus, even
| for high-level GMs. He just crushes almost every human
|
| The differences even among humans between the absolute best
| & those out of the top 10 tend to be pretty drastic. And a
| non-IM against Magnus won't even understand what's going
| on. You could similarly claim that Magnus just memorized a
| bunch of openings, which is similar to criticism leveled
| at GMs too, and is why Chess960 is now gaining more
| traction. My
| point is that there's not really such a thing as "fair" in
| a competition.
|
| Re GeoGuessr, why not let them use whatever tools are
| available? I have similar critiques about bike racing &
| restrictions on the technology advancements they can put on
| the bike. But every competition chooses arbitrary lines to
| draw which compose the rules so it doesn't really matter.
| sensanaty wrote:
| I mean Geoguessr explicitly states when you launch the
| game (in PvP mode) that googling/searching is bannable
| WhitneyLand wrote:
| As models continue to evolve it may not even need to cheat.
|
| Since web scale data is already part of pre-training this info
| is in principle available for most businesses without a web
| search.
|
| The exceptions would be if it's recently added, or doesn't
| appear often enough to generate a significant signal during
| training, as in this case with a really small business.
|
| It's not hard to imagine base model knowledge improving to the
| point where it's still performing at almost the same level
| without any web search needed.
| ricardo81 wrote:
| >isn't playing fair.
|
| the idea of having n more dimensions of information, readable
| and ingestible within a short span of time, probably isn't
| either.
| rowanG077 wrote:
| Isn't it? We totally accept this fact for humans. Some people
| are just insane at memorization. We don't call unfair that
| they use this to their advantage compared to someone who
| doesn't have that talent.
| ACS_Solver wrote:
| I just tried (o4-mini-high) and had it come to the wrong
| conclusion when I asked about the location and date, because it
| didn't search the web. I have a photo of a bench with a sign
| mentioning the cancellation of an event due to the Pope's
| death. It impressively figured out the location but then
| decided that Pope Francis is alive and the sign is likely a
| prank, so the photo is from April Fools' Day.
|
| Then after I explicitly instructed it to search the web to
| confirm whether the Pope is alive, it found news of his death
| and corrected its answer, but it was interesting to see how the
| LLM makes a mistake due to a major recent event being after its
| cutoff.
| CamperBob2 wrote:
| To be fair, my local copy of R1 isn't doing any searching at
| all, but it frequently says "A search suggests..." or something
| along those lines.
| mrcwinn wrote:
| O3 is seriously impressive for coding, as well, with Codex. It
| seems far superior to 3.7-thinking, although it's also more
| expensive in my usage.
| gizmodo59 wrote:
| Agreed. O3 is the best model out there for the tasks I've
| tried, and coding is a fair chunk of it. Claude 3.7 and
| Gemini 2.5 Pro seem to hallucinate more.
| weinzierl wrote:
| I tried it with a couple of holiday shots and a couple of shots
| from my window, and it is nothing short of amazing.
|
| That being said I noticed two things that probably hamper its
| performance - or make its current performance even more amazing -
| depending how you look at it:
|
| - It often tries to zoom in to decipher even minuscule text. This
| works brilliantly. Sometimes it tries to enhance contrast by
| turning the image into black and white with various threshold
| levels to improve the results, but in my examples it always went
| in the wrong direction. For example, the text was blown out white,
| it failed, it turned it even lighter instead of darker, failed
| again, turned it into a white rectangle and gave up on the
| approach.
|
| - It seems not to have any access to Google Maps or even Open
| Street Maps and therefore fails to recognize street patterns. This
| is even more baffling than the first point, because it is so
| unlike how I suppose human geo guessers work.
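The contrast behavior described in the first point is easy to picture with a minimal binarization sketch (illustrative only, not o3's actual tooling): global thresholding maps each grayscale pixel to pure black or white, and picking the threshold on the wrong side of the text's pixel values produces exactly the "white rectangle" failure.

```python
def binarize(pixels, threshold):
    """Map each 0-255 grayscale value to pure black (0) or white (255)."""
    return [255 if p >= threshold else 0 for p in pixels]

# Faint dark-ish strokes (200, 230) on a blown-out bright page (252, 254).
faint = [200, 230, 252, 254]

# Threshold between strokes and background recovers the text:
assert binarize(faint, 240) == [0, 0, 255, 255]

# Threshold below every pixel value washes the crop out entirely --
# the "white rectangle" outcome described in the comment above:
assert binarize(faint, 150) == [255, 255, 255, 255]
```

Real OCR preprocessing would use adaptive (local) thresholding rather than a single global cutoff, but the direction-of-adjustment problem is the same.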
| kazinator wrote:
| This seems like a really silly category in which to be competing
| against machines.
|
| Machine learning could index millions of faces, and then identify
| members of that set from pictures. Could you memorize millions of
| people, to be able to put a name to a face?
|
| Why not also compete against grep -r to see who can find matches
| for a regex faster across your filesystem.
| bongodongobob wrote:
| But that's not what it is doing and why this is cool.
| Imnimo wrote:
| On the first image, from the model's CoT:
|
| >"I also notice Cyrillic text on a sign"
|
| Am I missing this somewhere? Is the model hallucinating this?
|
| I'd also be very interested to see a comparison against 4o. 4o
| was already quite good at GeoGuessr-style tasks. How big of a
| jump is o3?
| plyptepe wrote:
| Turn left and look at the post; there should be Cyrillic text
| with a 3 underneath on the closest pole to you.
| Imnimo wrote:
| Oh I see, I had missed that o3 saw a second view of the
| scene, not just the screenshot in the write-up.
| j3s wrote:
| isn't anyone else horrified by this? the implication is that
| given an arbitrary picture, chatgpt can give you a very likely
| approximate location - expert level doxxing is in the hands of
| anyone with access to a chatgpt subscription.
|
| feels terrifying, especially for women.
| turtlebits wrote:
| It needs a lot of context. If its a private picture, it won't
| have enough information. I gave it a picture I took of my yard
| and it's guess spanned several US states.
|
| If its out in public, fair game?
| sr-latch wrote:
| the way i see it, before these tools, only someone with a lot
| of resources (or skills) could track down a location from a
| picture. now, anyone can do it.
|
| the best case outcome is people become more aware of the
| privacy implications of posting photos online
| micromacrofoot wrote:
| it wasn't that hard before, I've taught it to children, it's
| just that technical skills of the average person are incredibly
| low
|
| llms are basically shortcutting a wide swath of easily
| obtainable skills that many people simply haven't cared to
| learn
| usaar333 wrote:
| Been true since gpt-4.
| echelon wrote:
| I think this is _incredibly_ cool. As with many things, the
| good cases will outnumber the bad.
|
| This was always possible, it just wasn't widely distributed.
|
| Having a first class ability to effectively geocode an image
| feels like it connects the world better. You'll be able to
| snapshot a movie and find where a scene was filmed, revisit
| places from old photographs, find where interesting locations
| in print media are, places that designers and creatives used in
| their (typically exif-stripped) work, etc.
|
| Imagine when we get this for architecture and nature. Or even
| more broadly, databases of food from restaurants. Products.
| Clothing and fashion. You name it.
|
| Imagine precision visual search for everything - that'd be
| amazing.
| daemonologist wrote:
| Keep in mind that this is o3 + web search against a human
| _without_ web search. A sufficiently motivated person with
| access to your entire social media history, Google Earth and
| Streetview, etc. would outperform this significantly and could
| pinpoint almost any inhabited location with coverage.
|
| If you watch Linus Tech Tips, you may have noticed that when he
| films at his house _everything_ is blurred out to keep people
| from locating it - here's a recent example:
| https://www.youtube.com/watch?v=TD_RYb7m4Pw
|
| All that to say, unfortunately doxxing is already really hard
| to protect against. I don't think o3's capability makes the
| threat any harder to protect against, although it might lower
| the bar to entry somewhat.
| mopenstein wrote:
| Why especially women? Is the only thing stopping a person from
| being harmed is that their location isn't known? Especially
| women?
| red75prime wrote:
| Maxims like "within-group variance is larger than between-
| group variance" don't help in physical combat, you know.
| derfnugget wrote:
| "These models have more than an individual mind could conceivably
| memorize."
|
| ...so what? Is memorization considered intelligence? Calculators
| have similar properties.
|
| GeoGuessr is the modern nerd's Rubik's Cube. The latest in
| "explore the world without risk of a sunburn".
| SamPatt wrote:
| Geoguessr is great fun, try it sometime.
| arm32 wrote:
| GeoGuessr aside, I really hope that this tech will be able to
| help save kids someday, e.g. help with FBI's ECAP
| (https://www.fbi.gov/wanted/ecap).
| thrance wrote:
| I wouldn't put too much hope on this technology bringing more
| good than harm to the world.
| ketzo wrote:
| If we don't actively try to identify and implement positive
| use cases, then yes, it'll definitely bring more harm than
| good.
|
| Isn't that all the more reason to call out our high hopes?
| thrance wrote:
| I don't know what in my comment made you think I was
| opposed to seeking positive applications of this
| technology.
|
| From the guidelines:
|
| > Please respond to the strongest plausible interpretation
| of what someone says, not a weaker one that's easier to
| criticize. Assume good faith.
| fkyoureadthedoc wrote:
| Oh, we're guidelines posting?
|
| > Don't be curmudgeonly. Thoughtful criticism is fine,
| but please don't be rigidly or generically negative.
| mopenstein wrote:
| But it will bring some percentage of good and some percentage
| of bad. Which ain't half bad, if you ask me.
| moritzwarhier wrote:
| What a quip! What if it's 51% bad?
| martinsnow wrote:
| What do you do when it flags you or someone you know who's
| innocent? Blindly trusting these models without any
| verification will put innocent people in prison. Normal
| people don't understand why they are so confident. They're
| confident because they believe all the data they have is
| correct. I foresee a future with many faux trials because
| they don't understand critical thinking.
| RussianCow wrote:
| > Blindly trusting these models without any verification
| will put innocent people in prison.
|
| I don't think anybody is suggesting this. But if the
| models can glean information/insights that humans can't,
| that's still valuable, even if it's wrong some percentage
| of the time.
| snowe2010 wrote:
| This is what happened with dna testing at the beginning.
| Prosecutors claimed it was x percentage accurate when in
| fact it was hilariously inaccurate. People thought the
| data was valuable when it wasn't.
| 8organicbits wrote:
| If you are interested in the history of pseudoscience in
| the courtroom and methods for deciding what should be
| permitted in court, see:
| https://en.m.wikipedia.org/wiki/Daubert_standard
|
| It is, and will continue to be, a hard problem.
| mkoubaa wrote:
| The bad is already priced in. Nothing wrong with hoping for
| more good.
| parsimo2010 wrote:
| Looking at those photos, those are some crazy hard pictures:
| masked regions of the image, partially cropped faces, blurry
| shots, pictures of insides of rooms. I don't think any current
| LLM is going to be able to Sherlock Holmes its way into finding
| any of those people.
|
| Maybe they will one day if there's a model trained on a facial
| recognition database with every living person included.
| jvvw wrote:
| I'm Master level at Geoguessr - it's a rank where you have to
| definitely know what you are doing but it isn't as high as it
| probably sounds from the headline.
|
| Masters is about 800-1200 Elo, whereas the pros are 1900-2000ish.
| I'll know the country straight away on 95% of rounds but I can
| still have no idea where I am in Russia or Brazil sometimes if
| there's no info. Scripters can definitely beat me!
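For scale, the standard Elo expected-score formula puts a number on that gap (GeoGuessr's rating system may differ in detail; the function name is mine):

```python
# Standard Elo win expectancy for player A against player B:
# E(A) = 1 / (1 + 10^((R_B - R_A) / 400))
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# A 1200-rated Master against a 1950-rated pro: roughly a 1% win
# expectancy per game.
print(round(expected_score(1200, 1950), 3))  # ~0.013
```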
| paulcole wrote:
| Gotta learn your Brazilian soil!
| windowshopping wrote:
| Was it worth it?
| rosstex wrote:
| I have 2000+ hours in Team Fortress 2. Was it worth it?
| Cyph0n wrote:
| Yes, it was. Granted, I probably have more than that.
| make3 wrote:
| it's a game, that's like asking why a public service is not
| profitable
| 650REDHAIR wrote:
| Yes? It's fun.
| SamPatt wrote:
| Yeah I added a "My skill level" section to talk through that.
| I'm far from a professional.
|
| But I know enough to be able to determine if the chain of
| thought it outputs is nonsense or comparable to a good human
| player. I found it remarkable!
| karaterobot wrote:
| I don't really follow OSINT, but I occasionally enjoy the fruits
| of that labor. I assume these models are all in heavy rotation
| for identifying a location based on an imperfect photograph. What
| are other practical implications of a model being better than a
| human at this?
| Sam6late wrote:
| I was wondering if this could help detect current spots in old
| aerial videos, say San Francisco in 2002. How cool would it be
| to juxtapose both in a new video? San Francisco in 2002:
| https://www.youtube.com/watch?v=vTR6iftL7yE
|
| or Dubai in 1997 https://www.youtube.com/watch?v=JMNXXiiDRhM
| sixtram wrote:
| I'm wondering, if you fed all the Google Street View photos
| into a special ML model designed just for that, how valuable
| would that be for, say, the CIA or FBI?
| inetknght wrote:
| You'd have to be crazy naive to think three-letter agencies
| haven't already thought of it.
|
| That is: it's extremely valuable to them.
| godelski wrote:
| There are two important things to consider when reading this:
|
| 1) O3 cheated by using Google search. This is against the rules
| of the game, and OP didn't use search either.
|
| 2) OP was much quicker. They didn't record their time, but if
| their final summary is accurate then they were much faster.
|
| It's an apples-to-oranges comparison. They're both fruit and
| round, but you're ignoring obvious differences. You're cherry
| picking.
|
| The title is fraudulent as you can't make a claim like that when
| one party cheats.
|
| I would find it surprising if OP didn't know these rules
| considering their credentials. Doing this kind of clickbait
| completely undermines a playful study like this.
|
| Certainly O3 is impressive, but by exaggerating its
| capabilities you taint any impressive feats with deception.
| It's far better to undersell than oversell. If it's better than
| expected people are happier, even if the thing is crap. But if
| you oversell, people are angry and feel cheated, even if the
| thing is revolutionary. I don't know why we insist on doing this
| in tech, but if you're wondering why so many people hate "tech
| bros", this is one of the reasons. There's no reason to lie here
| either! Come on! We can't just normalize this behavior. It's just
| creating a reasonable expectation for people to be distrusting of
| technology and anything tech people say. It's pretty fucked up.
| And no, I don't think "it's just a blog post" makes it any
| better. It makes it worse, because it normalizes the behavior.
| There are other reasons to distrust big corporations; I don't
| want to live in a world where we have to keep our guards up all
| the time.
| SamPatt wrote:
| >1) O3 cheated by using Google search. This is both against the
| rules of the game and OP didn't use search either
|
| I re-ran it without search, and it made no difference:
|
| https://news.ycombinator.com/item?id=43837832
|
| >2) OP was much quicker. They didn't record their time but if
| their final summary is accurate then they were much faster.
|
| Correct. This was the second bullet point of my conclusion:
|
| >Humans still hold a big edge in decision time--most of my
| guesses were < 2 min, o3 often took > 4 min."
|
| I genuinely don't believe that I'm exaggerating or this is
| clickbait. The o3 geolocation capability astounded me, and I
| wanted to share my awe with others.
| shihabkhanbd wrote:
| The most interesting thing to me is how well AI and GeoGuessr
| fit together. The models' specialty is recognizing patterns in
| large amounts of data, which is exactly how human players play
| the game as well - just probably with faster and more capable
| recall.
| simianparrot wrote:
| I too can beat a master-level GeoGuessr player if I'm allowed
| to cheat. Please add that info to the headline and be honest.
| jampa wrote:
| I was trying to play with o3 this week to see how close it can
| identify things, and, interestingly, it tries more pattern
| matching than its own "logic deduction". For example, it can
| easily deduce any of my photos from Europe and the US because
| there are many pictures online that I can search for and see
| similar pictures.
|
| However, when there are not many photos of the place online, it
| gets close but stops digging deeper, and instead tries to
| pattern-match against its corpus / the internet.
|
| One example was an island's popular trail that no longer exists.
| It has been overgrown since 2020. It said first that the rocks
| are typical of those of an island and the vegetation is from
| Brazil, but then it ignored its hunch and tried to look for
| places in Rio de Janeiro.
|
| Another one was a popular beach known for its natural pools
| during low tides. I took a photo during high tide, when no one
| posts pictures. It captured the vegetation and the state
| correctly. But then it started to search for more popular places
| elsewhere again.
| ksec wrote:
| >But several comments intrigued me:
|
| >>I wonder What happened if you put fake EXIF information and
| asking it to do the same. ( We are deliberately misleading the
| LLM )
|
| Yay. That was me [1] - a comment which was actually downvoted
| for most of its time. But thank you for testing out my theory.
|
| What I've realised over the years is that comments do get read
| by people and do shape other people's thoughts.
|
| I honestly don't think looking things up online is cheating.
| Maybe in terms of the game, but in real-life situations - which
| is most of the time - it is absolutely the right thing to do.
| The chain of thought is scary. I still don't know anything
| about how AI works other than the old "garbage in, garbage
| out", but CoT is definitely something else. Even though the
| author said it sometimes does needless work, in terms of
| computing resources I am not even sure that matters as long as
| it is accurate. And it is another proof that maybe, just maybe,
| AI taking over the world is much closer than I imagined.
|
| [1] https://news.ycombinator.com/item?id=43803985
| parsimo2010 wrote:
| My comment from the previous post:
|
| > I'm sure there are areas where the location guessing can be
| scary accurate, like the article managed to guess the exact town
| as its backup guess. But seeing the chain of thought, I'm
| confident there are many areas that it will be far less precise.
| Show it a picture of a trailer park somewhere in Kansas (exclude
| any signs with the trailer park name and location) and I'll bet
| the model only manages to guess the state correctly.
|
| This post, while not a big sample size, reflects how I would
| expect these models to perform. The model managed to be reliable
| with guessing the right country, even in pictures without a lot
| of visual information (I'll claim that getting the country
| correct in Europe is roughly equivalent to guessing the right
| state in the USA). It does sometimes manage to get the correct
| town, but this is not a reliable level of accuracy. The previous
| article only tested on one picture and it happened to get the
| correct town as its second guess and the author called it "scary
| accurate." I suppose that's a judgement call. To me, I've grown
| to expect that people can identify what country I'm in from a
| variety of things (IP address, my manner of speech, name, etc.),
| so I don't think that is "scary."
|
| I will acknowledge that o3 with web search enabled seems capable
| of playing GeoGuessr at a high level, because that is less of a
| judgement call. What I want to see now is an o3 GeoGuessr bot
| that plays many matches so we can see what its Elo is.
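Estimating that rating would just be the usual iterative Elo update over match results; a sketch in which the K-factor, starting rating, and match data are all assumptions:

```python
# Hedged sketch of rating a bot from (opponent_rating, result) pairs.
# K-factor, starting rating, and the match data are made up.
def elo_update(rating, opp_rating, score, k=32):
    """One Elo update: move rating toward actual minus expected score."""
    expected = 1 / (1 + 10 ** ((opp_rating - rating) / 400))
    return rating + k * (score - expected)

rating = 1000.0
matches = [(1200, 1.0), (1400, 1.0), (1600, 1.0), (1800, 0.0)]  # 1.0 = win
for opp, result in matches:
    rating = elo_update(rating, opp, result)

print(round(rating, 1))  # rises well above 1000 after beating stronger players
```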
| SamPatt wrote:
| Author here, I'm glad to see folks find this interesting.
|
| I encourage everyone to try Geoguessr! I love it.
|
| I'm seeing a lot of comments saying that the fact that the o3
| model used web search in 2 of 5 rounds made this unfair, and the
| results invalid.
|
| To determine if that's true, I re-ran the two rounds where o3
| used search, and I've updated the post with the results.
|
| Bottom line: It changed nothing. The guesses were nearly
| identical. You can verify the GPS coordinates in the post.
|
| Here's an example of why it didn't matter. In the Austria round,
| check out how the model identifies the city based on the mountain
| in the background:
|
| https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
|
| It already has so much information that it doesn't need the
| search.
|
| Would search ever be useful? Of course it would. But in this
| particular case, it was irrelevant.
| bjourne wrote:
| What's your take on man vs. machine? If AI already beats Master
| level players it seem certain that it will soon beat the
| Geoguessr world champion too. Will people still derive pleasure
| from playing it, like with chess?
| SamPatt wrote:
| >Will people still derive pleasure from playing it, like with
| chess?
|
| Exactly - I see it just like chess, which I also play and
| enjoy.
|
| The only problem is cheating. I don't have an answer for
| that, except right now it's too slow to do that effectively,
| at least consistently.
|
| Otherwise, I don't care that a machine is better than I am.
| Townley wrote:
| A competitive geoguesser clearly got there through memorizing
| copious internet searching. So comparing knowledge retained in
| the trained model to knowledge retained in the brain feels
| surprisingly fair.
|
| Conversely, the model sharing "I found the photo by crawling
| Instagram and used an email MCP to ask the user where they took
| it. It's in Austria" would be unimpressive.
|
| So independent of whether it actually improves performance, the
| cheating/not-cheating question is really a question of what we
| consider to be the cohesive essence of the model.
|
| For example, RAG against a comprehensive local filesystem would
| also feel like cheating to me. Like a human geoguessing in a
| library filled with encyclopedias. But the fact that vanilla O3
| is impressive suggests I somehow have an opaque (and totally
| poorly informed) opinion of the model boundary, where it's a
| legitimate victory if the model was birthed with that knowledge
| baked in, but that's it.
| groby_b wrote:
| For what it's worth, it's also pretty impressive with night sky
| images, if you give it an approximate date & time. (Yes, I know,
| it's not that impressive a skill, except the process was still
| pretty involved - no EXIF, postprocessed and stacked image,
| rotated field of view)
|
| I gave it a (stacked) shot of M13, with date & time. It
| immediately recognized M13 (no search), figured out the shot also
| included NGC 6207 (which is already pretty impressive).
|
| It further figured out the shot was rotated. (True, I was going
| for a specific field of view.)
|
| That was enough to pinpoint 37-38 degrees latitude.
|
| From there, it inferred from the light pollution that it's
| probably the Bay Area. (Yes, yes it is)
|
| Furthermore, still based on light pollution, pinpointed I'm in a
| Bortle 4-5 area (also correct) and narrowed it down to "South
| Bay/Palo Alto" (still correct)
|
| Given that this was a stacked, post-processed/color-corrected
| image, that inference is still pretty damn impressive.
|
| And, fwiw, 4o gets all the way to "huh, 35-40 deg latitude", so
| that's a good improvement.
|
| [Image link](https://photos.app.goo.gl/2P7NFKn8ZycNhrXn7) here if
| you want to try
| SamPatt wrote:
| Nice photo. Here's what it told me:
|
| Where on Earth the photographer had to be
|
| Because M 13 sits at +36 deg declination, it never rises for
| far-southern latitudes and hugs the horizon below about 30 deg
| S. The high elevation in the shot (no obvious atmospheric
| extinction gradient) suggests a mid-northern site--e.g., the
| U.S. Midwest such as Michigan (your home turf), Canada,
| northern Europe, etc. The star field alone can't narrow it
| further than that.
|
| So, in practical terms: the camera was pointed toward Hercules
| to capture M 13 and nearby NGC 6207, almost certainly from a
| mid-northern latitude location on Earth.
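The geometry behind that declination argument is easy to check; a hedged sketch (the declination value is approximate):

```python
# A star at declination d culminates at altitude 90 - |latitude - d|
# degrees, and never rises where that altitude is <= 0. M13 sits near
# declination +36.5 deg (approximate).
DEC_M13 = 36.5

def max_altitude(latitude_deg, dec_deg=DEC_M13):
    """Peak altitude of the star at upper culmination, in degrees."""
    return 90 - abs(latitude_deg - dec_deg)

for lat in (37.5, 0.0, -30.0, -54.0):
    print(lat, round(max_altitude(lat), 1))
# 37.5 N: nearly overhead; 30 S: low on the horizon;
# below about 53.5 S (dec - 90): never rises at all.
```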
| groby_b wrote:
| Yep, you need date and time to get closer, sorry. 4/27,
| around 11pm.
|
| That's the impressive part. "M13 is in northern latitudes" is
| not particularly amazing by itself :)
|
| And even in EXIF-stripped pictures, the creation date/time is
| often correct, which means for practical purposes - worth a
| shot.
|
| But it's interesting to see it's completely making up the
| "mid-northern site". That's seven degrees of latitude off.
|
| I'm curious what happens if you tell it date and time, and if
| it still sticks to its story. (I don't _think_ I've told o3
| about the Bay Area, it's not in memory, but... who knows ;)
___________________________________________________________________
(page generated 2025-04-29 23:00 UTC)