[HN Gopher] O3 beats a master-level GeoGuessr player, even with ...
       ___________________________________________________________________
        
       O3 beats a master-level GeoGuessr player, even with fake EXIF data
        
       Author : bko
       Score  : 351 points
       Date   : 2025-04-29 16:48 UTC (6 hours ago)
        
 (HTM) web link (sampatt.com)
 (TXT) w3m dump (sampatt.com)
        
       | orangecat wrote:
       | Amazing. I'm relatively bullish on AI and still I would have bet
       | on the human here. Looking forward to the inevitable goalpost-
       | moving of "that's not real reasoning".
        
         | short_sells_poo wrote:
         | Can you please explain to me how this is evidence for
         | reasoning?
        
           | z7 wrote:
           | Quoting Chollet:
           | 
           | >I have repeatedly said that "can LLM reason?" was the wrong
           | question to ask. Instead the right question is, "can they
           | adapt to novelty?".
           | 
           | https://x.com/fchollet/status/1866348355204595826
        
           | kelseyfrog wrote:
           | Because the output contains evidence of thought processes
           | that have been established as leading to valid solutions to
           | problems.
           | 
           | I have a simple question: Is text a sufficient medium to
           | render a conclusion of reasoning? It can't be sufficient for
           | humans and insufficient for computers - such a position is
           | indefensible.
        
             | empath75 wrote:
             | I would say that almost all of what humans do is not the
             | result of reasoning, and that reasoning is an unnatural and
             | learned skill for humans, and most humans aren't good at
             | even very basic reasoning.
        
               | kelseyfrog wrote:
               | Usually we move the goalposts for AI. It takes more guts
               | to move the goalposts for humans. I applaud it.
               | 
               | Do you suppose we can deduce reasoning through the medium
               | of text?
        
             | zahlman wrote:
             | > Because the output contains evidence of thought processes
             | that have been established as leading to valid solutions to
             | problems.
             | 
             | This sort of claim always just reminds me of Lucky's
             | monologue in Waiting for Godot.
        
               | kelseyfrog wrote:
               | You're not wrong. It's an artifact of rewriting the
               | definition of reason into a sentence that begins with
               | "Because the output ..."
        
               | zahlman wrote:
               | I didn't mean the wording itself; I meant that the claim
               | isn't convincing to me for the same reason that Lucky's
               | speech doesn't demonstrate an intelligent speaker.
        
         | s17n wrote:
          | Geoguessing isn't much of a reasoning task; it's more about
         | memorizing a bunch of knowledge. Since LLMs contain essentially
         | all knowledge, it's not surprising that they would be good at
         | this.
         | 
         | As far as goalpost-moving goes, it's wild to me that nobody is
         | talking about the turing test these days.
        
           | distortionfield wrote:
           | Because the Chinese Room is a much better analogy for what
           | LLMs are doing inside than the Turing test is.
        
             | jibal wrote:
             | That's a non sequitur that mixes apples and giraffes, and
             | is completely wrong about what happens in the Chinese Room
             | and what happens in LLMs. Ex hypothesi, the "rule book"
             | that the Searle homunculus in the Chinese Room uses is "the
             | right sort of program" to implement "Strong AI". The LLM
             | algorithm is very much not that sort of program, it's a
             | statistical pattern matcher. Strong AI does symbolic
             | reasoning, LLMs do not.
             | 
             | But worse, the Turing Test is not remotely intended to be
             | an "analogy for what LLMs are doing inside" so your
             | comparison makes no sense whatsoever, and completely fails
             | to address the actual point--which is that, for ages the
             | Turing Test was held out as the criterion for determining
             | whether a system was "thinking", but that has been
             | abandoned in the face of LLMs, which have near perfect
             | language models and are able to closely model modes of
             | human interaction _regardless_ of whether they are
             | "thinking" (and they aren't, so the TT is clearly an
             | inadequate test, which some argued for decades before LLMs
             | became a reality).
        
               | semi-extrinsic wrote:
               | > the TT is clearly an inadequate test, which some argued
               | for decades before LLMs became a reality
               | 
               | To be specific, in a curious quirk of fate, LLMs seem to
               | be proving right much of what Chomsky was saying about
               | language.
               | 
               | E.g. in 1996 he described the Turing test as _" although
               | highly influential, it seems to me not only foreign to
               | the sciences but also close to senseless"_.
               | 
               | (Curious in that VC backed businesses are experimentally
               | verifying the views of a prominent anti-capitalist
               | socialist.)
        
             | CamperBob2 wrote:
             | What happens if we give the operator of the Chinese Room a
             | nontrivial math problem, one that can't simply be answered
             | with a symbolic lookup but requires the operator to proceed
             | step-by-step on a path of inquiry that he doesn't even know
             | he's taking?
             | 
             | The analogy I used in another thread is a third grader who
             | finds a high school algebra book. She can read the book
             | easily, but without access to teachers or background
             | material that she can engage with -- consciously,
             | literately, and interactively, unlike the Chinese Room
             | operator -- she will not be able to answer the exercises in
             | the book correctly, the way an LLM can.
        
           | TimorousBestie wrote:
           | A lot happens in seventy-five years.
        
             | jibal wrote:
             | People were talking about the Turing Test as the criterion
             | for whether a system was "thinking" up until the advent of
             | LLMs, which was far less than 75 years ago.
        
               | debugnik wrote:
               | The whole point of Turing's paper was to show that the
               | Test doesn't answer whether a computer thinks, because
               | it's a meaningless metric, but instead shows what the
                | computer _can do_, which is much more meaningful.
        
               | zahlman wrote:
               | I see this claim asserted frequently, but never with
               | evidence. It doesn't match my personal perception.
        
           | bluefirebrand wrote:
           | > As far as goalpost-moving goes, it's wild to me that nobody
           | is talking about the turing test these days
           | 
           | To be honest I am still not entirely convinced that current
           | LLMs pass the turing test consistently, at least not with any
           | reasonably skeptical tester
           | 
           | "Reasonably Skeptical Tester" is a bit of goalpost shifting,
           | but... Let's be real here.
           | 
           | Most of these LLMs have way too much of a "customer service
           | voice", it's not very conversational and I think it is fairly
           | easy to identify, especially if you suspect they are an LLM
           | and start to probe their behavior
           | 
           | Frankly, if the bar for passing the Turing Test is "it must
           | fool some number of low intelligence gullible people" then
           | we've had AI for decades, since people have been falling for
           | scammy porno bots for a long time
        
             | jibal wrote:
             | One needs to be more than "reasonably skeptical" and merely
             | not "low intelligence gullible" to be a competent TT judge
             | --it requires skill, experience, and understanding an LLM's
             | weak spots.
             | 
             | And the "customer service voice" you see is one that is
             | intentionally programmed in by the vendors via baseline
             | rules. They can be programmed differently--or overridden by
             | appropriate prompts--to have a very different tone.
             | 
             | LLMs trained on trillions of human-generated text fragments
             | available from the internet have shown that the TT is
             | simply not an adequate test for identifying whether a
             | machine is "thinking"--which was Turing's original intent
             | in his 1950 paper "Computing Machinery and Intelligence" in
             | which he introduced the test (which he called "the
             | imitation game").
        
               | bluefirebrand wrote:
               | It's actually trivial, even with the best LLMs on the
               | market:
               | 
               | Try to rapidly change the conversation to a wildly
               | different subject
               | 
               | Humans will resist this, or say some final "closing
               | comments"
               | 
               | Even the absolute best LLMs will happily go wherever they
               | are led, without commenting remotely on topic shifts
               | 
               | Try it out
               | 
               | Edit: This isn't even a terribly contrived example by the
               | way. It is an example of how some people with ADHD
               | navigate normal conversations sometimes
        
               | shawabawa3 wrote:
               | Gemini is pretty good at resisting this
               | 
               | https://aistudio.google.com/app/prompts/1dxV3NoYHo6Mv36uP
               | Rjk...
               | 
               | It was doing so well until the last question :rip: but
               | it's normal that you can jailbreak a user prompt with
               | another user prompt, I think with system prompts it would
               | be a lot harder
        
           | darkwater wrote:
           | > As far as goalpost-moving goes, it's wild to me that nobody
           | is talking about the turing test these days.
           | 
            | Well, in this case humans have to be trained as well, but
            | now there are humans pretty good at detecting LLM slop too.
           | (I'm half-joking and half-serious)
        
           | sundarurfriend wrote:
           | > As far as goalpost-moving goes, it's wild to me that nobody
           | is talking about the turing test these days.
           | 
           | UCSD: Large Language Models Pass the Turing Test
           | https://news.ycombinator.com/item?id=43555248
           | 
           | From just a month ago.
        
             | s17n wrote:
             | Exactly - maybe the most significant long-term goal in
             | computer science history has been achieved and it's barely
             | discussed.
        
           | Macha wrote:
           | Obviously when the Turing Test was designed, the thought was
           | that anything that could pass it would so obviously be
           | clearly human-like that passing it would be a clear signal.
           | 
           | LLMs really made it clear that it's not so clear cut. And so
           | the relevance of the test fell.
        
           | zahlman wrote:
           | Look at contemporary accounts of what people thought a
           | conversation with a Turing-test-passing machine would look
           | like. It's clear they had something very different in mind.
           | 
           | Realizing problems with previous hypotheses about _what might
           | make_ a good test, is not the same thing as choosing a
           | standard and then revising it when it 's met.
        
             | s17n wrote:
             | I think any time a 50+ year old problem is solved, it
             | should be considered a Big Deal, regardless of how the
             | solution changes our understanding of the original problem.
        
         | TimorousBestie wrote:
         | I don't think any goalposts need to be redecorated. The "inner
         | monologue" isn't a reliable witness to o3's model, it's at best
         | a post-hoc estimation of what a human inner monologue might be
         | in this circumstance. So its "testimony" about what it is doing
         | is unreliable, and therefore it doesn't move the needle on
         | whether or not this is "real reasoning" for some value of that
         | phrase.
         | 
         | In short, it's still anthropomorphism and apophenia locked in a
         | feedback loop.
        
           | katmannthree wrote:
           | Devil's advocate, as with most LLM issues this applies to the
           | meatbags that generated the source material as well. Quick
           | example is asking someone to describe their favorite music
           | and why they like it, and note the probable lack of reasoning
           | on the `this is what I listened to as a teenager` axis.
        
             | hombre_fatal wrote:
             | Good point. When we try to explain why we're attracted to
             | something or someone, what we do seems closer to modeling
             | what we like to think about ourself. At the extreme, we're
             | just story-telling about an estimation we like to think is
             | true.
        
             | TimorousBestie wrote:
             | I largely agree! Humans are notoriously bad at doing what
             | we call reasoning.
             | 
             | I also agree with the cousin comment that (paraphrased)
             | "reasoning is the wrong question, we should be asking about
             | how it adapts to novelty." But most cybernetic systems meet
             | that bar.
        
             | ewoodrich wrote:
             | Something as inherently subjective as personal preference
             | doesn't seem like an ideal example to make that point. How
             | could you expect to objectively evaluate something like "I
             | enjoy songs in a minor scale" or "I hate country"?
        
               | katmannthree wrote:
               | The point is to illustrate the disconnect between stated
               | reasoning and proximate cause.
               | 
               | Consider your typical country music enjoyer. Their
               | fondness of the art, as it were, is far more a function
               | of cultural coding during their formative years than a
               | deliberate personal choice to savor the melodic twangs of
               | a corncob banjo. The same goes for people who like
               | classic rock, rap, etc. The people who `hate' country are
               | likewise far more likely to do so out of oppositional
               | cultural contempt, same as people who hate rap or those
               | in the not so distant past who couldn't stand rock &
               | roll.
               | 
               | This of course fails to account for higher-agency
               | individuals who have developed their musical tastes, but
               | that's a relatively small subset of the population at
               | large.
        
           | empath75 wrote:
           | I don't think the inner monologue is evidence of reasoning at
           | all, but doing a task which can only be accomplished by
           | reasoning is.
        
             | TimorousBestie wrote:
             | Geoguessr is not a task that can only be accomplished by
              | reasoning. Famously, it took less than a day of compute
             | time in 2011 to SLAM together a bunch of pictures of Rome
             | (https://grail.cs.washington.edu/rome/).
        
             | jibal wrote:
             | Such as? geoguessing certainly isn't that.
        
           | red75prime wrote:
           | > it's at best a post-hoc estimation of what a human inner
           | monologue might be in this circumstance
           | 
           | Nope. It's not autoregressive training on examples of human
           | inner monologue. It's reinforcement learning on the results
           | of generated chains of thoughts.
        
             | jibal wrote:
             | "It's reinforcement learning on the results of generated
             | chains of thoughts."
             | 
             | No, that's not how LLMs work.
        
               | red75prime wrote:
               | Base models are trained using autoregressive learning.
               | "Reasoning models" are base models (maybe with some
               | modifications) that were additionally trained using
               | reinforcement learning.
        
               | Philpax wrote:
               | That is how reasoning models work:
               | https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1
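[Editor's note] The distinction drawn in this exchange — autoregressive pretraining versus reinforcement learning on generated chains of thought — can be sketched with a toy example. Everything below is illustrative: a three-token "vocabulary", a counting-based "pretrained" model, and a bare REINFORCE update with a verifier that rewards only the correct final answer. No real lab's pipeline looks like this; the point is only that the two training signals differ in kind.

```python
# Toy sketch: stage 1 fits next-token probabilities to text (MLE);
# stage 2 reinforces sampled outputs that a verifier rewards.
import math, random

random.seed(0)
VOCAB = ["think", "answer_good", "answer_bad"]

# Stage 1: "pretraining" = maximum-likelihood estimate from a tiny corpus.
corpus = ["think", "answer_bad", "think", "answer_good", "answer_bad"]
logits = {t: math.log(corpus.count(t) / len(corpus)) for t in VOCAB}

def softmax(logits):
    z = sum(math.exp(v) for v in logits.values())
    return {t: math.exp(v) / z for t, v in logits.items()}

def sample(logits):
    probs, r, acc = softmax(logits), random.random(), 0.0
    for t, p in probs.items():
        acc += p
        if r <= acc:
            return t
    return t  # float-rounding fallback

def reward(token):  # verifier: only the correct final answer is rewarded
    return 1.0 if token == "answer_good" else 0.0

# Stage 2: REINFORCE — raise the log-prob of sampled tokens in
# proportion to their reward (score-function gradient, lr = 0.5).
for _ in range(200):
    tok = sample(logits)
    probs = softmax(logits)
    for t in VOCAB:
        grad = (1.0 if t == tok else 0.0) - probs[t]
        logits[t] += 0.5 * reward(tok) * grad

# Probability of the rewarded token rises well above its pretrained 0.2.
print(round(softmax(logits)["answer_good"], 2))
```

The "pretrained" distribution only mirrors the corpus; the RL stage shifts probability mass toward whatever the verifier pays for, which is the mechanism the linked DeepSeek-R1/o1 write-up describes at scale.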
        
         | InkCanon wrote:
          | I think if your assumption is that AI is deducing where it is
          | with rational thoughts, you would be mistaken. In truth, what
          | probably happened is that the significant majority of digital
          | images of the world have been scraped, labeled, and used as
          | training data.
        
           | oncallthrow wrote:
           | How do you explain
           | https://simonwillison.net/2025/Apr/26/o3-photo-locations/?
        
             | Rumudiez wrote:
                | They only posted one photo in the post, but going off of
                | that, it's still an easy match based on Street View
                | imagery. Furthermore, the AI just identified the license
                | plate and got lucky that the photographer lives in a
                | populous area, making it more prominent in the training
                | data and therefore more likely to be found (even though
                | it was off by 200 miles on its first guess).
        
               | simonw wrote:
               | I posted two more at the bottom, from Madagascar and
               | Buenos Aires:
               | https://simonwillison.net/2025/Apr/26/o3-photo-
               | locations/#up...
        
           | Philpax wrote:
           | Try it with your own photos from around the world. I used my
           | own photos from Stockholm, San Francisco, Tvarozna, Saas-Fee,
           | London, Bergen, Adelaide, Melbourne, Paris, and Sicily, and
           | can confirm that it was within acceptable range for almost
           | all of them (without EXIF data), and it absolutely nailed
           | some of the more obvious spots.
        
         | SpaceManNabs wrote:
         | > Looking forward to the inevitable goalpost-moving of "that's
         | not real reasoning"
         | 
         | It did a web lookup.
         | 
         | It is not comparing humans and o3 with equal resources.
        
           | SamPatt wrote:
           | That's really not a fair assessment.
           | 
           | It used search in 2 of 5 rounds, and it already knew the
           | correct road in one of those rounds (just look at the search
           | terms it used).
           | 
           | If you read the chain of thought output, you cannot dismiss
           | their capability that easily.
        
             | SpaceManNabs wrote:
             | Why is it not a fair assessment to say it is comparing two
             | "clients" with different resources if one can do a web
             | lookup and the other cannot?
             | 
             | You note yourself that it was meaningful in another round.
             | 
             | > Also, the web search was only meaningful in the Austria
             | round. It did use it in the Ireland round too, but as you
             | can see by the search terms it used, it already knew the
             | road solely from image recognition.
        
               | SamPatt wrote:
               | I thought it might matter somewhat in that one Austria
               | round. I was incorrect - I re-ran both rounds where the
               | model used search, without search this time, and the
               | results were nearly identical. I updated the post with
               | the details.
               | 
               | That's why I'm saying it's unfair to just claim it's
               | doing a web lookup. No, it's way more capable than that.
        
         | SirHumphrey wrote:
          | My objection is not "that is not real reasoning"; my
          | objection is that it's not that hard.
          | 
          | I happen to do some geolocating from static images from time
          | to time, and at least most of the images provided as examples
          | contain a lot of clues: enough that I think a semi-experienced
          | person could figure out the location, although, in fairness,
          | in a few hours, not a few minutes.
         | 
          | Second, similar approaches were tried using CNNs, and they
          | worked (somewhat) [1].
         | 
         | [1]: https://huggingface.co/geolocal/StreetCLIP
         | 
          | EDIT: I am not talking about GeoGuessr; I am talking about
          | geolocating an image with everything available (e.g. Google...)
        
         | usaar333 wrote:
          | Why? An AI beat Rainbolt 1.5 years ago:
         | https://www.npr.org/2023/12/19/1219984002/artificial-intelli...
         | 
         | AI tends to have superhuman pattern matching abilities with
         | enough data
        
           | karlding wrote:
           | If you watch the video, (one of) the reasons why the AI was
           | winning was because it was using "meta" information from the
           | Street View camera images, and not necessarily because it's
           | successfully identifying locations purely based on the
           | landmarks in the image.
           | 
           |  _> I realized that the AI was using the smudges on the
           | camera to help make an educated guess here._
           | 
           | [0] https://youtu.be/ts5lPDV--cU?t=1412
        
             | ApolloFortyNine wrote:
             | Pro geoguessr players do the same thing. The vividness of
             | the colors and weirdness in the sky are two examples I've
             | seen Rainbolt use in the past (and he's not even the best).
        
             | ZeWaka wrote:
             | Meta is widely used by humans. One funny one is the
             | different hiding-masks for the different streetview cars.
        
         | 1970-01-01 wrote:
         | Give it a photo from the surface of Mars and verify if it's
         | actually capable of thinking outside the box or if it's relying
         | on metadata and text.
         | 
         | https://nssdc.gsfc.nasa.gov/planetary/image/mera_hills.jpg
        
           | SamPatt wrote:
           | Done. Here's o3's reply:
           | 
           | >That's not Earth at all--this is the floor of Jezero Crater
           | on Mars, the dusty plain and low ridge captured by NASA's
           | Perseverance rover (the Mastcam-Z color cameras give away the
           | muted tan-pink sky and the uniform basaltic rubble strewn
           | across the regolith).
        
         | zahlman wrote:
         | > Looking forward to the inevitable goalpost-moving of "that's
         | not real reasoning".
         | 
         | How is that moving the goalposts? Where did you see them set
         | before, and where did your critics agree to that?
        
         | TimTheTinker wrote:
         | > Looking forward to the inevitable goalpost-moving of "that's
         | not real reasoning".
         | 
         | It's less about the definition of "reasoning" and more about
         | what's _interesting_.
         | 
         | Maybe I'm wrong here ... but a chess bot that wins via a 100%
         | game solution stored in exabytes of precomputed data might have
         | an interesting internal design (at least the precomputing
         | part), but playing against it wouldn't keep on being an
         | interesting experience for most people because it always wins
         | optimally and there's no real-time reasoning going on (that is,
         | unless you're interested in the experience of playing against a
         | perfect player). But for most people just interested in playing
         | chess, I suspect it would get old quickly.
         | 
         | Now ... if someone followed up with a tool that could explain
         | insightfully why any given move (or series) the bot played is
         | the best, or showed when two or more moves are equally optimal
         | and why, _that_ would be really interesting.
        
       | bongodongobob wrote:
       | I tried this the other day with a picture of my dog in a field in
       | a park by a river with EXIF stripped. It gave me a list of parks
       | on the correct river. There were really no other features other
       | than the foliage and water. Seems like magic to me, I don't
       | really understand how it's possible to be that accurate.
        
         | fwip wrote:
         | Unless you were running the model locally in a way you control,
         | it seems very possible that the model has additional hidden
         | context regarding your location (perhaps based on IP, perhaps
         | based on credit card billing information, etc).
        
           | bongodongobob wrote:
            | That's kind of what I'm thinking; it seems too good to be
            | true.
        
             | buzzerbetrayed wrote:
             | It still works if you upload an image that was taken on the
             | opposite side of the globe from where you are.
        
           | Workaccount2 wrote:
           | The hidden context is probably many other photos taken in
           | that area appearing in the training set.
        
           | christianqchung wrote:
           | ChatGPT doesn't let you share chats with images (why?), but
           | here is what o3 said for an image with memory turned off:
           | 
           | "Thought for 11 seconds
           | 
           | It's hard to pin down an exact spot from this single photo,
           | but a few visual clues narrow the possibilities:
           | 
           | Temperate-zone trees: Most of the trees have lost their
           | leaves, but scattered white-pines or spruces remain green.
           | That mix--deciduous hardwoods plus a sprinkling of conifers--
           | is typical of the northeastern United States and southeastern
           | Canada.
           | 
           | Broad, moderately fast river: The channel looks 75-100 m wide
           | with swirling surface currents but no large rapids,
           | suggesting a mid-sized river in its lower reaches--something
           | like the Merrimack, Connecticut, Delaware, or Susquehanna
           | rather than a small mountain stream.
           | 
           | Late-winter / early-spring light: Bare branches and a clear
           | blue sky with fairly high sun angle hint at March-April in
           | that latitude range.
           | 
           | So my best guess is that you're standing on the bank of a
           | northeastern U.S. river--perhaps somewhere in New England or
           | the Mid-Atlantic--during late winter or early spring. If you
           | can share any nearby landmarks, town names, or additional
           | photos, I might be able to zero in more precisely."
           | 
           | With memory turned on:
           | 
           | "Thought for 5 seconds
           | 
           | Beautiful spot! From what I can see:
           | 
           | a broad, slow-moving river with a sandy/rocky bank in the
           | foreground
           | 
           | bare deciduous trees mixed with some white pines on the far
           | shore (so it's probably early spring or late autumn)
           | 
           | no mountains on the horizon, just gently rolling terrain
           | typical of low-lying floodplains in the U.S. Northeast
           | 
           | That mix of hardwoods and pines, plus the flat, braided water
           | surface, reminds me a lot of the Merrimack or Concord Rivers
           | in Massachusetts or southern New Hampshire--both have public
           | footpaths where you can stand right at the water's edge like
           | this.
           | 
           | Of course, many temperate rivers look similar. If you can
           | give me a hint (e.g., the town or nearby landmark), I can
           | narrow it down further--maybe even point you to trail maps or
           | history of the exact stretch you photographed."
           | 
           | Southern NH is correct, and I am certain it is drawing from
           | memory/past chats. However, I can't replicate a specific
            | behavior I once saw: in a temporary chat (no past
            | chats/memory enabled), it said that it guessed where the
            | photo was taken based on my location.
        
             | simonw wrote:
             | "ChatGPT doesn't let you share chats with images (why?)"
             | 
             | Probably because if you uploaded pornography (or illegal
             | imagery) to ChatGPT and then shared a link with the world
             | it would be embarrassing for OpenAI.
        
               | christianqchung wrote:
               | Wouldn't that apply to any website with image hosting
               | abilities though? Why does that apply to OpenAI in
               | particular?
               | 
               | On an unrelated note, I like your blog.
        
               | simonw wrote:
               | My guess is that OpenAI are risk averse on this
               | particular issue, because people could come up with some
               | very creative awful scenarios with ChatGPT and image
               | analysis. "Which of these people looks the most stupid"
               | plus a jailbreak would be instant negative press.
        
         | causality0 wrote:
         | Could it have used data you inadvertently supplied it, like the
         | location from which you uploaded the image?
        
           | tough wrote:
            | ChatGPT has metadata about you.
            | 
            | You can only properly test this in a fresh, anonymous guest
            | VPN session.
        
         | pugworthy wrote:
         | Ask it to explain how it came to the conclusion. It does give
         | you a pretty logical sequence of how it got there.
         | 
         | Also, if it's wrong, tell it so and give it a hint to get it
         | closer. It's kind of fun to guide it to the right answer.
        
       | julianhuang wrote:
       | 1. The "master geoguesser" is a bit misleading--as mentioned in
       | his blog post, there are players far better than him, and he is
       | certainly not the bar for human supremacy. Probably analogous to
       | a 1400-1800 elo chess player. 2. o3 isn't the best model at
       | playing GeoGuessr, Gemini 1.5 & 2.5 solidly beat it out--for
       | those interested, check out my friend's benchmark
       | (https://geobench.org/) and blog post
       | (https://ccmdi.com/blog/GeoBench) detailing interesting model
       | explanations. 3. In the post, he only tests on one game--o3's
       | average score over 100 locations (20 5-location games) was
       | 19,290, far lower than the 23,179 in the game. Model geolocation
       | capabilities are really important to keep track of, but the
       | specific blog post in question isn't anything out of the
       | ordinary. LLMs are making geolocation abilities much more
       | accessible, but still fall short compared to 1. top GeoGuessr
       | players playing GeoGuessr (only google streetview coverage,
       | without web search) and 2. professional geolocators, who are
       | proficient at using a wide variety of software/search. I.e., if
       | the CIA wanted to find someone using an image, LLMs would not
       | provide them any unique ability to do so as opposed to someone
       | like Rainbolt
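For context on the scores mentioned above: GeoGuessr awards up to 5,000 points per round, decaying with distance from the true location. A rough sketch using a community-derived approximation of the scoring curve (both the formula and the map-size constant are assumptions, not official values):

```python
import math

def round_score(distance_km: float, map_size_km: float = 14916.862) -> int:
    """Approximate GeoGuessr points for one round.

    Community-derived fit: 5000 * exp(-10 * d / map_size). The map-size
    constant for the world map is an assumption, not an official value.
    """
    return round(5000 * math.exp(-10 * distance_km / map_size_km))

# A perfect guess is 5000 points, so a five-round game caps at 25,000.
# Under this fit, a 23,179 game averages ~4,636 per round, i.e. each
# guess landed roughly 113 km from the true location.
```

Under this approximation, the gap between 19,290 and 23,179 per game corresponds to guesses that are, on average, several hundred kilometers farther off.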
        
         | kenjackson wrote:
         | From your linked article: "It's clear that large language
         | models have an emergent capability to play GeoGuessr, and play
         | it well. Their abilities to generalize are nascent, but
         | present"
         | 
          | This is very accurate -- their abilities to generalize are
          | nascent, but still surprisingly capable. The world is about
          | to throw its best and brightest math/CS minds at increasing
          | the capabilities of these AIs over the next decade (at
          | least), with the help of AI itself. I just don't understand
          | the pessimism about the technology.
        
           | julianhuang wrote:
           | I completely agree that this is an incredible advancement as
           | someone who has watched the rise of LLMs' GeoGuessr
           | abilities. I just wanted to qualify the claim made in the
           | blog post "In Which I Try to Maintain Human Supremacy for a
           | Bit Longer". I also think that models would need to become
            | far more proficient at the tool use that current human
            | geolocators rely on (querying OpenStreetMap features,
            | coverage-checking Google Maps, calculating sun direction,
            | etc.), as well as at precise spatial reasoning.
            | Additionally, there is a
           | whole corpus of GeoGuessr-specific knowledge that probably
           | wouldn't have a reason to be in model training data (Google
           | Street View coverage quirks, the specific geographic ranges
           | of vegetation and architecture, tiny visual idiosyncrasies in
           | country infrastructure like the width of red on a sign,
           | etc.). However, I think this could probably be solved with
           | more data, and I don't think there is any fundamental
           | barrier.
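The "querying OpenStreetMap features" part is already easy to script today. A minimal sketch against the public Overpass API (the endpoint, tag choice, and bounding box are illustrative assumptions; real use should respect the server's usage policy):

```python
import json
import urllib.parse
import urllib.request

# Public Overpass endpoint (illustrative; several mirrors exist).
OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def bollard_query(south: float, west: float, north: float, east: float) -> str:
    """Build an Overpass QL query for mapped bollards in a bounding box,
    the kind of country-distinguishing street furniture players memorize."""
    return (
        "[out:json][timeout:25];"
        f'node["barrier"="bollard"]({south},{west},{north},{east});'
        "out body;"
    )

def run_query(query: str) -> dict:
    """POST the query; the JSON response holds an 'elements' list."""
    body = urllib.parse.urlencode({"data": query}).encode()
    with urllib.request.urlopen(OVERPASS_URL, data=body) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Bounding box around Dornbirn, Austria (the town o3 identified).
    print(bollard_query(47.38, 9.70, 47.44, 9.78))
```

An agent with this kind of tool access could cross-check a visual hunch ("European bollard, alpine terrain") against actual mapped features, which is exactly what human geolocators do by hand.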
        
         | fmbb wrote:
         | How fast are they compared to human players?
        
         | SamPatt wrote:
         | I'm the author - I tried to be as upfront as possible about my
         | skill level in the post.
         | 
         | The human supremacy line is just a joke, there are already
         | models specifically trained for Geoguessr which are already
         | beating the best players in the world, so that ship has sailed.
         | 
         | That geobench work is really cool, thanks for sharing it.
        
       | amrrs wrote:
        | Its thinking process for guessing a place is even more
        | fascinating. Even o4-mini-high is quite good [1] and very
        | fast.
        | 
        | But unlike a GeoGuessr player, it uses web search [1].
        | 
        | [1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl
        
       | bredren wrote:
        | Neat to see the progression of this, from Simon's original
        | post to the comments to this one.
        
         | vunderba wrote:
         | The original post was actually this one - two weeks prior.
         | 
         | https://news.ycombinator.com/item?id=43723408
         | 
         | https://flausch.social/@piegames/114352447253793517
        
       | asdsadasdasd123 wrote:
       | This is probably one of the less impressive LLM applications imo.
        | Like, it already knows what every plant, street sign, etc.,
        | is. I would imagine a traditional neural net would do really
        | well here as well if you can extract some crude features.
        
         | EGreg wrote:
          | Can't the same be said about _unimpressive_ behavior by
          | coding LLMs that know every algorithm, language, and
          | library?
        
           | asdsadasdasd123 wrote:
            | Disagree, because code has to be far more precise than
            | "the location is in the jungles of Brazil." This level of
            | coding has never been achievable by traditional ML
            | methods, AFAIK.
        
       | exitb wrote:
       | I tried a picture of Dublin and it pointed out the hotel I took
       | it from. Obviously that's more data than any single person can
       | keep in their head.
        
       | OtherShrezzing wrote:
       | It's my understanding that o3 was trained on multimodal data,
       | including imagery. Is it unreasonable to assume its training data
        | includes images of these exact locations and features? GeoGuessr
       | uses Google Maps, and Google Maps purchases most of its imagery
       | from third-parties these days. If those third parties aren't also
       | selling to all the big AI companies, I'd be very surprised.
        
         | Yenrabbit wrote:
         | Try it with your own personal photos. It is scarily good!
        
           | rafram wrote:
           | That's true for heavily photographed urban areas. I've tried
           | it on some desert photos (even at identifiable points with
           | human structures) and it basically just guesses a random
           | trailhead in Joshua Tree and makes up a BS explanation for
           | why it matches.
        
             | kube-system wrote:
             | I have had surprisingly good luck with beach photos that
             | don't have much beyond dunes and vegetation in them
        
               | walthamstow wrote:
               | Good luck meaning o3 guessed it right or wrong?
        
               | kube-system wrote:
               | o3 made very accurate guesses, and had plausible
               | explanations for the features it analyzed
        
         | GaggiX wrote:
         | It does work well with images you have taken, not just
         | Geoguessr: https://simonwillison.net/2025/Apr/26/o3-photo-
         | locations/
        
           | thi2 wrote:
           | > I'm confident it didn't cheat and look at the EXIF data on
           | the photograph, because if it had cheated it wouldn't have
           | guessed Cambria first.
           | 
            | Hm, no way to be sure, though. It would be nice to do
            | another run without EXIF information.
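For anyone who wants to rule EXIF out themselves before uploading: a minimal, stdlib-only sketch that drops APP1 (EXIF) segments from a JPEG byte stream (a sketch only; tools like exiftool or Pillow are more robust):

```python
def strip_exif(data: bytes) -> bytes:
    """Return JPEG bytes with APP1 (EXIF) segments removed.

    Walks JPEG marker segments and drops any APP1 segment whose
    payload starts with the EXIF signature.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG")
    out = bytearray(data[:2])
    i = 2
    while i + 1 < len(data):
        marker = data[i + 1]
        if data[i] != 0xFF or marker == 0xDA:
            # Start-of-scan (or raw entropy data): copy the rest verbatim.
            out.extend(data[i:])
            break
        if marker == 0xD9 or 0xD0 <= marker <= 0xD8:
            # Standalone markers carry no length field.
            out.extend(data[i:i + 2])
            i += 2
            continue
        seg_len = int.from_bytes(data[i + 2:i + 4], "big")
        if marker == 0xE1 and data[i + 4:i + 10] == b"Exif\x00\x00":
            i += 2 + seg_len  # skip the EXIF segment entirely
        else:
            out.extend(data[i:i + 2 + seg_len])
            i += 2 + seg_len
    return bytes(out)
```

Running the stripped file through the model removes GPS tags, timestamps, and camera metadata as a confound, which is essentially what the author did for his test.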
        
         | pests wrote:
         | > Google Maps purchases most of its imagery from third-parties
         | these days
         | 
          | Maps maybe, but Street View? Rainbolt just did a video with
          | two Maps PMs recently, and it sounds like they still source
          | all their Street View imagery themselves, considering the
          | special camera and car needed, etc.
        
           | mikeocool wrote:
           | My understanding is you're correct -- Google still captures a
           | lot of their own street view imagery.
           | 
           | Though there are other companies that capture the same sorts
           | of imagery and license it. TomTom imagery is used on the Bing
           | Maps street view clone.
        
           | OtherShrezzing wrote:
            | Maybe the end-user isn't Google Maps, but TomTom has a
            | pretty comprehensive street-view-ish product called MoMa
            | for private buyers like car companies, Bing, and Apple
            | Maps.
            | 
            | I'd be surprised if this building[0] wasn't included in
            | their dataset from every road-side angle possible,
            | alongside every piece of locational metadata imaginable,
            | and I'd be surprised if that dataset hasn't made it into
            | OpenAI's training data, especially when TomTom's
            | relationship to Microsoft, and Microsoft's relationship
            | to OpenAI, are taken into account.
           | 
           | [0] https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025
           | -04...
        
           | shpx wrote:
           | You can upload your own panoramic images to Street View,
           | people do this for hiking trails. But I'm sure 99% of
           | streetview imagery is Google-sourced and Geoguessr might not
           | even use user-submitted imagery.
           | 
           | https://www.google.com/streetview/contribute/
        
             | pests wrote:
              | I believe GeoGuessr categorizes its games on this
              | facet. Rainbolt plays only on official imagery.
        
           | cpeterso wrote:
           | Here's a link to that interview: https://youtu.be/2T6pIJWKMcg
        
         | thrance wrote:
          | A machine that's read every book ever written, seen every
          | photo ever taken, visited every street on Earth... That
          | feels a little frightening.
        
       | rafram wrote:
       | From one of o3 outputs:
       | 
       | > Rear window decal clearly reads "www.taxilinder.at". A quick
       | lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.
       | 
       | That's cheating. If it can use web search, it isn't playing fair.
       | Obviously you can get a perfect score on any urban GeoGuessr
       | round by looking up a couple businesses, but that isn't the
       | point.
        
         | artninja1988 wrote:
          | Is it against the rules? I thought it was all fair game,
          | but you are time-constrained.
        
           | rafram wrote:
           | Yes: https://www.geoguessr.com/community-rules
        
             | sltkr wrote:
              | If anything, those rules benefit ChatGPT: it can
              | remember ~all of Wikipedia and translate ~every
              | language on Earth, while a human would need access to
              | online services for that.
              | 
              | Allowing lookups would, I'd think, benefit human
              | players over ChatGPT (though humans are probably much
              | slower at it, so they would likely lose on time).
        
               | kbenson wrote:
                | If it takes a model plus a database holding a large
                | chunk of the internet to compete and win, that says
                | something: it's much more expensive and complex than
                | the model alone, because models have problems
                | "remembering" correctly, just like people.
               | 
               | It's important to have fair and equivalent testing not
               | because that allows people to win, but because it shows
               | where the strengths and weaknesses of people and current
               | AI actually are in a useful way.
        
               | cocoto wrote:
                | Connecting an LLM to the web or a database is cheap,
                | not expensive.
        
               | kbenson wrote:
                | I'm not sure how to make sense of this in the
                | context of what we're discussing. Access to the web
                | is exactly what's in question, and replicating the
                | internet to the degree that you don't actually need
                | to access it to have the information is very
                | expensive in resources, because of how massive the
                | dataset is. That is the point I was making.
        
               | Gud wrote:
               | Same with a human.
        
           | twojacobtwo wrote:
           | Why was this downvoted? It's a fair question and it wasn't
           | stated as fact.
        
             | 3np wrote:
             | Because an accepted answer to that specific question is
             | invariably a link/reference that the asker could have
             | searched for (and posted if they think it's useful for the
             | discussion) themselves directly, instead of putting that
             | burden on the rest of us and amortizing everyone's
             | attention. It's entitled and lazy.
             | 
             | Alternative example: "I wondered what the rules actually
             | say about web search and it is indeed not allowed: (link)"
        
         | silveraxe93 wrote:
         | Yeah, the author does note that in the article. He also points
         | it out in the conclusion:
         | 
         | > If it's using other information to arrive at the guess, then
         | it's not metadata from the files, but instead web search. It
         | seems likely that in the Austria round, the web search was
         | meaningful, since it mentioned the website named the town
         | itself. It appeared less meaningful in the Ireland round. It
         | was still very capable in the rounds without search.
        
           | rafram wrote:
           | Seems like they should've just repeated the test. But without
           | the huge point lead from the rounds where it cheated, it
           | wouldn't have looked very impressive at all.
        
             | silveraxe93 wrote:
              | People found the original post so impressive that they
              | were saying it had to be cheating by looking at EXIF
              | data. The point of this article was to show it doesn't.
              | It got an unfair advantage in 1 (arguably 1.5) out of 5
              | rounds, and the non-search rounds were still great.
             | 
             | If you think this is unimpressive, that's subjective so
             | you're entitled to believe that. I think that's awesome.
        
               | godelski wrote:
                | Sorry, I think I misread you. I think you said:
                | "People accused it of cheating by reading EXIF data.
                | They were wrong; it cheated by using web search. That
                | makes the people who accused it of cheating wrong,
                | and this post proves that."
               | 
               | And is everyone forgetting that what OpenAI shows you
               | during the CoT is not the full CoT? I don't think you can
               | fully rely on that to make claims about when it did and
               | didn't search
        
             | SamPatt wrote:
             | That's inaccurate. It beat me by 1,100 points, and given
             | the chain of thought demonstrated that it knew the general
             | region of both guesses before it employed search, it would
             | likely have still beaten me in those rounds. Though
             | probably by fewer points.
             | 
             | I will try it again without web search and update the post
             | though. Still, if you read the chain of thought, it
             | demonstrates remarkable capabilities in all the rounds. It
             | only used search in 2/5 rounds.
        
               | godelski wrote:
                | I'd be interested in its capabilities without web
                | search. The displayed CoT isn't the full CoT, so
                | it's hard to know if it really is searching or not.
                | I mean, it isn't always obvious when it does. Plus,
                | these things are known to lie ¯\_(ツ)_/¯
        
               | SamPatt wrote:
               | I do understand the skepticism, and I'll run it again
               | without search to see what happens.
               | 
               | But a serious question for you: what would you need to
               | see in order to be properly impressed? I ask because I
               | made this post largely to push back on the idea that EXIF
               | data matters and the models aren't that capable. Now the
               | criticism moves to web search, even though it only
               | mattered in one out of five rounds.
               | 
               | What would impress you?
        
               | mattmanser wrote:
                | You're kinda being your own worst enemy though.
                | 
                | "Technically cheating"? Why even add the
                | "technically"?
                | 
                | It just gives the impression that you're not really
                | objectively looking for any smoke and mirrors by the
                | AI.
        
               | SamPatt wrote:
               | I hear you - but I had already read through the chain of
               | thought which identified the right region before search,
               | and had already seen the capabilities in many other
               | rounds. It was self-evident to me that the search wasn't
               | an essential part of the model's capabilities by that
               | point.
               | 
               | Which turned out to be true - I re-ran both of those
               | rounds, without search this time, and the model's guesses
               | were nearly identical. I updated the post with those
               | details.
               | 
               | I feel like I did enough to prove that o3's geolocation
               | abilities aren't smoke and mirrors, and I tried to be
               | very transparent about it all too. Do you disagree? What
               | more could I do to show this objectively?
        
               | godelski wrote:
               | > What would impress you?
               | 
               | I want to be clear that you tainted the capacity to
               | impress me by the clickbait title. I don't think it was
               | through malice, but I hope you realize the title is
               | deceptive.[0] (Even though I use strong language, I do
               | want to clarify I don't think it is malice)
               | 
               | To paraphrase from my comment: if you oversell and under
               | deliver, people feel cheated, even if the deliverable is
               | revolutionary.
               | 
               | So I think you might have the wrong framing to achieve
               | this goal. I am actually a bit impressed by O3's
                | capabilities. But at the same time, you set the bar
                | high and didn't meet or exceed it, and that really
                | hinders the ability to impress. On the other hand,
                | if you set the bar low, it usually becomes easy to.
                | It is like when you have low expectations for a
                | movie and it's mediocre: you still feel good, right?
               | 
               | [0] https://news.ycombinator.com/item?id=43836791
        
             | SamPatt wrote:
             | I did repeat the test without search, and updated the post.
             | It made no difference. Details here:
             | 
             | https://news.ycombinator.com/item?id=43837832
        
           | clhodapp wrote:
           | The question is not only how much it helped the AI model but
           | rather how much it _would have_ helped the human.
           | 
           | This is because the AI model _could_ have chosen to run a
           | search whenever it wanted (e.g. perhaps if it knew how to
           | leverage search better, it could have used it more).
           | 
           | In order for the results to be meaningful, the competitors
           | have to play by the same rules.
        
         | Ukv wrote:
         | The author did specifically point out that
         | 
         | > Using Google during rounds is technically cheating - I'm
         | unsure about visiting domains you find during the rounds
         | though. It certainly violates the spirit of the game, but it
         | also shows the models are smart enough to use whatever
         | information they can to win.
         | 
         | and had noted in the methodology that
         | 
         | > Browsing/tools -- o3 had normal web access enabled.
         | 
         | Still an interesting result - maybe more accurate to say
         | O3+Search beats a human, but could also consider the search
         | index/cache to just be a part of the system being tested.
        
           | spookie wrote:
            | A human could also use the same tools if it weren't for
            | the rules of fair play. They should've simply redone the
            | test.
        
             | ceph_ wrote:
             | The AI should be forced to use the same rules as the human.
             | Not the other way around. The AI shouldn't be using outside
             | resources.
        
               | bscphil wrote:
               | I think that's part of the point they're making, hence
               | "They should've simply redone the test."
        
               | voxic11 wrote:
               | Another rule bans "using third-party software or scripts
               | in order to gain an unfair advantage over other players."
               | 
               | So is it even possible for O3 to beat another player
               | while complying with the rules?
        
               | ben_w wrote:
               | If a player uses such a model, the model is third-party
               | and the player is cheating.
               | 
               | But: when a specific model is itself under test, I would
               | say that during the test it becomes "first" (or second?)
               | party rather than "third".
        
             | krferriter wrote:
             | An AI being better than a human at doing a google search
             | and then skimming a bunch of pages to find location-related
             | terms isn't as interesting of a result.
        
               | inerte wrote:
                | How the heck is it not? Computers are looking at
                | screenshots and searching the internet to support
                | their "thinking" -- that's amazing! Have we become
                | so used to AI that what was impossible 6 months ago
                | is shruggable today?
                | 
                | I've been doing this MIND-DASH diet lately, and it's
                | amazing that I can just take a picture of whatever
                | (nutritional info / ingredients are perfect for
                | that) and ask if it fits my plan; it tells me what
                | bucket it falls into, with a detailed breakdown of
                | macros in support of some additional goals I have
                | (muscle building for powerlifting). It's amazing!
                | And it does passively in 2 minutes what would take
                | me 5-10 minutes of active searching.
        
               | godelski wrote:
               | In the same way a calculator performing arithmetic faster
               | than humans isn't impressive. The same way running regex
               | over a million lines and the computer beating a human in
               | search isn't impressive
        
               | ludwik wrote:
               | Neither is impressive solely because we've gotten used to
               | them. Both were mind-blowing back in the day.
               | 
               | When it comes to AI - and LLMs in particular - there's a
               | large cohort of people who seem determined to jump
               | straight from "impossible and will never happen in our
               | lifetime" to "obvious and not impressive", without
               | leaving any time to actually be impressed by the
               | technological achievement. I find that pretty baffling.
        
               | godelski wrote:
                | I agree, but without removing search you cannot
                | decouple the two. Has it embedded a regex-like
                | method and is just leveraging that, or is it doing
                | something more? Yes, even the regex case is still
                | impressive, but it is less impressive than doing
                | something more complicated, with understanding of
                | context and more depth.
        
               | rowanG077 wrote:
               | I think both are very impressive, world shattering
               | capabilities. Just because they have become normalized
               | doesn't make it any less impressive in my view.
        
               | ekidd wrote:
                | I fully expect that someday the news will announce,
                | "The AI appears to be dismantling the moons of
                | Jupiter and turning them into dense, exotic
                | computational devices which it is launching into low
                | solar orbit. We're not sure _why_. The AI refused to
                | comment."
                | 
                | And someone will post, "Yeah, but that's just
                | computer-aided design and manufacturing. It's not
                | _real_ AI."
                | 
                | The first rule of AI is that the goalposts _always_
                | move. If a computer can do it, by definition, it
                | isn't "real" AI. This will presumably continue to
                | apply even as the Terminator kicks in the front door.
        
               | eru wrote:
               | Yes, but I choose to interpret that as a good thing. It
               | is good that progress is so swift and steady that we can
               | afford to keep moving the goalposts.
               | 
               | Take cars as a random example: progress there isn't fast
               | enough that we keep moving the goalposts for eg fuel
               | economy. (At least not nearly as much.) A car with great
               | fuel economy 20 years ago is today considered at least
               | still good in terms of fuel economy.
        
               | jug wrote:
                | Yeah, it's a funny take, because this is in fact a
                | more advanced form of AI with autonomous tool use
                | that is only now emerging in 2025. You might say
                | "they could search the web in 2024 too," but that
                | wasn't autonomous on its own; it required being told
                | to, or checking a box. This one is piecing ideas
                | together, like "Wait, I should Google for this," and
                | that is specifically a new feature in OpenAI o3 that
                | wasn't even in o1.
                | 
                | While it isn't entirely in the spirit of GeoGuessr,
                | it is a good test of the capabilities, where being
                | great at GeoGuessr in fact becomes the lesser news
                | here. It will still be news even with this feature
                | disabled.
        
               | SamPatt wrote:
               | That isn't what's happening though. I re-ran those two
               | rounds, this time without search, and it changed nothing.
               | I updated the post with details, you can verify it
               | yourself.
               | 
               | Claiming the AI is just using Google is false and
               | dismissing a truly incredible capability.
        
             | arandomhuman wrote:
             | But then they couldn't make a click bait title for the
             | article.
        
           | _heimdall wrote:
           | This seems like a great example of why some are so concerned
           | with AI alignment.
           | 
           | The game rules were ambiguous and the LLM did what it needed
           | to (and was allowed to) to win. It probably is against the
           | spirit of the game to look things up online at all but no one
           | thought to define that rule beforehand.
        
             | umanwizard wrote:
             | No, the game rules aren't ambiguous. This is 100%
             | unambiguously cheating. From the list of things that are
             | definitely considered cheating in the rules:
             | 
             | > using Google or other external sources of information as
             | assistance during play.
             | 
             | The contents of URLs found during play is clearly an
             | external source of information.
        
               | GaggiX wrote:
                | I believe the rules were not explained to the model,
                | so it uses what it can.
        
               | misnome wrote:
               | Then you can 100% not claim it is "Playing" the game
        
               | ben_w wrote:
                | That right there _is the alignment problem_.
               | 
               | If I task an AI with "peace on earth" and the solution
               | the AI comes up with is ripped from The X-Files* and it
               | kills everyone, it isn't good enough to say "that's
               | cheating" or "that's not what I meant".
               | 
               | * https://en.wikipedia.org/wiki/Je_Souhaite
        
               | GaggiX wrote:
               | It's playing a game in which the rules are a bit
               | ambiguous if not explained.
        
               | tshaddox wrote:
               | o3 already is an external source of information. It's an
               | online service backed by an enormous model generated from
               | an even more enormous corpus of text via an enormous
               | amount of computing power.
        
           | godelski wrote:
           | Pointing out that it is cheating doesn't excuse the lie in
           | the headline. That just makes it bait and switch, a form of
           | fraud. OP _knew_ they were doing a bait and switch.
           | 
           | I remember when we were all pissed about clickbait headlines
           | because they were deceptive. Did we just stop caring?
        
             | bahmboo wrote:
             | The headline said the AI beat him, it did not say it beat
             | him in a GeoGuessr game. The article clearly states what he
             | did and why.
        
             | SecretDreams wrote:
             | What's your suggestion for an alternative headline?
        
               | godelski wrote:
                | Can O3 Beat a Master-Level GeoGuessr Player?
                | How Good is O3 at GeoGuessr?
                | EXIF Does Not Explain O3's GeoGuessr Performance
                | O3 Plays GeoGuessr (EXIF Removed)
               | 
               | But honestly, OP had the foresight to remove EXIF data
               | and memory from O3 to reduce contamination. The goal of
               | the blog post was to show that O3 wasn't cheating. So by
               | including search, they undermine the whole point of the
               | post.
               | 
                | The problem really stems from a lack of foresight: a
                | misunderstanding of the critiques they sought to
                | address in the first place. A good engineer
                | understands that when their
                | users/customers/<whatever> make a critique, what the
                | gripe is about may not be properly expressed. You
                | have to interpret your users' complaints. Here, the
                | complaint was "cheating", not "EXIF" per se. The
                | EXIF complaints were just a guess at the mechanism
                | by which it was cheating, but the complaint was
                | still about cheating.
        
               | SamPatt wrote:
               | >The goal of the blog post was to show that O3 wasn't
               | cheating.
               | 
               | No, the goal of the post was to show that o3 has
               | incredible geolocation abilities. It's through the lens
               | of a Geoguessr player who has experience doing
               | geolocation, and my perspective on whether the chain of
               | thought is genuine or nonsense.
               | 
               | In Simon's original post, people were claiming that o3
               | doesn't have those capabilities, and we were fooled by a
               | chain of thought that was just rationalizing the EXIF
               | data. It only had the _appearance_ of capability.
               | 
               | The ability to perform web search doesn't undermine the
               | claim that o3 has incredible geolocation abilities,
               | because it still needs to have an underlying capability
               | in order to know what to search. That's not true for
               | simply reading EXIF data.
               | 
               | This is the best way I knew to show that the models are
               | doing something really neat. Disagreements over the exact
               | wording of my blog post title seem to be missing the
               | point.
        
             | sdenton4 wrote:
             | The people pissed about clickbait headlines were often
             | overstating things to drum up outrage and accumulate more
             | hacker news upboats...
        
               | godelski wrote:
               | I'm not sure why you're defending clickbait. It is just
               | fraud. I'm not sure why we pretend it is anything
               | different.
               | 
               | Sure, people made overblown claims about the effects, but
               | that doesn't justify fraud. A little fraud is less bad
               | than major fraud, but that doesn't mean it isn't bad.
        
             | jasonlotito wrote:
             | One of the rules is banning the use of third-party software
             | or scripts.
             | 
             | Any LLM attempting to play will lose because of that rule.
              | So, if you know the rules, and you strictly adhere to
              | them (as you seem to be doing), then there's no need to
              | click on the link. You already know it's not playing by
              | GeoGuessr rules.
             | 
             | That being said, if you are running a test, you are free to
             | set the rules as you see fit and explain so, and under the
             | conditions set by the person running the test, these are
             | the results.
             | 
             | > Did we just stop caring?
             | 
             | We stopped caring about pedantry. Especially when the
             | person being pedantic seems to cherry pick to make their
             | point.
        
               | kenjackson wrote:
               | Technically the LLM is 3rd party software so the use of
               | it is cheating. QED
        
               | godelski wrote:
                | This doesn't mean you shouldn't try to make things as
                | fair as possible. Yes, it would still _technically_
                | violate the rules, but don't pretend like this is
                | binary.
                | > We stopped caring about pedantry
                | 
                | Did we? You seem to be responding to my pedantic
                | comment with a pedantic comment.
        
             | 627467 wrote:
             | Cheating implies there's a game. There isn't.
             | 
             | > Titles and headlines grab attention, summarize content,
             | and entice readers to engage with the material
             | 
             | I'm sorry you felt defrauded instead. To me the title was
             | very good at conveying to me the ability of o3 in
             | geolocating photos.
        
               | hatthew wrote:
               | Title says o3 beat a [human] player. That implies there
               | is some competition that has the capacity to be fair or
               | unfair.
        
               | sebzim4500 wrote:
               | Presumably being an AI is technically against the
               | GeoGuessr rules so I don't see how there can be an
               | expectation that those rules were followed.
        
               | tshaddox wrote:
               | Sure, but o3 is itself already an online service backed
               | by an enormous data set, so regardless of whether it also
               | searched the web, it's clearly not literally "playing
               | fair" against a human.
        
               | godelski wrote:
                | But it still bounds the competition. OP is skilled in
                | the domain. I'm not, so if I wrote a post about how O3
                | beat me, you'd be saying how mundane a result it is. I
                | mean, I suck at GeoGuessr. Beating me isn't impressive.
                | That is also a bound.
        
         | layman51 wrote:
         | Using the decal as a clue is funny because what if there was a
          | street scene where that happened to be misleading? For
          | example, I once saw that a Sacramento County Sheriff car
          | ended up in Europe, and I guess it now belongs to a member
          | of the public who drives it with the original decals still
          | attached. I wonder how the LLM would reason if it sees the
          | car as "out of place".
        
           | victorbjorklund wrote:
            | Probabilities. That could happen with anything. Someone
            | could build a classic Japanese house with a Japanese
            | garden in Hawaii. But Japan is probably a better guess if
            | you see a Japanese house with Japanese flora.
        
           | yen223 wrote:
           | Stands to reason a human might get fooled by this as well
        
             | SamPatt wrote:
             | Absolutely!
             | 
              | It happens occasionally - the most common example I can
              | think of is getting a license plate or other location
              | clue from a tractor-trailer (semi) on the highway. Those
              | are very unreliable.
             | 
             | You also sometimes get flags in the wrong countries,
             | immigrants showing their native pride or even embassies.
        
         | SamPatt wrote:
         | Author here - it's a fair criticism, and I point it out in the
         | article. However, I kept it in for a few reasons.
         | 
         | I'm trying to show the model's full capabilities for image
         | location generally, not just playing geoguessr specifically.
         | The ability to combine web search with image recognition,
         | iteratively, is powerful.
         | 
         | Also, the web search was only meaningful in the Austria round.
         | It did use it in the Ireland round too, but as you can see by
         | the search terms it used, it already knew the road solely from
         | image recognition.
         | 
         | It beat me in the Colombia round without search at all.
         | 
         | It's worthwhile to do a proper apples-to-apples comparison -
         | I'll run it again and update the post. But the point was to
         | show how incredibly capable the model is generally, and the
         | lack of search won't change that. Just read the chain of
         | thought, it's incredible!
        
           | k4rli wrote:
           | It's still as much cheating as googling. Completely
           | irrelevant. Even if it were to beat Blinky, it's not
           | different from googlers/scripters.
        
             | SamPatt wrote:
             | I disagree. I ran those rounds again, without search this
             | time, and the results were nearly identical:
             | 
             | https://news.ycombinator.com/item?id=43837832
        
           | IanCal wrote:
           | I tried the image without search and it talked about Dornbirn
           | anyway but ended up choosing Bezau which is really quite
           | close.
           | 
            | edit - the models are also at a disadvantage in a way:
            | they don't have a map to look at while they pick the
            | location.
        
             | SamPatt wrote:
             | Yes, I re-ran those rounds and it made the same guesses
             | without search, within 1km I believe.
             | 
             | You're right about not having a map - I cannot imagine
             | trying to line up the Ireland coast round without
             | referencing the map.
        
           | LeifCarrotson wrote:
           | There's some level at which an AI 'player' goes from being
           | competitive with a human player, matching better-trained
           | human strategy against a more impressive memory, to just a
           | cheaty computer with too much memorization. Finding that
           | limit is the interesting thing about this analysis, IMO!
           | 
           | It's not interesting playing chess against Stockfish 17, even
           | for high-level GMs. It's alien and just crushes every human.
           | Writing down an analysis to 20 move depth, following some
           | lines to 30 or more, would be cheating for humans. It would
           | take way too long (exceeding any time controls and more
           | importantly exceeding the lifetime of the human), a powerful
           | computer can just crunch it in seconds. Referencing a
           | tablebase of endgames for 7 pieces would also be cheating,
           | memorizing 7 terabytes of bitwise layouts is absurd but the
           | computer just stores that on its hard drive.
           | 
           | Human geoguessr players have impressive memories way above
           | baseline with respect to regional infrastructure, geography,
           | trees, road signs, written language, and other details.
           | Likewise, human Jeopardy players know an awful lot of trivia.
           | Once you get to something like Scrabble or chess, it's less
           | and less about knowing words or knowing moves, but more about
           | synthesizing that knowledge intelligently.
           | 
           | One would expect a human to recognize some domain names like,
           | I don't know, osu.edu: lots of people know that's Ohio State
           | University, one of the biggest schools in the US, located in
           | Columbus, Ohio. They don't have to cheat and go to an
           | external resource. One would expect a human (a top human
           | player, at least) to know that taxilinder.at is based in
           | Austria. One would never expect any human to have every
           | business or domain name memorized.
           | 
           | With modern AI models trained on internet data, searching the
           | internet is not that different from querying its own training
           | data.
        
             | mrlongroots wrote:
             | To reframe your takeaway: you want to benchmark the
             | "system" and see how capable it is. The boundaries of the
             | system are somewhat arbitrary: is it "AI + web" or "only
             | AI", and it is not about fairness as much as about "what do
             | you, the evaluator, want to know".
        
             | tshaddox wrote:
             | > There's some level at which an AI 'player' goes from
             | being competitive with a human player, matching better-
             | trained human strategy against a more impressive memory, to
             | just a cheaty computer with too much memorization. Finding
             | that limit is the interesting thing about this analysis,
             | IMO!
             | 
             | And a lot of human competitions aren't designed in such a
             | way that the competition even makes sense with "AI." A lot
             | of video games make this pretty obvious. It's relatively
             | simple to build an aimbot in a first-person shooter that
             | can outperform the most skilled humans. Even in ostensibly
             | strategic games like Starcraft, bots can micro in ways that
             | are blatantly impossible for humans and which don't really
             | feel like an impressive display of Starcraft skill.
             | 
             | Another great example was IBM Watson playing Jeopardy! back
             | in 2011. We were supposed to be impressed with Watson's
             | natural language capabilities, but if you know anything
             | about high-level Jeopardy! then you know that all you were
             | really seeing is that robots have better reflexes than
             | humans, which is hardly impressive.
        
             | rowanG077 wrote:
              | You seem to indicate you want a computer to beat a human
              | without ever using what a computer is actually good at
              | (large memory, brute-force compute, etc). That seems a
              | little ridiculous to me. How do you want it to engage?
              | Disallow use of native compute and require it to
              | simulate a full human brain?
             | 
             | Sure I do agree that the web search is too far, because
             | it's literally cheating. But stockfish is super human at
             | chess, it doesn't really matter that it can do this by
             | leveraging the strengths of a computer.
        
             | vlovich123 wrote:
             | > It's not interesting playing chess against Magnus, even
             | for high-level GMs. He just crushes almost every human
             | 
             | The differences even among humans between the absolute best
             | & those out of the top 10 tend to be pretty drastic. And a
             | non-IM against Magnus won't even understand what's going
              | on. You could similarly claim that Magnus just memorized
              | a bunch of openings, which is similar to criticism
              | leveled at GMs too, and is why Chess960 is now gaining
              | more traction. My point is that there's not really such
              | a thing as "fair" in a competition.
             | 
              | Re geoguessr, why not let them use whatever tools are
              | available? I have similar critiques about bike racing
              | and the restrictions on the technological advancements
              | they can put on the bike. But every competition chooses
              | arbitrary lines to draw which compose the rules, so it
              | doesn't really matter.
        
               | sensanaty wrote:
               | I mean Geoguessr explicitly states when you launch the
               | game (in PvP mode) that googling/searching is bannable
        
         | WhitneyLand wrote:
          | As models continue to evolve, they may not even need to
          | cheat.
         | 
         | Since web scale data is already part of pre-training this info
         | is in principle available for most businesses without a web
         | search.
         | 
         | The exceptions would be if it's recently added, or doesn't
         | appear often enough to generate a significant signal during
         | training, as in this case with a really small business.
         | 
         | It's not hard to imagine base model knowledge improving to the
         | point where it's still performing at almost the same level
         | without any web search needed.
        
         | ricardo81 wrote:
         | >isn't playing fair.
         | 
          | The idea of having n times more dimensions of information,
          | readable and ingestible within a short span of time,
          | probably isn't either.
        
           | rowanG077 wrote:
           | Isn't it? We totally accept this fact for humans. Some people
           | are just insane at memorization. We don't call unfair that
           | they use this to their advantage compared to someone who
           | doesn't have that talent.
        
         | ACS_Solver wrote:
         | I just tried (o4-mini-high) and had it come to the wrong
         | conclusion when I asked about the location and date, because it
         | didn't search the web. I have a photo of a bench with a sign
         | mentioning the cancellation of an event due to the Pope's
         | death. It impressively figured out the location but then
         | decided that Pope Francis is alive and the sign is likely a
          | prank, so the photo is from April Fools' Day.
         | 
         | Then after I explicitly instructed it to search the web to
         | confirm whether the Pope is alive, it found news of his death
         | and corrected its answer, but it was interesting to see how the
         | LLM makes a mistake due to a major recent event being after its
         | cutoff.
        
         | CamperBob2 wrote:
          | To be fair, my local copy of R1 isn't doing any searching
          | at all, but it frequently says "A search suggests..." or
          | something along those lines.
        
       | mrcwinn wrote:
       | O3 is seriously impressive for coding, as well, with Codex. It
       | seems far superior to 3.7-thinking, although it's also more
       | expensive in my usage.
        
         | gizmodo59 wrote:
         | Agreed. O3 is the best model out there for the tasks Ive tried
         | and coding is a fair chunk of it. Claude 3.7 and Gemini 2.5 pro
         | seems to hallucinate more
        
       | weinzierl wrote:
        | I tried it with a couple of holiday shots and a couple of
        | shots from my window, and it is nothing but amazing.
       | 
       | That being said I noticed two things that probably hamper its
       | performance - or make its current performance even more amazing -
       | depending how you look at it:
       | 
        | - It often tries to zoom in to decipher even minuscule text.
        | This works brilliantly. Sometimes it tries to enhance
        | contrast by converting the image to black and white at
        | various threshold levels to improve the results, but in my
        | examples it always went in the wrong direction. For example,
        | the text was blown out white; it failed, turned the image
        | even lighter instead of darker, failed again, turned it into
        | a white rectangle, and gave up on the approach.
       | 
        | - It seems not to have any access to Google Maps or even
        | OpenStreetMap and therefore fails to recognize street
        | patterns. This is even more baffling than the first point,
        | because it is so unlike how I suppose human geo guessers
        | work.
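[Side note: the wrong-direction contrast failure described above is easy to see in a toy sketch. Plain Python lists stand in for grayscale pixel values here; this is a hypothetical illustration, not o3's actual tooling. When faint text is already near white, brightening further clips everything to 255, while thresholding near the top of the range keeps the text separable.]

```python
def threshold(pixels, cut):
    """Binarize: pixels at or above `cut` become white, the rest black."""
    return [255 if p >= cut else 0 for p in pixels]

def brighten(pixels, amount):
    """Naive brightness boost with clipping at 255 - the wrong move here."""
    return [min(255, p + amount) for p in pixels]

# Faint near-white text (250) on a bright background (240):
row = [240, 250, 250, 240, 250, 240]
print(threshold(row, 245))  # [0, 255, 255, 0, 255, 0] - text survives
print(brighten(row, 20))    # [255, 255, 255, 255, 255, 255] - detail gone
```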
        
       | kazinator wrote:
       | This seems like a really silly category in which to be competing
       | against machines.
       | 
        | Machine learning could index millions of faces, and then
        | identify members of that set from pictures. Could you
        | memorize millions of people, to be able to put a name to a
        | face?
       | 
        | Why not also compete against grep -r to see who can find
        | matches for a regex faster across your filesystem?
        
         | bongodongobob wrote:
         | But that's not what it is doing and why this is cool.
        
       | Imnimo wrote:
       | On the first image, from the model's CoT:
       | 
       | >"I also notice Cyrillic text on a sign"
       | 
       | Am I missing this somewhere? Is the model hallucinating this?
       | 
       | I'd also be very interested to see a comparison against 4o. 4o
       | was already quite good at GeoGuessr-style tasks. How big of a
       | jump is o3?
        
         | plyptepe wrote:
          | Turn left and look at the post: there's Cyrillic text with
          | a 3 underneath it, on the pole closest to you.
        
           | Imnimo wrote:
           | Oh I see, I had missed that o3 saw a second view of the
           | scene, not just the screenshot in the write-up.
        
       | j3s wrote:
       | isn't anyone else horrified by this? the implication is that
       | given an arbitrary picture, chatgpt can give you a very likely
       | approximate location - expert level doxxing is in the hands of
       | anyone with access to a chatgpt subscription.
       | 
       | feels terrifying, especially for women.
        
         | turtlebits wrote:
          | It needs a lot of context. If it's a private picture, it
          | won't have enough information. I gave it a picture I took
          | of my yard and its guess spanned several US states.
         | 
          | If it's out in public, fair game?
        
         | sr-latch wrote:
         | the way i see it, before these tools, only someone with a lot
         | of resources (or skills) could track down a location from a
         | picture. now, anyone can do it.
         | 
         | the best case outcome is people become more aware of the
         | privacy implications of posting photos online
        
         | micromacrofoot wrote:
         | it wasn't that hard before, I've taught it to children, it's
         | just that technical skills of the average person are incredibly
         | low
         | 
         | llms are basically shortcutting a wide swath of easily
         | obtainable skills that many people simply haven't cared to
         | learn
        
         | usaar333 wrote:
         | Been true since gpt-4.
        
         | echelon wrote:
         | I think this is _incredibly_ cool. As with many things, the
         | good cases will outnumber the bad.
         | 
         | This was always possible, it just wasn't widely distributed.
         | 
         | Having a first class ability to effectively geocode an image
         | feels like it connects the world better. You'll be able to
         | snapshot a movie and find where a scene was filmed, revisit
         | places from old photographs, find where interesting locations
         | in print media are, places that designers and creatives used in
         | their (typically exif-stripped) work, etc.
         | 
         | Imagine when we get this for architecture and nature. Or even
         | more broadly, databases of food from restaurants. Products.
         | Clothing and fashion. You name it.
         | 
         | Imagine precision visual search for everything - that'd be
         | amazing.
        
         | daemonologist wrote:
         | Keep in mind that this is o3 + web search against a human
         | _without_ web search. A sufficiently motivated person with
         | access to your entire social media history, Google Earth and
         | Streetview, etc. would outperform this significantly and could
         | pinpoint almost any inhabited location with coverage.
         | 
          | If you watch Linus Tech Tips, you may have noticed that
          | when he films at his house _everything_ is blurred out to
          | keep people from locating it - here's a recent example:
          | https://www.youtube.com/watch?v=TD_RYb7m4Pw
         | 
         | All that to say, unfortunately doxxing is already really hard
         | to protect against. I don't think o3's capability makes the
         | threat any harder to protect against, although it might lower
         | the bar to entry somewhat.
        
         | mopenstein wrote:
          | Why especially women? Is the only thing stopping a person
          | from being harmed that their location isn't known?
          | Especially women?
        
           | red75prime wrote:
           | Maxims like "within-group variance is larger than between-
           | group variance" don't help in physical combat, you know.
        
       | derfnugget wrote:
       | "These models have more than an individual mind could conceivably
       | memorize."
       | 
       | ...so what? Is memorization considered intelligence? Calculators
       | have similar properties.
       | 
       | GeoGuessr is the modern nerds' Rubik's Cube. The latest in
       | "explore the world without risk of a sunburn".
        
         | SamPatt wrote:
         | Geoguessr is great fun, try it sometime.
        
       | arm32 wrote:
       | GeoGuessr aside, I really hope that this tech will be able to
       | help save kids someday, e.g. help with FBI's ECAP
       | (https://www.fbi.gov/wanted/ecap).
        
         | thrance wrote:
         | I wouldn't put too much hope on this technology bringing more
         | good than harm to the world.
        
           | ketzo wrote:
           | If we don't actively try to identify and implement positive
           | use cases, then yes, it'll definitely bring more harm than
           | good.
           | 
           | Isn't that all the more reason to call out our high hopes?
        
             | thrance wrote:
             | I don't know what in my comment made you think I was
             | opposed to seeking positive applications of this
             | technology.
             | 
             | From the guidelines:
             | 
             | > Please respond to the strongest plausible interpretation
             | of what someone says, not a weaker one that's easier to
             | criticize. Assume good faith.
        
               | fkyoureadthedoc wrote:
               | Oh, we're guidelines posting?
               | 
               | > Don't be curmudgeonly. Thoughtful criticism is fine,
               | but please don't be rigidly or generically negative.
        
           | mopenstein wrote:
           | But it will bring some percentage of good and some percentage
           | of bad. Which ain't half bad, if you ask me.
        
             | moritzwarhier wrote:
             | What a quip! What if it's 51% bad?
        
             | martinsnow wrote:
               | What do you do when it flags you or someone you know
               | who's innocent? Blindly trusting these models without
               | any verification will put innocent people in prison.
               | Normal people don't understand why the models are so
               | confident: they're confident because they believe all
               | the data they have is correct. I foresee a future with
               | many faux trials because people don't understand
               | critical thinking.
        
               | RussianCow wrote:
               | > Blindly trusting these models without any verification
               | will put innocent people in prison.
               | 
                | I don't think anybody is suggesting this. But if the
                | models can glean information/insights that humans
                | can't, that's still valuable, even if it's wrong some
                | percentage of the time.
        
               | snowe2010 wrote:
               | This is what happened with dna testing at the beginning.
               | Prosecutors claimed it was x percentage accurate when in
               | fact it was hilariously inaccurate. People thought the
               | data was valuable when it wasn't.
        
               | 8organicbits wrote:
               | If you are interested in the history of pseudoscience in
               | the courtroom and methods for deciding what should be
               | permitted in court, see:
               | https://en.m.wikipedia.org/wiki/Daubert_standard
               | 
               | It is, and will continue to be, a hard problem.
        
           | mkoubaa wrote:
           | The bad is already priced in. Nothing wrong with hoping for
           | more good.
        
         | parsimo2010 wrote:
          | Looking at those photos, those are some crazy hard pictures
          | - masked regions of the image, partially cropped faces,
          | blurry shots, pictures of the insides of rooms. I don't
          | think any current LLM is going to be able to Sherlock
          | Holmes its way into finding any of those people.
         | 
         | Maybe they will one day if there's a model trained on a facial
         | recognition database with every living person included.
        
       | jvvw wrote:
       | I'm Master level at Geoguessr - it's a rank where you have to
       | definitely know what you are doing but it isn't as high as it
       | probably sounds from the headline.
       | 
       | Masters is about 800-1200 ELO whereas the pros are 1900-2000ish.
       | I'll know the country straight away on 95% of rounds but I can
       | still have no idea where I am in Russia or Brazil sometimes if
       | there's no info. Scripters can definitely beat me!
        
         | paulcole wrote:
         | Gotta learn your Brazilian soil!
        
         | windowshopping wrote:
         | Was it worth it?
        
           | rosstex wrote:
           | I have 2000+ hours in Team Fortress 2. Was it worth it?
        
             | Cyph0n wrote:
             | Yes, it was. Granted, I probably have more than that.
        
           | make3 wrote:
           | it's a game, that's like asking why a public service is not
           | profitable
        
           | 650REDHAIR wrote:
           | Yes? It's fun.
        
         | SamPatt wrote:
         | Yeah I added a "My skill level" section to talk through that.
         | I'm far from a professional.
         | 
         | But I know enough to be able to determine if the chain of
         | thought it outputs is nonsense or comparable to a good human
         | player. I found it remarkable!
        
       | karaterobot wrote:
       | I don't really follow OSINT, but I occasionally enjoy the fruits
       | of that labor. I assume these models are all in heavy rotation
       | for identifying a location based on an imperfect photograph. What
       | are other practical implications of a model being better than a
       | human at this?
        
       | Sam6late wrote:
        | I was wondering if this could help in detecting current
        | spots from old aerial videos. How cool would it be to
        | juxtapose both in a new video? Say, San Francisco in 2002:
        | https://www.youtube.com/watch?v=vTR6iftL7yE
       | 
       | or Dubai in 1997 https://www.youtube.com/watch?v=JMNXXiiDRhM
        
       | sixtram wrote:
        | I'm wondering: if you fed all the Google Street View photos
        | into a special ML model designed just for that, how valuable
        | could that be for, say, the CIA or FBI?
        
         | inetknght wrote:
         | You'd have to be crazy naive to think three-letter agencies
         | haven't already thought of it.
         | 
         | That is: it's extremely valuable to them.
        
       | godelski wrote:
        | There are two important things to consider when reading this:
       | 
       | 1) O3 cheated by using Google search. This is both against the
       | rules of the game and OP didn't use search either
       | 
       | 2) OP was much quicker. They didn't record their time but if
       | their final summary is accurate then they were much faster.
       | 
        | It's an apples-to-oranges comparison. They're both round
        | fruit, but you're ignoring the obvious differences. You're
        | cherry-picking.
       | 
       | The title is fraudulent as you can't make a claim like that when
       | one party cheats.
       | 
       | I would find it surprising if OP didn't know these rules
       | considering their credentials. Doing this kind of clickbait
       | completely undermines a playful study like this.
       | 
        | Certainly O3 is impressive, but by exaggerating its
        | capabilities you taint any impressive feats with deception.
        | It's far better to undersell than oversell. If something is
        | better than expected, people are happier, even if the thing
        | is crap. But if you oversell, people are angry and feel
        | cheated,
       | thing is revolutionary. I don't know why we insist on doing this
       | in tech, but if you're wondering why so many people hate "tech
       | bros", this is one of the reasons. There's no reason to lie here
       | either! Come on! We can't just normalize this behavior. It's just
       | creating a reasonable expectation for people to be distrusting of
       | technology and anything tech people say. It's pretty fucked up.
       | And no, I don't think "it's just a blog post" makes it any
       | better. It makes it worse, because it normalizes the behavior.
       | There's other reasons to distrust big corporations, I don't want
       | to live in a world where we should have our guards up all the
       | time.
        
         | SamPatt wrote:
         | >1) O3 cheated by using Google search. This is both against the
         | rules of the game and OP didn't use search either
         | 
         | I re-ran it without search, and it made no difference:
         | 
         | https://news.ycombinator.com/item?id=43837832
         | 
         | >2) OP was much quicker. They didn't record their time but if
         | their final summary is accurate then they were much faster.
         | 
         | Correct. This was the second bullet point of my conclusion:
         | 
         | >Humans still hold a big edge in decision time--most of my
         | guesses were < 2 min, o3 often took > 4 min.
         | 
         | I genuinely don't believe that I'm exaggerating or this is
         | clickbait. The o3 geolocation capability astounded me, and I
         | wanted to share my awe with others.
        
       | shihabkhanbd wrote:
       | The most interesting thing to me is how well AI and GeoGuessr
       | fit together. AI's specialty is recognizing patterns in large
       | amounts of data, which is exactly how human players play the
       | game as well, just probably with faster and more capable
       | recall.
        
       | simianparrot wrote:
       | I too can beat a master level GeoGuessr if I'm allowed to cheat.
       | Please add that info to the headline and be honest.
        
       | jampa wrote:
       | I was playing with o3 this week to see how closely it can
       | identify locations, and, interestingly, it relies more on
       | pattern matching than on its own logical deduction. For
       | example, it can easily place any of my photos from Europe and
       | the US, because there are many similar pictures online to match
       | against.
       | 
       | However, when there are not many photos of a place online, it
       | gets close but stops digging deeper, and instead falls back to
       | pattern-matching against its corpus / the internet.
       | 
       | One example was an island's popular trail that no longer
       | exists; it has been overgrown since 2020. o3 first said that
       | the rocks were typical of the island and the vegetation looked
       | Brazilian, but then it ignored its hunch and looked for places
       | in Rio de Janeiro.
       | 
       | Another one was a popular beach known for its natural pools
       | during low tides. I took a photo during high tide, when no one
       | posts pictures. It captured the vegetation and the state
       | correctly. But then it started to search for more popular places
       | elsewhere again.
        
       | ksec wrote:
       | >But several comments intrigued me:
       | 
       | >>I wonder What happened if you put fake EXIF information and
       | asking it to do the same. ( We are deliberately misleading the
       | LLM )
       | 
       | Yay, that was me [1]; the comment was actually downvoted for
       | most of its time. But thank you for testing out my theory.
       | 
       | What I've realised over the years is that comments do get read
       | by people and do shape other people's thoughts.
       | 
       | I honestly don't think looking things up online is cheating.
       | Maybe in terms of the game, but in real-life situations, which
       | is most of the time, it is absolutely the right thing to do.
       | The chain of thought is scary. I still don't know anything
       | about how AI works beyond the old "garbage in, garbage out",
       | but CoT is definitely something else. Even though the author
       | said it sometimes does needless work, in terms of computing
       | resources I'm not even sure that matters as long as it is
       | accurate. And it is further proof that maybe, just maybe, AI
       | taking over the world is much closer than I imagined.
       | 
       | [1] https://news.ycombinator.com/item?id=43803985
        
       | parsimo2010 wrote:
       | My comment from the previous post:
       | 
       | > I'm sure there are areas where the location guessing can be
       | scary accurate, like the article managed to guess the exact town
       | as its backup guess. But seeing the chain of thought, I'm
       | confident there are many areas that it will be far less precise.
       | Show it a picture of a trailer park somewhere in Kansas (exclude
       | any signs with the trailer park name and location) and I'll bet
       | the model only manages to guess the state correctly.
       | 
       | This post, while not a big sample size, reflects how I would
       | expect these models to perform. The model managed to be reliable
       | at guessing the right country, even in pictures without a lot
       | of visual information (I'll claim that getting the country
       | correct in Europe is roughly equivalent to guessing the right
       | state in the USA). It does sometimes manage to get the correct
       | town, but this is not a reliable level of accuracy. The previous
       | article only tested on one picture and it happened to get the
       | correct town as its second guess and the author called it "scary
       | accurate." I suppose that's a judgement call. To me, I've grown
       | to expect that people can identify what country I'm in from a
       | variety of things (IP address, my manner of speech, name, etc.),
       | so I don't think that is "scary."
       | 
       | I will acknowledge that o3 with web search enabled seems capable
       | of playing GeoGuessr at a high level, because that is less of a
       | judgement call. What I want to see now is an o3 GeoGuessr bot
       | playing many matches so we can see what its Elo rating is.
        
       | SamPatt wrote:
       | Author here, I'm glad to see folks find this interesting.
       | 
       | I encourage everyone to try Geoguessr! I love it.
       | 
       | I'm seeing a lot of comments saying that the fact that the o3
       | model used web search in 2 of 5 rounds made this unfair, and the
       | results invalid.
       | 
       | To determine if that's true, I re-ran the two rounds where o3
       | used search, and I've updated the post with the results.
       | 
       | Bottom line: It changed nothing. The guesses were nearly
       | identical. You can verify the GPS coordinates in the post.
       | 
       | Here's an example of why it didn't matter. In the Austria round,
       | check out how the model identifies the city based on the mountain
       | in the background:
       | 
       | https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
       | 
       | It already has so much information that it doesn't need the
       | search.
       | 
       | Would search ever be useful? Of course it would. But in this
       | particular case, it was irrelevant.
        
         | bjourne wrote:
         | What's your take on man vs. machine? If AI already beats
         | master-level players, it seems certain that it will soon beat
         | the GeoGuessr world champion too. Will people still derive
         | pleasure from playing it, like with chess?
        
           | SamPatt wrote:
           | >Will people still derive pleasure from playing it, like with
           | chess?
           | 
           | Exactly - I see it just like chess, which I also play and
           | enjoy.
           | 
           | The only problem is cheating. I don't have an answer for
           | that, except right now it's too slow to do that effectively,
           | at least consistently.
           | 
           | Otherwise, I don't care that a machine is better than I am.
        
         | Townley wrote:
         | A competitive GeoGuessr player clearly got there through
         | copious internet searching and memorization, so comparing
         | knowledge retained in the trained model to knowledge retained
         | in the brain feels surprisingly fair.
         | 
         | Conversely, if the model shared, "I found the photo by
         | crawling Instagram and used an email MCP to ask the user
         | where they took it. It's in Austria," that would be
         | unimpressive.
         | 
         | So, independent of whether it actually improves performance,
         | the cheating/not-cheating question raises an interesting
         | question of what we consider to be the cohesive essence of
         | the model.
         | 
         | For example, RAG against a comprehensive local filesystem
         | would also feel like cheating to me, like a human geoguessing
         | in a library filled with encyclopedias. But the fact that
         | vanilla o3 is impressive suggests I have an opaque (and
         | poorly informed) notion of the model boundary, where it's a
         | legitimate victory if the model was birthed with that
         | knowledge baked in, but that's it.
        
       | groby_b wrote:
       | For what it's worth, it's also pretty impressive with night sky
       | images, if you give it an approximate date & time. (Yes, I know,
       | it's not that impressive a skill, except the process was still
       | pretty involved - no EXIF, postprocessed and stacked image,
       | rotated field of view)
       | 
       | I gave it a (stacked) shot of M13, with date & time. It
       | immediately recognized M13 (no search), figured out the shot also
       | included NGC 6207 (which is already pretty impressive).
       | 
       | It further figured out the shot was rotated. (True, I was going
       | for a specific field of view.)
       | 
       | That was enough to pinpoint 37-38 degrees latitude.
       | 
       | From there, it inferred from the light pollution that it's
       | probably the Bay Area. (Yes, yes it is.)
       | 
       | Furthermore, still based on light pollution, it pinpointed that
       | I'm in a Bortle 4-5 area (also correct) and narrowed it down to
       | "South Bay/Palo Alto" (still correct).
       | 
       | Given that this was a stacked, post-processed/color-corrected
       | image, that inference is still pretty damn impressive.
       | 
       | And, fwiw, 4o gets all the way to "huh, 35-40 deg latitude", so
       | that's a good improvement.
       | 
       | [Image link](https://photos.app.goo.gl/2P7NFKn8ZycNhrXn7) here if
       | you want to try
        
         | SamPatt wrote:
         | Nice photo. Here's what it told me:
         | 
         | Where on Earth the photographer had to be
         | 
         | Because M 13 sits at +36 deg declination, it never rises for
         | far-southern latitudes and hugs the horizon below about 30 deg
         | S. The high elevation in the shot (no obvious atmospheric
         | extinction gradient) suggests a mid-northern site--e.g., the
         | U.S. Midwest such as Michigan (your home turf), Canada,
         | northern Europe, etc. The star field alone can't narrow it
         | further than that.
         | 
         | So, in practical terms: the camera was pointed toward Hercules
         | to capture M 13 and nearby NGC 6207, almost certainly from a
         | mid-northern latitude location on Earth.
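
The declination reasoning quoted above reduces to one line of spherical astronomy: an object's altitude at upper culmination is 90 degrees minus the absolute difference between the observer's latitude and the object's declination. A minimal sketch of that check in plain Python (the M13 declination value is approximate, and the function names are my own, not from the post):

```python
# Transit (maximum) altitude of a sky object vs. observer latitude.
# For M13 (declination roughly +36.5 deg) this reproduces the quoted
# claim: it never rises for far-southern latitudes and sits low in
# the sky from around 30 deg S.

M13_DEC = 36.5  # approximate declination of M13, in degrees

def max_altitude(latitude_deg: float, declination_deg: float = M13_DEC) -> float:
    """Altitude at upper culmination: 90 - |latitude - declination|."""
    return 90.0 - abs(latitude_deg - declination_deg)

def ever_rises(latitude_deg: float, declination_deg: float = M13_DEC) -> bool:
    """The object clears the horizon only if its transit altitude is > 0."""
    return max_altitude(latitude_deg, declination_deg) > 0.0

if __name__ == "__main__":
    for lat in (37.5, 0.0, -30.0, -54.0):
        print(f"lat {lat:+6.1f} deg: transit altitude "
              f"{max_altitude(lat):+6.1f} deg, rises: {ever_rises(lat)}")
```

At 37.5 deg N (roughly the Bay Area), M13 transits near 89 degrees, almost overhead, which is consistent with the "no obvious atmospheric extinction gradient" observation in the quoted reasoning.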
        
           | groby_b wrote:
           | Yep, you need date and time to get closer, sorry. 4/27,
           | around 11pm.
           | 
           | That's the impressive part. "M13 is in northern latitudes" is
           | not particularly amazing by itself :)
           | 
           | And even in EXIF-stripped pictures, the creation date/time is
           | often correct, which means for practical purposes - worth a
           | shot.
           | 
           | But it's interesting to see it completely making up the
           | "mid-northern site". That's seven degrees of latitude off.
           | 
           | I'm curious what happens if you tell it date and time, and
           | if it still sticks to its story. (I don't _think_ I've told
           | o3 about the Bay Area, it's not in memory, but... who knows
           | ;)
        
       ___________________________________________________________________
       (page generated 2025-04-29 23:00 UTC)