[HN Gopher] Gemini 2.5
___________________________________________________________________
Gemini 2.5
Author : meetpateltech
Score : 523 points
Date : 2025-03-25 17:01 UTC (5 hours ago)
(HTM) web link (blog.google)
(TXT) w3m dump (blog.google)
| jasonpeacock wrote:
| Isn't every new AI model the "most <adjective>"?
|
| Nobody is going to say "Announcing Foobar 7.1 - not our best!"
| wigster wrote:
| gobble 2.0 - a bit of a turkey
| minimaxir wrote:
| GPT-4.5's announcement was the equivalent of that.
|
| "It beats all the benchmarks...but you really really don't want
| to use it."
| forbiddenvoid wrote:
| They even priced it so people would avoid using it. GPT-4.5's
| entire function was to be the anchor of keeping OpenAI in the
| news, to keep up the perception of releasing quickly.
| glenstein wrote:
| My assumption was that the pricing was because it really
| was that expensive for whatever reason. I'm keeping fingers
| crossed that they're going to do some kind of 4.5 mini at
| some point that will be more affordable.
| thethimble wrote:
| Except for GPT 4.5 and Claude 3.7 :/
| smilekzs wrote:
| You're not wrong, but that just means the <adjective> is where
| the bulk of information resides. The trade-off matters. Maybe
| it's a model with good enough quality but really cheap to
| serve. Maybe it's a model that only plays poker really well but
| sucks at everything else because it bluffs too much. Etc. etc.
| Maxatar wrote:
| Sure but that adjective matters. Could be cheapest,
| "intelligent", fastest, etc... it's rarely all three of them.
| zamadatix wrote:
| Stable Diffusion definitely had a few of those.
| SirMaster wrote:
| Same with new phones. The new phone is always the fastest cpu,
| gpu, and best camera ever!
| jharohit wrote:
| Why not enable Canvas for this model on Gemini.google.com?
| Arguably the weakest link of Canvas is the terrible code that
| Gemini 2.0 Flash writes for Canvas to run..
| aoeusnth1 wrote:
| I'm guessing it should be enabled eventually. @logankilpatrick
| thoughts?
| cj wrote:
| Slight tangent: Interesting that they use o3-mini as the
| comparison rather than o1.
|
| I've been using o1 almost exclusively for the past couple months
| and have been impressed to the point where I don't feel the need
| to "upgrade" for a better model.
|
| Are there benchmarks showing o3-mini performing better than o1?
| boldlybold wrote:
| I find o3 at least faster to get to the response I care about,
| anecdotally.
| FloorEgg wrote:
| I noticed this too, I have used both o1 and o3 mini
| extensively, and I have run many tests on my own problems and
| o1 solves one of my hardest prompts quite reliably but o3 is
| very inconsistent. So from my anecdotal experience o1 is a
| superior model in terms of capability.
|
| The fact they would exclude it from their benchmarks seems
| biased/desperate and makes me trust them less. They probably
| thought it was clever to leave o1 out, something like "o3 is
| the newest model, let's just compare against that", but I think
| for anyone paying attention that decision will backfire.
| jnd0 wrote:
| Probably because it is more similar to o3 in terms of
| size/parameters as well as price (although I would expect this
| to be at least half price)
| logicchains wrote:
| It's a reasonable comparison given it'll likely be priced
| similarly to o3-mini. I find o1 to be strictly better than
| o3-mini, but still use o3-mini for the majority of my agentic
| workflow because o1 is so much more expensive.
| PunchTornado wrote:
| Why would you compare against all the models from a competitor.
| You take their latest one that you can test. Openai or
| Anthropic don't compare against the whole Gemini family.
| kmod wrote:
| The benchmark numbers don't really mean anything -- Google says
| that Gemini 2.5 Pro has an AIME score of 86.7 which beats
| o3-mini's score of 86.5, but OpenAI's announcement post [1]
| said that o3-mini-high has a score of 87.3 which Gemini 2.5
| would lose to. The chart says "All numbers are sourced from
| providers' self-reported numbers" but the only mention of
| o3-mini having a score of 86.5 I could find was from this other
| source [2]
|
| [1] https://openai.com/index/openai-o3-mini/ [2]
| https://www.vals.ai/benchmarks/aime-2025-03-24
|
| You just have to use the models yourself and see. In my
| experience o3-mini is much worse than o1.
| vineyardmike wrote:
| I wonder what about this one gets the +0.5 to the name. IIRC the
| 2.0 model isn't particularly old yet. Is it purely marketing,
| does it represent new model structure, iteratively more training
| data over the base 2.0, new serving infrastructure, etc?
|
| I've always found the use of the *.5 naming kinda silly when it
| became a thing. When OpenAI released 3.5, they said they already
| had 4 underway at the time; they were just tweaking 3 to be better
| for ChatGPT. It felt like a scrappy startup name, and now it's
| spread across the industry. Anthropic naming their models Sonnet
| 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this
| naming scheme.
|
| I'm a much bigger fan of semver (not skipping to .5 though), date
| based ("Gemini Pro 2025"), or number + meaningful letter (eg 4o -
| "Omni") for model names.
| forbiddenvoid wrote:
| I would consider this a case of "expectation management"-based
| versioning. This is a release designed to keep Gemini in the
| news cycle, but it isn't a significant enough improvement to
| justify calling it Gemini 3.0.
| jstummbillig wrote:
| I think it's reasonable. The development process is just not
| really comparable to other software engineering: It's fairly
| clear that currently nobody _really_ has a good grasp on what
| a model will be while they are being trained. But they do
| have expectations. So you do the training, and then you
| assign the increment to align the two.
| KoolKat23 wrote:
| Funnily enough, from early indications (user feedback) this
| new model would've been worthy of the 3.0 moniker, despite
| what the benchmarks say.
| falcor84 wrote:
| Agreed, can't everyone just use semantic versioning, with 0.1
| increments for regular updates?
| laurentlb wrote:
| Regarding semantic versioning: what would constitute a
| breaking change?
|
| I think it makes sense to increase the major / minor numbers
| based on the importance of the release, but this is not
| semver.
| falcor84 wrote:
| As I see it, if it uses a similar training approach and is
| expected to be better in every regard, then it's a minor
| release. Whereas when they have a new approach where
| there might be some tradeoffs (e.g. longer runtime), it
| should be a major change. Or if it is very significantly
| different, then it should be considered an entirely
| differently named model.
| morkalork wrote:
| Or drop the pretext of version numbers entirely since they're
| meaningless here and go back to classics like Gemini
| Experience, Gemini: Millennium Edition or Gemini New
| Technology
| joaogui1 wrote:
| Would be confusing for non-tech people once you did x.9 ->
| x.10
| guelo wrote:
| What would a major version bump look like for an llm?
| Workaccount2 wrote:
| At least for OpenAI, a .5 increment indicates a 10x increase in
| training compute. This so far seems to track for 3.5, 4, 4.5.
| aoeusnth1 wrote:
| I think it's because of the big jump in coding benchmarks. 74%
| on aider is just much, much better than before and worthy of a
| .5 upgrade.
| alphabetting wrote:
| The elo jump and big benchmark gains could be justification
| ekojs wrote:
| > This will mark the first experimental model with higher rate
| limits + billing. Excited for this to land and for folks to
| really put the model through the paces!
|
| From https://x.com/OfficialLoganK/status/1904583353954882046
|
| The low rate-limit really hampered my usage of 2.0 Pro and the
| like. Interesting to see how this plays out.
| chipgap98 wrote:
| Any word on what that pricing is? I can't seem to find it
| rvnx wrote:
| Traditionally at Google experimental models are 100% free to
| use on https://aistudio.google.com (this is also where you
| can see the pricing) with a quite generous rate limit.
|
| This time, the Googler says: "good news! you will be charged
| for experimental models, though for now it's still free"
| chipgap98 wrote:
| Right but the tweet I was responding to says: "This will
| mark the first experimental model with higher rate limits +
| billing. Excited for this to land and for folks to really
| put the model through the paces!"
|
| I assumed that meant there was a paid version with a higher
| rate limit coming out today
| KoolKat23 wrote:
| I expect this might be pricier. Hoping not unusable level
| expensive.
| xnx wrote:
| Currently free, but only 50 requests/day.
| M4v3R wrote:
| The Long Context benchmark numbers seem super impressive. 91% vs
| 49% for GPT 4.5 at 128k context length.
| Workaccount2 wrote:
| Google has the upperhand here because they are not dependent on
| Nvidia for hardware. They make and use their own AI
| accelerators.
| alexdzm wrote:
| Keen to hear more about this benchmark. Is it representative of
| chat-to-document style usecases with big docs?
| sebzim4500 wrote:
| Looks like it's this benchmark [1]. It's certainly less
| artificial than most long context benchmarks (that are
| basically just a big lookup table) but probably not as
| representative as Fiction.LiveBench [2], which asks specific
| questions about works of fanfiction (which are typically
| excluded from training sets because they are basically porn).
|
| [1] https://arxiv.org/pdf/2409.12640
|
| [2] https://fiction.live/stories/Fiction-liveBench-
| Feb-20-2025/o...
| swyx wrote:
| "MRCR (multi-round coreference resolution)" for those
| looking for the link to Michaelangelo
| falcor84 wrote:
| I'm most impressed by the improvement on Aider Polyglot; I wasn't
| expecting it to get saturated so quickly.
|
| I'll be looking to see whether Google would be able to use this
| model (or an adapted version) to tackle ARC-AGI 2.
| serjester wrote:
| I wish they'd mention pricing - it's hard to seriously benchmark
| models when you have no idea what putting it in production would
| actually cost.
| impulser_ wrote:
| It's experimental. You shouldn't be using it in production.
| chipgap98 wrote:
| > This will mark the first experimental model with higher
| rate limits + billing. Excited for this to land and for folks
| to really put the model through the paces! From
| https://x.com/OfficialLoganK/status/1904583353954882046
| Oras wrote:
| These announcements have started to look like a template.
|
| - Our state-of-the-art model.
|
| - Benchmarks comparing to X,Y,Z.
|
| - "Better" reasoning.
|
| It might be an excellent model, but reading the exact text
| repeatedly is taking the excitement away.
| Mistletoe wrote:
| I'm sure the AI helps write the announcements.
| ototot wrote:
| Maybe they just asked Gemini 2.5 to write the announcement.
| cpeterso wrote:
| And it was trained on the previous announcements.
| xlbuttplug2 wrote:
| ... which were also written by earlier Gemini versions.
| belter wrote:
| LLMs all the way down
| cratermoon wrote:
| Not all the way. At the bottom are a bunch of unpaid
| writers and artists and a horde of low-paid mturk workers
| in Nigeria.
| rvnx wrote:
| which was written by ChatGPT3.5
| andai wrote:
| Well hey, OpenAI did the exact opposite, and nobody liked that
| either.
| Oras wrote:
| How did you measure "nobody" liked OpenAI announcements?
| exitb wrote:
| I think people were fine with OpenAI demos. They were less
| fine with not actually ever releasing the demoed tech.
| emptysongglass wrote:
| I liked their announcements and demos and continue to like
| them.
| bhouston wrote:
| > It might be an excellent model, but reading the exact text
| repeatedly is taking the excitement away.
|
| This is the commodification of models. There is nothing special
| about the new models beyond performing better on the benchmarks.
|
| They are all interchangeable. This is great for users as it
| adds to price pressure.
| flir wrote:
| Man, I hope those benchmarks actually measure something.
| Legend2440 wrote:
| I would say they are a fairly good measure of how well the
| model has integrated information from pretraining.
|
| They are not so good at measuring reasoning, out-of-domain
| performance, or creativity.
| Workaccount2 wrote:
| Sooner or later someone is going to find "secret sauce" that
| provides a step-up in capability, and it will be closely
| guarded by whoever finds it.
|
| As big players look to start monetizing, they are going to
| desperately be searching for moats.
| bangaladore wrote:
| Reasoning was supposed to be that for "Open" AI, that's why
| they go to such lengths to hide the reasoning output. Look
| how that turned out.
|
| Right now, in my opinion, OpenAI actually has a useful deep
| research feature which I've found nobody else matches. But
| there is no moat to be seen there.
| CamperBob2 wrote:
| If you've seen DeepSeek R1's <think> output, you'll
| understand why OpenAI hides their own. It can be pretty
| "unsafe" relative to their squeaky-clean public image.
| cratermoon wrote:
| Sooner or later someone is going to find the "secret sauce"
| that allows building a stepladder tall enough to reach the
| moon.
|
| It's called the "first step fallacy", and AI hype believers
| continue to fall for it.
| diego_sandoval wrote:
| I take this as a good thing, because they're beating each other
| every few weeks and using benchmarks as evidence.
|
| If these companies start failing to beat the competition, then
| we should prepare ourselves for very creative writing in the
| announcements.
| gtirloni wrote:
| The improvements have been marginal at best. I wouldn't call
| that beating.
| devsda wrote:
| Reminds me of how nobody is too excited about flagship mobile
| launches anymore. Most flagships for some time now are just
| incremental updates over previous gen and only marginally
| better. Couple that with the Chinese OEMs launching better or
| good enough devices at a lower price point, new launches from
| established players are not noteworthy anymore.
|
| It's interesting how the recent AI announcements are following
| the same trend over a smaller timeframe.
| flakiness wrote:
| Phones are limited by hardware manufacturing, plus maybe the
| annual shopping cycle peaking at Christmas. People wouldn't have
| bought multiple iPhones a year even in their heyday.
|
| These LLM models were supposedly limited by the training run,
| but these point-version models are mostly post-training
| driven, which seems to be taking less time.
|
| If models were tied to specific hardware (say, an "AI PC" or
| whatever), the cycle would get slower and we'd get a slower
| summer, which I'm secretly wishing for.
| breppp wrote:
| I think the greatest issue with buying a new phone today is
| ironically the seamless migration.
|
| once you get all your apps, wallpaper, shortcut order and
| same OS, you really quickly get the feeling you spent $1000
| for the exact same thing
| atonse wrote:
| 100% agree with you.
|
| But it needs to be seamless to remove any friction from the
| purchase, but at the same time if it feels the same then we
| felt like we wasted money.
|
| So what I usually do is buy a different colored phone and
| change the wallpaper.
|
| My MacBook was the same. Seamless transition and 2 hours
| later I was used to the new m4 speeds.
| schainks wrote:
| I wish I wish I wish Google put better marketing into these
| releases. I've moved entire workflows to Gemini because it's
| just _way_ better than what OpenAI has to offer, especially for
| the money.
|
| Also, I think Google's winning the race on actually integrating
| the AI to do useful things. The agent demo from OpenAI is
| interesting, but frankly, I don't care to watch the machine use
| my computer. A real virtual assistant can browse the web
| headless and pick flights or food for me. That's the real
| workflow unlock, IMO.
| belter wrote:
| Cancelled my account a long time ago. Gemini models are like a
| McDonald's croissant. You always give them an extra chance, but
| they always fall apart in your hands...
| tibbar wrote:
| For me, the most exciting part is the improved long-context
| performance. A lot of enterprise/RAG applications rely on
| synthesizing a bunch of possibly relevant data. Let's just say
| it's clearly a bottleneck in current models and I would expect
| to see a meaningful % improvement in various internal
| applications if long-context reasoning is up. Gemini was
| already one of my favorite models for this usecase.
|
| So, I think these results are very interesting, if you know
| what features specifically you are using.
| zwaps wrote:
| But they score it on their own benchmark, on which
| coincidentally Gemini models always were the only good ones.
| In Nolima or Babilong we see that Gemini models still can't do
| long context.
|
| Excited to see if it works this time.
| taf2 wrote:
| We have incrementally improved 1% better than we were
| yesterday. Our competition is 1 day behind us now.
| kh_hk wrote:
| Was going to comment the same thing, which has been bugging me
| off lately on all announcements that start with "our" followed
| by empty superlatives. Happy to not be alone on this!
| cratermoon wrote:
| Glaringly missing from the announcements: concrete use cases
| and products.
|
| The Achilles heel of LLMs is the distinct lack of practical
| real-world applications. Yes, Google and Microsoft have been
| shoving the tech into everything they can fit, but that doesn't
| a product make.
| sebzim4500 wrote:
| ChatGPT has like 500M weekly active users, what are you on
| about?
| jnd0 wrote:
| > with Gemini 2.5, we've achieved a new level of performance by
| combining a significantly enhanced base model with improved post-
| training. Going forward, we're building these thinking
| capabilities directly into all of our models, so they can handle
| more complex problems and support even more capable, context-
| aware agents.
|
| Been playing around with it and it feels intelligent and up to
| date. Plus it's connected to the internet, and it reasons by
| default when it needs to.
|
| I hope they enable support for the recently released canvas mode
| for this model soon; it will be a good match.
| Workaccount2 wrote:
| It is almost certainly the "nebula" model on LMArena that has
| been generating buzz for the last few days. I didn't test
| coding, but its reasoning is very strong.
| joelthelion wrote:
| Is this model going to be restricted to paying users?
| vivzkestrel wrote:
| "Hi, here is our new AI model. It performs task A x% better than
| competitor 1 and task B y% better than competitor 2" seems to be
| the new hot AI template in town.
| andai wrote:
| Can anyone share what they're doing with reasoning models? They
| seem to only make a difference with novel programming problems,
| like Advent of Code. So this model will help solve slightly
| harder advent of codes.
|
| By extension it should also be slightly more helpful for
| research, R&D?
| bravura wrote:
| We're using it to RCA infrastructure incidents.
| fn-mote wrote:
| Seriously? That doesn't require a human?! Are we talking
| about some kind of "generic" incident? (Type 3: forgot to
| manually update the xxxx file.) Or what's going on?
| liuliu wrote:
| I found reasoning models are much more faithful at text-related
| tasks too (e.g. 1. translating long key-value pairs such as
| Localizable.strings; 2. long transcript fixing and
| verification; 3. looking at csv/tabular data and fixing it), probably
| due to the reflection mechanism built into these reasoning
| models. Prompts such as "check your output to make sure it
| covers everything in the input" let the model double-check its
| work, avoiding more manual checks on my end.
| mgens wrote:
| Have been using them for non-interactive coding where latency
| is not an issue. Specifically, turning a set of many free-text
| requirements into SQL statements, so that later when an item's
| data is entered into the system, we can efficiently find which
| requirements it meets. The reasoning models' output quality is
| much better than the non-reasoning models like 3.5 Sonnet, it's
| not a subtle difference.
| throwaway13337 wrote:
| Google has this habit of 'releasing' without releasing AI models.
| This looks to be the same?
|
| I don't see it on the API price list:
|
| https://ai.google.dev/gemini-api/docs/pricing
|
| I can imagine that it's not so interesting to most of us until we
| can try it with cursor.
|
| I look forward to doing so when it's out. That Aider bench mixed
| with the speed and a long context window that their other models
| are known for could be a great mix. But we'll have to wait and
| see.
|
| More generally, it would be nice for these kinds of releases to
| also add speed and context window as a separate benchmark. Or
| somehow include it in the score. A model that is 90% as good as
| the best but 10x faster is quite a bit more useful.
|
| These might be hard to mix to an overall score but they're
| critical for understanding usefulness.
| minimaxir wrote:
| It's available now as an option in Google AI Studio and Google
| Gemini.
| ototot wrote:
| and the price is 0.0 usd, lol
| throwaway13337 wrote:
| Thanks. I think my post lacked clarity of what I was talking
| about. I meant that most people care about API access to use
| with their favorite editor. It's a big limiter with grok, for
| example.
|
| But I did mingle that with my knowledge of google's history
| of releasing without releasing these models which, as you
| point out, isn't true with this release.
| kmod wrote:
| It's "experimental", which means that it is not fully
| released. In particular, the "experimental" tag means that it
| is subject to a different privacy policy and that they
| reserve the right to train on your prompts.
|
| 2.0 Pro is also still "experimental" so I agree with GP that
| it's pretty odd that they are "releasing" the next version
| despite never having gotten to fully releasing the previous
| version.
| barrenko wrote:
| The incumbent has awoken.
| andai wrote:
| How does Gemini have such a big context window?
|
| I thought memory requirement grows exponentially with context
| size?
| mmoskal wrote:
| Memory grows linearly, compute grows quadratically (but with
| small constant - until ~100k the inference will still be
| dominated by non-quadratic factors).
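To make the linear memory growth concrete, here's a back-of-the-envelope KV-cache calculation. The hyperparameters below (layer count, KV heads, head dimension) are illustrative guesses, not Gemini's actual architecture:

```python
def kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8,
                   head_dim=128, dtype_bytes=2):
    # Two cached tensors (K and V) per layer, each of shape
    # [n_kv_heads, seq_len, head_dim] at dtype_bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Doubling the context doubles the cache: growth is linear, not exponential.
for ctx in (8_192, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

With these made-up numbers the cache is 1.5 GiB at 8k tokens, 24 GiB at 128k, and 192 GiB at 1M: a million-token context is demanding, but nowhere near exponential.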
| xiphias2 wrote:
| Also reusing key/values for different queries can compress
| the KV cache; it can be a 1000x or 10000x improvement in
| bandwidth if the model is trained for it.
| mmoskal wrote:
| Just to clarify: simple prefix KV cache doesn't require any
| special model training. It does require the inference
| framework to support it, but most do by now.
|
| You can see dramatic improvements in latency and throughput
| if there is a large shared prefix of the queries.
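The shared-prefix reuse described above can be sketched with a toy cache (my own illustration, not any real inference framework's API); the "KV state" here is a stand-in for real key/value tensors:

```python
cache = {}   # maps a token prefix (as a tuple) -> its computed "KV" state
work = 0     # counts per-token compute steps actually performed

def encode(tokens):
    global work
    state = []
    for i in range(len(tokens), 0, -1):        # longest cached prefix wins
        if tuple(tokens[:i]) in cache:
            state = list(cache[tuple(tokens[:i])])
            break
    for j in range(len(state), len(tokens)):   # compute only the suffix
        work += 1
        state.append(("kv", tokens[j]))        # stand-in for real K/V tensors
        cache[tuple(tokens[:j + 1])] = list(state)
    return state

shared = ["You", "are", "a", "helpful", "assistant", "."]
encode(shared + ["Hi"])
first = work
encode(shared + ["Bye"])   # reuses the cached 6-token prefix
print(work - first)        # 1: only the new token needed computing
```

The second query pays for one token instead of seven, which is where the dramatic latency and throughput gains on shared prefixes come from.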
| ototot wrote:
| The Gemini 1.5 tech report does reference some papers about
| supporting large context window.
| throitallaway wrote:
| Funnyish story: the other night I asked my Pixel 9 to generate
| an image via Gemini, then I asked it to make a change. It
| didn't consider the previous context, so I asked it "Are you
| capable of keeping context?" No matter how clearly I enunciated
| "context", it always interpreted what I was saying as
| "contacts." After the 4th try, I said "context, spelled "c-o-n-
| t-e-x-t" and it replied with "Ah, you meant context! Yes..."
|
| This stuff has a long way to go.
| andai wrote:
| I noticed Gemini Flash 2.0 making a lot of phonetic typos
| like that, yeah. Like instead of Basal Ganglia it said Basil
| Ganglia.
|
| I've also had it switch languages in the middle of output...
| like one word in the middle of a sentence was randomly output
| in some strange hieroglyphs, but when I translated them, it
| was the right word and the sentence made sense.
| dcchambers wrote:
| I was using the conversational feature of Gemini on my phone
| the other night and was trying to get it to read a blog post
| to me. The AI proceeded to tell me (out loud, via voice
| mode/speech synthesis) that it was a text based model and
| couldn't read text out loud.
|
| For as amazing as these things are, AGI they are not.
| vanviegen wrote:
| In its defense: it probably _is_ just a text model that
| hasn't been told that its output is being read to the
| user.
| Workaccount2 wrote:
| I think Google is digging a hole for themselves by making
| their lightweight models be the most used model. Regardless
| of what their heavy weight models can do, people will
| naturally associate them with their search model or assistant
| model.
| semiinfinitely wrote:
| TPUs have a network topology better suited for long context
| than gpus: https://jax-ml.github.io/scaling-book/tpus/#tpu-
| networking
|
| > This nearest-neighbor connectivity is a key difference
| between TPUs and GPUs. GPUs connect up to 256 H100s in an all-
| to-all configuration (called a node), rather than using local
| connections. On the one hand, that means GPUs can send
| arbitrary data within a node in a single low-latency hop. On
| the other hand, TPUs are dramatically cheaper and simpler to
| wire together, and can scale to much larger topologies because
| the number of links per device is constant.
| swyx wrote:
| https://supaiku.com/attention-is-logarithmic
| og_kalu wrote:
| From the 2.0 line, the Gemini models have been far better at
| engineering-type questions (fluids etc.) than GPT or Claude,
| especially with questions that include images requiring more
| than just grabbing text. This is even better.
| andrewinardeer wrote:
| Google is overly cautious with their guardrails.
|
| Granted, Gemini answers it now, however, this one left me shaking
| my head.
|
| https://cdn.horizon.pics/PzkqfxGLqU.jpg
| slongfield wrote:
| For better or worse, Google gets more bad press when their
| models get things wrong compared to smaller AI labs.
| rfoo wrote:
| Ha, I still remember that super hilarious "You are under 18, so
| you should not write C++, as it is unsafe..." log from ... a
| year ago?
| summerlight wrote:
| Looks like they're gradually removing guardrails, it returns
| Nixon for me.
| daquisu wrote:
| Weird, they released Gemini 2.5 but I still can't use 2.0 pro
| with a reasonable rate limit (5 RPM currently).
| noisy_boy wrote:
| Are Gemini and Bard the same? I asked it a question and it said "...
| areas where I, as Bard, have..."
| zxilly wrote:
| There is no point in asking such questions, the model doesn't
| know what it is on its own, and you could get many different
| answers if you repeat it a few more times.
| noisy_boy wrote:
| Are Gemini and Bard same? I asked it a question and it said "...
| areas where I, as Bard, have...."
| guyzero wrote:
| Normal Google rollout process: Bard is deprecated, Gemini is
| not ready yet.
| comex wrote:
| I was recently trying to replicate ClaudePlaysPokemon (which uses
| Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly
| getting confused and hallucinating significantly more than
| Claude, making it unviable (although some of that might be caused
| by my different setup). I wonder if this new model will do
| better. But I can't easily test it: for now, even paid users are
| apparently limited to 50 requests per day [1], which is not
| really enough when every step in the game is a request. Maybe
| I'll try it anyway, but really I need to wait for them to
| "introduce pricing in the coming weeks".
|
| Edit: I did try it anyway and so far the new model is having
| similar hallucinations. I really need to test my code with Claude
| 3.7 as a control, to see if it approaches the real
| ClaudePlaysPokemon's semi-competence.
|
| Edit 2: Here's the log if anyone is curious. For some reason it's
| letting me make more requests than the stated rate limit. Note
| how at 11:27:11 it hallucinates on-screen text, and earlier it
| thinks some random offscreen tile is the stairs. Yes, I'm sure
| this is the right model: gemini-2.5-pro-exp-03-25.
|
| https://a.qoid.us/20250325/
|
| [1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1
| comex wrote:
| Update: I tried a different version of the prompt and it's
| doing really well! Well, so far it's gotten out of its house
| and into Professor Oak's lab, which is not so impressive
| compared to ClaudePlaysPokemon, but it's a lot more than Gemini
| 2.0 was able to do with the same prompt.
| ototot wrote:
| And OpenAI is announcing their ImageGen in 4o
|
| https://news.ycombinator.com/item?id=43474112
| malisper wrote:
| I've been using a math puzzle as a way to benchmark the different
| models. The math puzzle took me ~3 days to solve with a computer.
| A math major I know took about a day to solve it by hand.
|
| Gemini 2.5 is the first model I tested that was able to solve it
| and it one-shotted it. I think it's not an exaggeration to say
| LLMs are now better than 95+% of the population at mathematical
| reasoning.
|
| For those curious the riddle is: There's three people in a
| circle. Each person has a positive integer floating above their
| heads, such that each person can see the other two numbers but
| not his own. The sum of two of the numbers is equal to the third.
| The first person is asked for his number, and he says that he
| doesn't know. The second person is asked for his number, and he
| says that he doesn't know. The third person is asked for his
| number, and he says that he doesn't know. Then, the first person
| is asked for his number again, and he says: 65. What is the
| product of the three numbers?
| bbstats wrote:
| Could you explain "The sum of two of the numbers is equal to
| the third"??
| malisper wrote:
| if the three numbers are a, b, and c, then either a+b=c,
| a+c=b, or b+c=a
| bena wrote:
| And they must all be positive integers.
|
| So A + B = C and A + C = B. But we know that A + B = C, so
| we can replace C with (A + B). So we know that A + A + B =
| B.
|
| So 2A + B = B. Or 2A = 0.
|
| And this holds any way you slice it.
|
| Even if you were to try and brute force it.
|
| A = 1
|
| B = 2
|
| Then C = 3. But A + C has to equal B. That's 1 + 3 = 2?
| That's not true.
|
| I don't see a case where you can add to the sum of two
| numbers one of the numbers and get the other number.
|
| I'm guessing that's a misreading of the problem. Because it
| looks like the third number is the sum of the first two.
| refulgentis wrote:
| _One of_ the cases has to be true, not all 3. (as you
| show, they're mutually exclusive for positive integers)
| i.e. "either" is important in the parent comment.
| bena wrote:
| Which is why I indicated that it would be a misreading of
| the problem.
|
| The original problem is a little ambiguously worded. You
| could say "one of their numbers is the sum of the other
| two" and it would be a little clearer.
| thaumasiotes wrote:
| > The original problem is a little ambiguously worded.
|
| No it isn't. If it said "the sum of any two of the
| numbers is equal to the third", that would be a
| contradiction. What it says is "the sum of two of the
| numbers is equal to the third".
| refulgentis wrote:
| Given #s x,y, and z, either x + y = z, x + z = y, or y + z =
| x.
| rappatic wrote:
| I think:
|
| Call the three numbers a, b, and c. This means c = a + b, but
| we still don't know to which person each number belongs.
|
| When person 1 (p1) is asked what his number is, he has no way
| to know whether he has a, b, or c, so he says he doesn't
| know. Same goes for p2 and p3. Clearly p1 somehow gains
| information by p2 and p3 passing. Either he realizes that he
| must be either a or b, and such his number is the difference
| between p2 and p3's numbers, or he realizes that he must be c
| and so his number is the sum of p2 and p3's numbers.
|
| That's all I have so far. Anyone have other ideas?
| aardvarkr wrote:
| I think it has something to do with applying the lower
| bound of 1.
|
| If p1 KNOWS that he's the largest then he has to have
| gained some other piece of information. Say the numbers he
| sees are 32 and 33. His number would have to be either 1 or
| 65. If p1 was 1 then the other two would have known p1
| couldn't be the sum of the other two
| oezi wrote:
| But p2 and p3 don't yet know what they are themselves
| just because they see a 1:
|
| If p2 sees 1 and 33, s/he would wonder if s/he is 32 or
| 34.
|
| P3 would consider 31 or 33.
| bena wrote:
| The answer is online and it's clever.
|
| P1 knows that P2 and P3 are not equal. So they know that
| the set isn't [2A, A, A].
|
| P2 knows that P1 and P3 are not equal. So they know that
| the set isn't [A, 2A, A]. They also know that if P1 doesn't
| know, then they were able to make the same deduction. So
| they now know that both [2A, A, A] and [A, 2A, A] aren't
| correct. Since they know that [2A, A, A] isn't correct,
| they can also know that [2A, 3A, A] isn't correct either.
| Because they'd be able to see if P1 = 2A and P3 = A, and if
| that were true and P1 doesn't know their number, it would
| have to be because P2 isn't A. And if P2 isn't A, they'd
| have to be 3A.
|
| P3 knows that P1 and P2 aren't equal. Eliminates [A, A,
| 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A],
| are eliminated. Using the same process as P2, they can
| eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A].
| Because they can see the numbers and they know if P1 is 2A
| and P2 is 3A.
|
| Now we're back at P1. Who now knows.
|
| So P2 and P3 are in the eliminated sets. Which means we're
| one of these
|
| [2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A,
| 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
|
| We know his number is 65. To find the set, we can factor
| 65: (5 * 13). We can check the other numbers 2(13) = 26.
| 3(13) = 39. And technically, you don't need to find the
| other numbers. The final answer is 5A * 2A * 3A or (A^3) *
| 30.
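bena's round-by-round elimination can be checked mechanically with a small simulation. This is a sketch: the search bound M = 250 is an assumption, chosen large enough that edge effects don't distort the reasoning for triples around 65, and only the claimed solution is verified (not uniqueness over all integers).

```python
from collections import defaultdict

M = 250  # assumed search bound, comfortably above any value considered

# All orderings (p1, p2, p3) where one number is the sum of the other two.
cands = set()
for a in range(1, M):
    for b in range(1, M - a + 1):
        s = a + b
        cands |= {(s, a, b), (a, s, b), (a, b, s)}

def possible_own(cands):
    """For each player i and each view of the other two numbers,
    the set of values player i could hold."""
    views = [defaultdict(set) for _ in range(3)]
    for t in cands:
        for i in range(3):
            views[i][t[:i] + t[i + 1:]].add(t[i])
    return views

# Players 1, 2 and 3 each say "I don't know" in turn: drop every
# triple in which that player could already have deduced their number.
for i in range(3):
    views = possible_own(cands)
    cands = {t for t in cands if len(views[i][t[:i] + t[i + 1:]]) > 1}

# On the second ask, player 1 *does* know, and their number is 65.
views = possible_own(cands)
answer = (65, 26, 39)
assert answer in cands and views[0][(26, 39)] == {65}
print(65 * 26 * 39)  # 65910
```

After the three passes, a player seeing 26 and 39 has exactly one surviving candidate for their own number, 65, which reproduces the product 65910 quoted in the thread.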
| byearthithatius wrote:
| "Which means we're one of these [2A, A, A]; [3A, 2A, A];
| [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A,
| 3A, 5A]"
|
| Why? Couldn't it be an infinite number of 3 size arrays
| comprised of A where two elements sum to the third? [24A,
| 13A, 11A]? How did we deduce this set of arrays?
|
| EDIT: Solved from another reddit comment. Tuples without
| a common factor like the one above are considered as a=1.
|
| "They're not eliminated; they correspond to a = 1."
| g105b wrote:
| Please can you enlighten me, I'm a mathematical plebeian?
| sebzim4500 wrote:
| This is a great riddle. Unfortunately, I was easily able to
| find the exact question with a solution (albeit with a
| different number) online, thus it will have been in the
| training set.
| varispeed wrote:
| Seems like we might need a section of internet that is off
| limits to robots.
| Centigonal wrote:
| Everyone with limited bandwidth has been trying to limit
| site access to robots. The latest generation of AI web
| scrapers are _brutal_ and do not respect robots.txt
| varispeed wrote:
| There are websites where you can only register to in
| person and have two existing members vouch for you.
| Probably still can be gamed, but sounds like a great
| barrier to entry for robots (for now).
| tmpz22 wrote:
| What prevents someone from getting access and then
| running an authenticated headless browser to scoop the
| data?
| varispeed wrote:
| Admins will see unusual traffic from that account and
| then take action. Of course it won't be perfect, as
| there could be a way to mimic human traffic and slowly
| scrape the data anyway; that's why there is an element of
| trust (two existing members to vouch).
| baq wrote:
| It's here and it's called discord.
| Zandikar wrote:
| I have bad news for you if you think non-paywalled,
| non-phone#-required Discord communities are immune to AI
| scraping, especially as it costs less than hammering
| traditional websites: the push-on-change event is done for
| you in real-time chat contexts.
|
| Especially as the company archives all those chats (not
| sure how long) and is small enough that a billion-dollar
| "data sharing" agreement would be a very enticing offer.
|
| If there isn't a significant barrier to access, it's
| being scraped. And if that barrier is money, it's being
| scraped but less often.
| Davidzheng wrote:
| Honestly someone should scrape the algebraic topology
| Discord for AI; it'll be a nice training set
| kylebenzle wrote:
| Or we could just accept that LLMs can only output what we
| have put in and calling them, "AI" was a misnomer from day
| one.
| Workaccount2 wrote:
| What makes this interesting is that while the question is
| online (on reddit, from 10 years ago) other models don't get
| the answer right. Gemini also shows its work and it seems to
| do a few orders of magnitude more calculating than the
| elegant answer given on reddit.
|
| Granted this is all way over my head, but the solution gemini
| comes to matches the one given on reddit (and now here in
| future training runs)
|
| 65x26x39=65910
| sebzim4500 wrote:
| >Gemini also shows its work and it seems to do a few
| orders of magnitude more calculating than the elegant
| answer given on reddit.
|
| I don't think Gemini does an unnecessary amount of
| computation, it's just more verbose. This is typical of
| reasoning models, almost every step is necessary but many
| would not be written down by a human.
| hmottestad wrote:
| Is the answer somehow {65, 20, 45} with the product 58,500?
|
| That's one-shot for o1 pro.
| sebzim4500 wrote:
| That's wrong. From player 1's perspective {25 20 45} is
| entirely consistent with the calls made in the first three
| rounds.
| refulgentis wrote:
| In general I find commentary here _too_ negative on AI, but
| I'm a bit squeamish about maximalist claims re: AI
| mathematical reasoning vs. human population based off this,
| even setting aside lottery-ticket-hypothesis-like concerns.
|
| It's a common logic puzzle; Google can't turn up an _exact_
| match to the wording you have, but see e.g. here:
| https://www.futilitycloset.com/2018/03/03/three-hat-problem/
| TrackerFF wrote:
| The riddle certainly nerd-sniped GPT 4.5
|
| After a couple of minutes it decided on the answer being 65000.
| (S = {65, 40, 25})
| semiinfinitely wrote:
| I love how the entire comment section is getting one-shotted by
| your math riddle instead of the original post topic.
| hmottestad wrote:
| This looks like it's been posted on Reddit 10 years ago:
|
| https://www.reddit.com/r/math/comments/32m611/logic_question...
|
| So it's likely that it's part of the training data by now.
| canucker2016 wrote:
| You'd think so, but both Google's AI Overview and Bing's
| CoPilot output wrong answers.
|
| Google spits out: "The product of the three numbers is 10,225
| (65 * 20 * 8). The three numbers are 65, 20, and 8."
|
| Whoa. Math is not AI's strong suit...
|
| Bing spits out: "The solution to the three people in a circle
| puzzle is that all three people are wearing red hats."
|
| Hats???
|
| Same text was used for both prompts (all the text after 'For
| those curious the riddle is:' in the GP comment), so Bing
| just goes off the rails.
| moritzwarhier wrote:
| That's a non-sequitur; they would be stupid to run an
| expensive _L_LM for every search query. This post is not
| about Google Search being replaced by Gemini 2.5 and/or a
| chatbot.
| canucker2016 wrote:
| Google placed its "AI overview" answer at the top of the
| page.
|
| The second result is this reddit.com answer,
| https://www.reddit.com/r/math/comments/32m611/logic_question...,
| where at least the numbers make sense. I haven't examined
| the logic portion of the answer.
|
| Bing doesn't list any reddit posts (that Google-exclusive
| deal) so I'll assume no stackexchange-related sites have
| an appropriate answer (or bing is only looking for hat-
| related answers for some reason).
| moritzwarhier wrote:
| I might have been phrasing poorly. With LLM (or _L_LM as
| intended), I meant their state-of-the-art model, which I
| presume Gemini 2.5 is (didn't come around to TFA yet).
| Not sure if this question is just about model size.
|
| I'm eagerly awaiting an article about RAG caching
| strategies though!
| michaelt wrote:
| Yes, putting an expensive LLM response atop each search
| query would be quite stupid.
|
| You know what would be even stupider? Putting a _cheap,
| wrong_ LLM response atop each search query.
| vicek22 wrote:
| The riddle has different variants with hats
| https://erdos.sdslabs.co/problems/5
| malisper wrote:
| Other models aren't able to solve it so there's something
| else happening besides it being in the training data. You can
| also vary the problem and give it a number like 85 instead of
| 65 and Gemini is still able to properly reason through the
| problem
| lolinder wrote:
| I'm sure you're right that it's more than just it being in
| the training data, but that it's in the training data means
| that you can't draw any conclusions about general
| mathematical ability using just this as a benchmark, even
| if you substitute numbers.
|
| There are lots of possible mechanisms by which this
| particular problem would become more prominent in the
| weights in a given round of training even if the model
| itself hasn't actually gotten any better at general
| reasoning. Here are a few:
|
| * Random chance (these are still statistical machines after
| all)
|
| * The problem resurfaced recently and shows up more often
| than it used to.
|
| * The particular set of RLHF data chosen for this model
| draws out the weights associated with this problem in a way
| that wasn't true previously.
| 7e wrote:
| Sure, but you can't cite this puzzle as proof that this
| model is "better than 95+% of the population at
| mathematical reasoning" when the method of solving (the
| "answer") it is online, and the model has surely seen it.
| mattkevan wrote:
| I think there's a big push to train LLMs on maths problems
| - I used to get spammed on Reddit with ads for data tagging
| and annotation jobs.
|
| Recently these have stopped, and now the ads are about
| becoming a maths tutor for AI.
|
| Doesn't seem like a role with long-term prospects.
| stabbles wrote:
| It gets it wrong when you give it 728. It claims (728, 182,
| 546). I won't share the answer so it won't appear in the
| next training set.
| thaumasiotes wrote:
| > So it's likely that it's part of the training data by now.
|
| I don't think this means what you think it means.
|
| I did some interacting with the Tencent model that showed up
| here a couple days ago (
| https://news.ycombinator.com/item?id=43447254 ):
|
| > Do you know the British band Steeleye Span?
|
| >> Yes, I know the legendary British folk-rock band Steeleye
| Span! They were one of the key representatives of the 1970s
| British folk revival, famous for a distinctive style fusing
| traditional English folk with rock instruments (such as
| electric guitar and drums). The band's core members include
| Maddy Prior (the iconic female vocalist) and Tim Hart (a
| multi-instrumentalist), and later addition Nigel Pegrum (a
| legendary drummer) also added heft to their sound.
|
| >> Their classic works include:
|
| >> - <<Gaudete>>: a stunning adaptation of a medieval Latin
| hymn, which reached number 14 on the UK singles chart;
|
| >> - <<All Around My Hat>>: the title track of the album of
| the same name, which became a folk-rock benchmark;
|
| >> - <<Cam Ye O'er Frae France>>: a playful reading of
| British history.
|
| >> The band's lineup has changed frequently, but they have
| always kept up a creative reworking of traditional music. In
| recent years Maddy Prior has remained active in the music
| world. If you want to feel their wild stage energy, I
| recommend starting with their golden-era 1970s work! Would
| you like specific playlist recommendations or background
| stories? [happy blushing emoji]
|
| > I like their song The Victory. How would you describe it?
|
| >> The <<The Victory>> you mention may refer to some Steeleye
| Span piece, but there is no record of a song by that name on
| their official albums.
|
| >> [further suggestions omitted]
|
| > Huh? Doesn't Steeleye Span's Storm Force Ten include that
| song?
|
| >> The <<Storm Force Ten>> you mention is Steeleye Span's
| tenth studio album, released in 1975. The album continues
| the band's fusion of traditional English folk with hard
| rock, while weaving in Celtic, medieval and narrative
| elements; the whole is full of energy and dramatic tension.
| Below is the complete track list with highlights:
|
| >> Album tracks
|
| >> 1. <<Roll Right Stones>> - Inspired by the legend of the
| stone circle; the opening low drone interweaves with
| electric guitar in a mysterious atmosphere, and Maddy
| Prior's voice is like an epic recitation.
|
| >> 2. <<Let Her Go Down>> - A folk foundation over a rock
| beat; the lyrics allude to seafaring adventure and the
| struggle against fate, with drumming as dense as rolling
| waves.
|
| >> 3. <<Hark! The Heron Voiced>> - A medieval-style
| narrative song depicting a dialogue between nature and
| spirit, with a strong sense of conversation between flute
| and fiddle.
|
| >> 4. <<The Brisk Young Widow>> - A brisk dance rhythm
| telling the humorous story of a widow and a rake, showing
| the band's playful side.
|
| >> 5. <<The Storm Force Ten>> - The title track, using a
| storm as a metaphor for a lovers' conflict; the guitar solo
| tears in like lightning, and the climax is hugely explosive.
|
| >> 6. <<The Green Knight>> - Drawn from Arthurian legend, on
| the theme of a knight's contest with nature; the timbres of
| organ and guitar conjure an ancient battlefield.
|
| >> 7. <<The Blacksmith>> - Adapted from a traditional folk
| song; the heavy-rock arrangement turns the blacksmith's
| hammer blows into a rhythmic bedrock, with Maddy's
| improvised high notes a standout.
|
| >> 8. <<Cam Ye O'er Frae France>> - A satirical political
| folk song mocking British royal history; the twin guitars
| chase each other like crossing swords.
|
| >> 9. <<The Maid of Llanwellyn>> - A Celtic-style lyrical
| piece; the string arrangement weeps and sighs, showing the
| band's tender side.
|
| >> 10. <<The Whisky Jacobites>> - A carnival tune
| celebrating the Scottish spirit of independence; the
| collision of accordion and rock instruments brims with
| festive atmosphere.
|
| >> Album highlights
|
| >> - Immersion in history and legend: from King Arthur to
| the Scottish uprisings, the lyrics read like a miniature
| epic.
|
| >> - Timbral experimentation: organ, accordion and electric
| guitar blend into a unique "folk heavy metal" feel.
|
| >> - Dynamic contrast: storm-fast rock passages alternate
| with delicate folk storytelling, full of tension.
|
| >> If you favour the grandeur of <<The Victory>>, I
| recommend starting with <<The Storm Force Ten>> and <<The
| Green Knight>>; the epic arrangements of these two tracks
| rank among the band's finest work! Would you like listening
| links or lyric analysis? [sunglasses emoji]
|
| This is a question that obviously was in the training data.
| How do you get the answer back out of the training data?
| drexlspivey wrote:
| And if it wasn't, it is now
| ototot wrote:
| I also tried one-shot.
|
| https://g.co/gemini/share/badd00a824d2
| yifanl wrote:
| You'd have better results if you had prompted it with the
| actual answer and asked how the first person came to the
| conclusion. Giving a number in the training set is very easy.
|
| i.e. You observe three people in a magical room. The first
| person is standing underneath a 65, the second person is
| standing underneath a 26 and the third person is standing
| underneath a 39. They can see the others numbers but not the
| one they are directly under. You tell them one of the three
| numbers is the sum of the other two and all numbers are
| positive integers. You ask the first person for their number,
| they respond that they don't know. You ask the second person
| for their number, they respond that they don't know. You ask
| the third person, they respond that they don't know. You ask
| the first person again and they respond with the correct value,
| how did they know?
|
| And of course, if it responds with a verbatim answer in the
| line of
| https://www.reddit.com/r/math/comments/32m611/logic_question...,
| we can be pretty confident what's happening under the
| hood.
| SwayStar123 wrote:
| You can also put the AI in the first person's shoes. Prompt:
| You are standing in a circle, there are 2 other people in the
| circle with you, everyone in the circle, has a positive integer
| above their head, no one knows what the number above their own
| head is but can see the numbers above the heads of the other
| people. You see that the person infront of you on the left has
| 26 above their head. The person on the right has 39 above their
| head. You are told that the sum of two of the numbers is the
| third number. You are asked what the number above your head is,
| the option is the sum, 65, or 13, as 26 + 13 = 39. You don't
| know which one it is, and you say so. The second person is
| asked the number above their head. They also say they dont
| know, the third person also says they dont know. What is your
| number?
|
| Gemini 2.5 and claude 3.7 thinking get it right, o3 mini and 4o
| get it wrong
| adpirz wrote:
| Interactive playground for the puzzle:
| https://claude.site/artifacts/832e77d7-5f46-477c-a411-bdad10...
|
| (All state is stored in localStorage so you can come back to it
| :) ).
| doener wrote:
| ChatGPT 4.5:
|
| Let's carefully analyze this puzzle step-by-step:
|
| [?]
|
| Step 1: Define the Problem Clearly
|
| We have three people in a circle. Let's call them: * Person A
| (first person) * Person B (second person) * Person C (third
| person)
|
| Each has a positive integer above their head. Each sees the
| other two numbers but not their own. The numbers satisfy: A + B
| = C \quad \text{or} \quad B + C = A \quad \text{or} \quad C + A
| = B
|
| Given the puzzle scenario, we need to find out the exact
| numbers given the clues, and ultimately find the product of the
| three numbers.
|
| [?]
|
| Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
|
| Initial conditions: The numbers are positive integers. The
| puzzle implies a situation of ambiguity at first, but after
| multiple rounds of "I don't know," Person A finally knows their
| number is 65.
|
| Let's denote clearly the conditions again: * Exactly one of the
| following three equations is true: * A + B = C, or * B + C = A,
| or * C + A = B.
|
| Person A sees B and C, but not A. Similarly for Persons B and
| C.
|
| [?]
|
| Step 3: Logical deductions from the repeated "I don't know"
|
| First round deductions: * Person A initially doesn't know their
| number. This means, from Person A's perspective, both scenarios
| A = B + C and A + B = C (or A + C = B) are still possible. If
| Person A had seen identical numbers, for instance, Person A
| could have instantly known their number. Thus, Person A sees
| two different numbers. * Similarly, Person B doesn't know
| either, implying that Person B also sees ambiguity in their own
| number. Thus, Person B also sees two numbers that could result
| in at least two possible solutions. * Person C also doesn't
| know, same logic as above.
|
| Thus, after one full round, we understand that initially, all
| three numbers must create ambiguity. No trivial case (like two
| identical numbers) is present.
|
| Second round deductions:
|
| After hearing that no one knows their number, each person gains
| information: they now know the scenario is one in which none of
| the other two people could have immediately solved it.
|
| Crucially, in the second round: * Person A now knows their
| number is 65 after considering the previous statements. This
| implies that Person A sees numbers B and C, and from these
| numbers plus the statements of uncertainty, Person A deduces
| that their own number can only be 65.
|
| [?]
|
| Step 4: Deriving the Numbers (Critical deduction)
|
| Let's denote clearly again: Person A's number is now known to
| be 65.
|
| We have three possible equations. Let's test each carefully
| given A = 65: 1. Case 1: A + B = C Then we have: * 65 + B = C.
| Person A sees numbers B and C, so they would see B and 65 + B.
| Person A sees numbers B and 65+B. If Person A sees these two
| numbers, they might initially consider their number could be
| either the sum or difference of these two. * However, if Person
| A sees two numbers, and one is exactly 2x the other, Person A
| could immediately guess their number is the difference (since
| if you see numbers X and 2X, your number is clearly X). So this
| scenario of B and 65+B is okay only if these two numbers don't
| exhibit an immediate trivial scenario. * Let's hold this
| scenario and continue examining others first. 2. Case 2: B + C
| = A (So, B + C = 65) * Then Person A sees numbers B and C,
| adding exactly to 65. * Initially, Person A sees two numbers
| summing exactly to 65. Could Person A know their number
| immediately at first glance? No, since their number could
| either be the sum (65) or the difference of the two numbers.
| Ambiguity here initially makes sense. * Initially, Person B and
| Person C also see ambiguous scenarios. They both don't know
| their numbers either at first round. This scenario is
| plausible. 3. Case 3: C + A = B (So, C + 65 = B) * Then Person
| A sees numbers B = C + 65 and C. * Person A sees C and C+65.
| Similar reasoning as Case 1.
|
| [?]
|
| Step 5: Identifying the Correct Scenario
|
| The puzzle's logic typically aligns best with Case 2: B + C = A
| = 65.
|
| Here's why clearly stated: * Initially, Person A sees two
| numbers adding to exactly 65. At first glance, Person A's
| number could either be 65 (sum) or the difference between the
| two numbers. * Person B also initially sees two numbers and
| cannot distinguish immediately their own number, nor can Person
| C. * After everyone states "I don't know," Person A gains
| critical insight: if the difference between the two numbers
| (that A sees) allowed Person B or Person C to know their number
| immediately, they'd have known right away. The fact that B and
| C didn't know means the difference is not immediately obvious
| to B or C. Thus, the difference cannot be trivial (such as
| identical numbers or easy multiples).
|
| Therefore, the two numbers must be such that their difference
| doesn't immediately clue in the other persons. This occurs most
| clearly with pairs of numbers which have multiple factorization
| possibilities or aren't trivial multiples.
|
| But we must carefully test quickly to find a suitable,
| plausible integer pair whose sum is 65 and whose ambiguity
| would logically survive through multiple rounds. The puzzle is
| a classic logic puzzle with a known structure:
|
| A common scenario in these classic puzzles is the use of
| Fibonacci-like or similar "golden" sets. Let's quickly check
| the following well-known ambiguous triple: * Numbers: 65, 39,
| 26 * Check quickly: indeed, 39 + 26 = 65.
|
| Checking logic quickly: * Person A sees 39 and 26. Initially
| ambiguous (number could be 65 or 13). * Person B sees 65 and
| 26. Could be either 39 or 91 at first glance. Initially
| ambiguous. * Person C sees 65 and 39. Could be either 26 or 104
| initially. Initially ambiguous.
|
| After each person says they don't know: * Now, Person A
| considers: If my number was 13, then Person B would see numbers
| 13 and 26. Person B would then immediately know their number is
| 39 (13+26=39), no ambiguity. But Person B didn't know! So
| Person A's number cannot be 13. Hence, Person A confidently
| concludes their number is 65.
|
| Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions
| perfectly!
|
| [?]
|
| Step 6: Final Answer (Product of the Numbers)
|
| The three numbers identified clearly are: * A = 65 * B = 39 * C
| = 26
|
| Then, the product is: 65 \times 39 \times 26 = 65 \times 1014 =
| 65,910
|
| [?]
|
| Final Answer:
|
| \boxed{65,910}
| _cs2017_ wrote:
| This is solvable in roughly half an hour on pen and paper by a
| random person I picked with no special math skills (beyond a
| university education). This is far from a difficult problem.
| The "95%+" in math reasoning is a meaningless standard; it's
| like saying a model is better than 99.9% of the world
| population at Albanian, since less than 0.1% bother to learn
| Albanian.
|
| Even ignoring the fact that this or a similar problem may
| have appeared in the training data, it's something careful
| brute-force logic should solve. It's neither difficult, nor
| interesting, nor useful. Yes, it may suggest a slight
| improvement on the basic logic, but no more so than a million
| other benchmarks people quote.
|
| This goes to show that evaluating models is not a trivial
| problem. In fact, it's a hard problem (in particular, far,
| far harder than this math puzzle).
| highfrequency wrote:
| Fun puzzle! I'm curious how you managed to structure the
| problem such that a computer could solve it but it took 3 days
| of computation?
| dkjaudyeqooe wrote:
| > I think it's not an exaggeration to say LLMs are now better
| than 95+% of the population at mathematical reasoning.
|
| It's not an exaggeration it's a non-sequitur, you first have to
| show that the LLMs are reasoning in the same way humans do.
| r0fl wrote:
| Wow
|
| Tried this in deepseek and grok and it kept thinking in loops
| for a while and I just turned it off
|
| I haven't seen a question loop this long ever.
|
| Very impressed
| honeybadger1 wrote:
| Claude is still the king right now for me. Grok is 2nd in line,
| but sometimes it's better.
| nickandbro wrote:
| Wow, was able to nail the pelican riding on a bicycle test:
|
| https://www.svgviewer.dev/s/FImn7kAo
| anon373839 wrote:
| That's actually too good to believe. I have a feeling simonw's
| favorite test has been special-cased...
| Workaccount2 wrote:
| It seems pretty good at it. The hair on the boy is messed up,
| but still decent.
|
| "A boy eating a sandwich"
|
| https://www.svgviewer.dev/s/VhcGxnIR
|
| "A multimeter"
|
| https://www.svgviewer.dev/s/N5Dzrmyt
| sebzim4500 wrote:
| I doubt it is explicitly special cased, but now that it's all
| over twitter etc. it will have ended up many times in the
| training data.
| KTibow wrote:
| They could've RLed on SVGs - wouldn't be hard to render them,
| test adherence through Gemini or CLIP, and reward fittingly
| xnx wrote:
| It will be huge achievement if models can get to the point where
| so much selection effort isn't required: gemini.google.com
| currently lists 2.0 Flash, 2.0 Flash Thinking (experimental),
| Deep Research, Personalization (experimental), and 2.5 Pro
| (experimental) for me.
| lxgr wrote:
| There's probably a sweet spot here. On the flip side, ChatGPT
| currently doesn't indicate whether a given image generation
| request was serviced by multimodal GPT-4o [1] or Dall-E.
|
| Personally, I do like the "use web search" and "extended
| thinking" buttons, but ultimately, the models should probably
| be able to figure out whether doing so would be useful
| themselves too.
|
| [1] https://news.ycombinator.com/item?id=43474112
| resource_waste wrote:
| I'll try it tonight, but I'm not excited; it's just work.
|
| ChatGPT4.5, I was excited.
|
| Deepseek, I was excited. (then later disappointed)
|
| I know Gemini probably won't answer any medical question,
| even if you are a doctor. ChatGPT will.
|
| I know I've been disappointed at the quality of Google's AI
| products. They are backup at best.
| FergusArgyll wrote:
| It interpreted blood work for me
|
| (Everything's ok, I'm just testing it ;)
| fourseventy wrote:
| Does it think the founding fathers were a diverse group of mixed
| races and genders like the last model did?
| eenchev wrote:
| "My info, the stuff I was trained on, cuts off around early
| 2023." - Gemini 2.5 to me. It appears they chose a
| not-so-recent knowledge cutoff in order to use the best
| possible base model.
| staticman2 wrote:
| It's unlikely the model knows its actual cutoff date. Try
| asking about 2024 news: for example, in my test it knows the
| January 2024 Oscar nominees.
|
| On AI studio the model told me today is June 13 2024.
| afro88 wrote:
| Is this the first model announcement where they show Aider's
| Polyglot benchmark in the performance comparison table? That's
| huge for Aider and anotherpaulg!
| Dowwie wrote:
| This model is a fucking beast. I am so excited about the
| opportunities this presents.
| arjun_krishna1 wrote:
| I've been using Gemini Pro for my University of Waterloo capstone
| engineering project. Really good understanding of PDF documents
| and good reasoning as well as structured output. Recommend
| trying it out at aistudio dot google dot com
| summerlight wrote:
| This looks like the first model where Google seriously comes
| back into the frontier competition? 2.0 Flash was nice for the
| price, but it's more focused on efficiency than performance.
| dcchambers wrote:
| > Developers and enterprises can start experimenting with Gemini
| 2.5 Pro in Google AI Studio now, and Gemini Advanced users can
| select it in the model dropdown on desktop and mobile. It will be
| available on Vertex AI in the coming weeks.
|
| I'm a Gemini Advanced subscriber, still don't have this in the
| drop-down model selection in the phone app, though I do see it on
| the desktop webapp.
| ehsankia wrote:
| I see it in both, probably just some gradual rollout delays.
| cp9 wrote:
| does it still suggest glue on pizza
| lvl155 wrote:
| With recent pace of model updates, I wonder which factor is more
| important: hardware assets, software/talent, or data access.
| Google clearly is in the lead in terms of data access in my view.
| If I were top talent in AI, I'd go where I can work with the
| best data, no?
| CuriouslyC wrote:
| The problem Goog has is its insane bureaucracy and lack of
| vision from Sundar, which isn't very attractive from an
| employee position. If you're working close to Demis I imagine
| the situation is better though.
| rfoo wrote:
| Now that Noam is back I'm a little bit more optimistic.
| danpalmer wrote:
| I think an argument could be made for hardware too. Perhaps in
| absolute terms Nvidia is ahead, but in terms of knowing how to
| get the most out of the hardware, Google making its own chips,
| building on their networking, etc, is a pretty big advantage.
|
| (Disclaimer, Googler, but I don't work on any of this, I only
| have an external layperson's understanding of it)
| d3nj4l wrote:
| A model that is better on Aider than Sonnet 3.7? For _free_,
| right now? I think I'll give it a spin this weekend on a couple
| of projects, seems too good to be true.
| istjohn wrote:
| With a rate limit of 50 requests per day
| pachico wrote:
| It really surprises me that Google and Amazon, considering their
| infrastructure and the urge to excel at this, aren't leading the
| industry.
| WasimBhai wrote:
| I do not intend to take anything away from the technical
| achievement of the team. However, as Satya opined some weeks
| back, these benchmarks do not mean a lot if we do not see a
| comparable increase in productivity.
|
| But then there are two questions. First, are the white-collar
| workers responsible for the increase in productivity
| specifically consultants and engineers? Or is it the
| white-collar workers at the very right tail, e.g., scientists?
|
| I think consultants and engineers are using these technologies a
| lot. I think biologists at least are using these models a lot.
|
| But then where are the productivity increases?
| ZeroTalent wrote:
| Also, why not compare to GPT-o3 in the benchmarks?
| KoolKat23 wrote:
| The model's not really available.
| KoolKat23 wrote:
| As it stands currently, I'd say this is difficult to measure.
|
| These models aren't baked into workflows where the measurable
| output can easily be attributed to model use. Productivity in
| its current form is transformative in the sense that the use
| case and the gain differ for each individual (who even
| provides different prompts). So some are keeping the gains for
| themselves, while others are using them to improve quality
| rather than quantity.
|
| It'll come in time; it's important to remember GPT-4 was
| released 2 years ago this month. The newer models are more
| reliable and could probably be introduced into workflows more
| widely. Today I spoke to a company that is looking to use it
| to reduce costs in the next year.
| danpalmer wrote:
| That's true, but productivity has many factors and takes a long
| time to get confidence on. Any productivity value that could be
| stated clearly would have similar downsides to a benchmark, and
| take far longer.
|
| Benchmarks are useful as leading indicators. Early warning
| signs. If there's no relation to the eventual productivity then
| hopefully that benchmark will disappear as it's not useful.
|
| In a fast moving space like this it's reasonable to make use of
| leading indicators.
| jorl17 wrote:
| Just a couple of days ago I wrote on reddit about how long
| context models are mostly useless to me, because they start
| making too many mistakes very fast. They are vaguely helpful for
| "needle in a haystack" problems, not much more.
|
| I have a "test" which consists of sending it a collection of
| almost 1000 poems, which currently sit at around ~230k tokens,
| and then asking a bunch of stuff which requires reasoning over
| them. Sometimes, it's something as simple as "identify key
| writing periods and their differences" (the poems are ordered
| chronologically). Previous models don't usually "see" the final
| poems -- they get lost, hallucinate and are pretty much
| worthless. I have tried several workaround techniques with
| varying degrees of success (e.g. randomizing the poems).
|
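| The randomization workaround mentioned above can be sketched
| roughly like this. Everything here is hypothetical (the poem
| list, the prompt format); the idea is just to shuffle the
| documents while tagging each with its original position, so
| positional bias doesn't always hit the same poems and the
| chronology stays recoverable:

```python
import random

# Hypothetical corpus standing in for the ~1000 real poems.
poems = [f"Poem {i} text ..." for i in range(1000)]

def build_prompt(poems, question, seed=None):
    """Shuffle the poems before building the prompt. Long-context
    models often attend unevenly across the window, so shuffling
    with different seeds varies which poems land in the weak
    spots. Each poem keeps a tag with its original
    (chronological) index so ordering questions stay answerable."""
    order = list(range(len(poems)))
    random.Random(seed).shuffle(order)
    body = "\n\n".join(f"[poem #{i}] {poems[i]}" for i in order)
    return f"{body}\n\nQuestion: {question}"

prompt = build_prompt(
    poems, "Identify key writing periods and their differences.", seed=42
)
```

| (One shuffled prompt per query; re-asking with a different
| seed is a cheap consistency check on the answer.)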
| Having just tried this model (I have spent the last 3 hours
| probing it), I can say that, to me, this is a breakthrough
| moment. Truly a leap. This is the first model that can
| consistently comb through these poems (200k+ tokens) and analyse
| them as a whole, without significant issues or problems. I have
| no idea how they did it, but they did it.
|
| The analysis of this poetic corpus has few mistakes and is
| very, very, very good. Certainly very good in terms of how
| quickly it produces an answer; doing the same manually would
| take someone days or weeks of thorough analysis.
|
| Of course, this isn't about poetry -- it's about passing in huge
| amounts of information, without RAG, and having a high degree of
| confidence in whatever reasoning tasks this model performs. It is
| the first time that I feel confident that I could offload the
| task of "reasoning" over a large corpus of data to an LLM. The
| mistakes it makes are minute, it hasn't hallucinated, and the
| analysis is, frankly, better than what I would expect of most
| people.
|
| Breakthrough moment.
| Alifatisk wrote:
| Two years ago, Claude was known for having the largest context
| window and being able to remember tokens throughout the whole
| conversation.
|
| Today, it seems like Google has beaten them: Gemini supports a
| far larger context window and is much better at keeping track
| of what has been said and recalling older tokens.
| greatgib wrote:
| If you plan to use Gemini, be warned, here are the usual Big
| Tech dragons: "Please don't enter ... confidential info or any
| data ... you wouldn't want a reviewer to see or Google to use
| ..."
|
| The full extract of the terms of usage:
|
|     How human reviewers improve Google AI
|
|     To help with quality and improve our products (such as the
|     generative machine-learning models that power Gemini
|     Apps), human reviewers (including third parties) read,
|     annotate, and process your Gemini Apps conversations. We
|     take steps to protect your privacy as part of this
|     process. This includes disconnecting your conversations
|     with Gemini Apps from your Google Account before reviewers
|     see or annotate them. Please don't enter confidential
|     information in your conversations or any data you wouldn't
|     want a reviewer to see or Google to use to improve our
|     products, services, and machine-learning technologies.
| greatgib wrote:
| And even worse:
|
|     Conversations that have been reviewed or annotated by
|     human reviewers (and related data like your language,
|     device type, location info, or feedback) are not deleted
|     when you delete your Gemini Apps activity because they are
|     kept separately and are not connected to your Google
|     Account. Instead, they are retained for up to three years.
|
| Emphasis on "retained for up to three years" even if you delete
| it!!
| mastodon_acc wrote:
| How does it compare to OpenAI and anthropic's user data
| retention policy?
| greatgib wrote:
| If I'm not wrong, ChatGPT clearly states that it no longer
| trains on user data by default.
|
| Also, maybe some services are doing "machine learning"
| training with user data, but this is the first time I've seen
| a recent LLM service say that it can feed your data to human
| reviewers at will.
| cubefox wrote:
| They seem to use it as long as the chat history is
| enabled, similar to Gemini.
| https://help.openai.com/en/articles/7792795-how-do-i-
| turn-of...
| kccqzy wrote:
| Well they can't delete a user's Gemini conversations because
| they don't know which user a particular conversation comes
| from.
|
| This seems better, not _worse_, than keeping the
| user-conversation mapping so that the user may delete their
| conversations.
| mastodon_acc wrote:
| I mean, this is pretty standard for online LLMs. What is
| Gemini doing here that OpenAI or Anthropic aren't already
| doing?
| KoolKat23 wrote:
| I don't think these are the same as the AI Studio and API
| terms. This looks like the consumer-facing Gemini T&Cs.
| sauwan wrote:
| I'm assuming this is true of all experimental models? That's
| not true with their models if you're on a paid tier though,
| correct?
| summerlight wrote:
| You can use a paid tier to avoid such issues. Not sure what
| you're expecting from these "experimental" models, which are
| in development and need user feedback.
| suyash wrote:
| All the more reason for new privacy guidelines, especially for
| big tech and AI.
| cavisne wrote:
| Google is the best of these. You either pay per token and there
| is no training on your inputs, or it's free/a small monthly fee
| and there is training.
| TheMagicHorsey wrote:
| I tested out Gemini 2.5 and it failed miserably at calling into
| tools that we had defined for it. Also, it got into an infinite
| loop a number of times where it would just spit out the exact
| same line of text continuously until we hard killed the process.
| I really don't know how others are getting these amazing results.
| We had no problems using Claude or OpenAI models in the same
| scenario. Even Deepseek R1 works just fine.
| batata_frita wrote:
| Why do I have the feeling that nobody is as excited about
| Google's models as about other companies'?
| Mond_ wrote:
| Google is worse at marketing and hyping people up.
| bklyn11201 wrote:
| Most of us care only about coding performance, and Sonnet 3.5
| has been such a giant winner that we don't get too excited
| about the latest model from Google.
| SamuelAdams wrote:
| Because it's more likely to be sunsetted.
|
| https://killedbygoogle.com/
| owlninja wrote:
| Yeah, I get a little bummed, but I guess a lot of HNers have
| reasons not to like Google. I've had a Google One membership
| forever, so I opted for the higher subscription with Gemini
| access since the beginning (plus a free year with a new Pixel
| phone), and I think it is awesome.
| ipsum2 wrote:
| They're not good models. They overfit to the LMArena
| leaderboard, but perform worse in real-life scenarios compared
| to their competitors.
|
| The exceptions are autoregressive image generation and audio
| models.
| Davidzheng wrote:
| I feel like Google intentionally doesn't want people to be as
| excited. This is a very good model; definitely the best model
| available today.
| crimsoneer wrote:
| Because most of the LLM hype is still generated by people who
| don't use them in production, and those people don't use GCP.
| asah wrote:
| It nailed my two hard reasoning+linguistic+math questions in
| one shot, both the kinds of things that LLMs struggle with but
| humans do well.
|
| (DM me for the questions)
| rodolphoarruda wrote:
| I've been trying to use Gemini 2.0 Flash, but I don't think
| I'm actually getting it: the model still thinks it's running
| as 1.5 Pro.
|
| Reference: https://rodolphoarruda.pro.br/wp-
| content/uploads/image-14.pn...
| strstr wrote:
| It's a lot better at my standard benchmark "Magic: The Gathering"
| rules puzzles. Gets the answers right (both the outcome and
| rationale).
| strstr wrote:
| Ooof, it failed my "Wheel of Potential" bug finding question,
| and got aggressive about asserting it was correct.
| Davidzheng wrote:
| On initial thoughts, I think this might be the first AI model to
| be reliably helpful as a research assistant in pure mathematics
| (o3-mini-high can be helpful but is more prone to hallucinations)
| simonw wrote:
| I'm impressed by this one. I tried it on audio transcription with
| timestamps and speaker identification (over a 10 minute MP3) and
| drawing bounding boxes around creatures in a complex photograph
| and it did extremely well on both of those.
|
| Plus it drew me a very decent pelican riding a bicycle.
|
| Notes here: https://simonwillison.net/2025/Mar/25/gemini/
| kridsdale3 wrote:
| I just tried your trademark benchmark on the new 4o Image
| Output, though it's not the same test:
|
| https://imgur.com/a/xuPn8Yq
| Unroasted6154 wrote:
| Was that an actual SVG?
| simonw wrote:
| No, that's GPT-4o native image output.
| sebzim4500 wrote:
| I wonder how far away we are from models which, given
| this prompt, generate that image in the first step in
| their chain-of-thought and then use it as a reference to
| generate SVG code.
|
| It could be useful for much more than just silly
| benchmarks, there's a reason why physics students are
| taught to draw a diagram before attempting a problem.
| simonw wrote:
| Someone managed to get ChatGPT to render the image using
| GPT-4o, then save that image to a Code Interpreter
| container and run Python code with OpenCV to trace the
| edges and produce an SVG:
| https://bsky.app/profile/btucker.net/post/3lla7extk5c2u
| og_kalu wrote:
| One of the biggest problems with hands-off LLM writing (for
| long-horizon stuff like novels) is that you can't really give
| them any details of your story, because they get absolutely
| neurotic with them.
|
| Imagine, for instance, you give the LLM the profile of the
| love interest for your epic fantasy: it will almost always
| have the main character meet them within 3 pages (usually page
| 1), which is of course absolutely nonsensical pacing. No
| attempt to tell it otherwise changes anything.
|
| This is the first model whose output, after 19 pages generated
| so far, resembles anything like normal pacing, even with a TON
| of details.
| I've never felt the need to generate anywhere near this much.
| Extremely impressed.
|
| Edit: Sharing it -
| https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
|
| with pastebin - https://pastebin.com/aiWuYcrF
| fshr wrote:
| I think you would be better off having the LLM help you build
| up the plot with high level chapter descriptions and then have
| it dig into each chapter or arc. Or start by giving it the
| beats before you ask it for help with specifics. That'd be
| better at keeping it on rails.
| og_kalu wrote:
| I don't disagree. As with almost anything else involving LLMs,
| getting hands-on produces better results, but because in this
| instance I much prefer to be the reader rather than the author
| or editor, it's really important to me that an LLM can pace
| long-form writing properly on its own.
| Workaccount2 wrote:
| Can you share it on a text sharing site? It seems you hit your
| share quota
| og_kalu wrote:
| Sure - https://pastebin.com/aiWuYcrF
| nout wrote:
| Using the AI in multiple phases is the approach that can handle
| this. Similarly to "Deep Research" approach - you can tell it
| to first generate a storyline with multiple twists and turns.
| Then ask the model to take this storyline and generate prompts
| for individual chapters. Then ask it to generate the individual
| chapters based on the prompts, etc.
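| A minimal sketch of that phased approach, with a stub standing
| in for the actual model call (any chat-completion API would
| slot in where `llm` is; all names here are hypothetical):

```python
# Sketch of the multi-phase, "Deep Research"-style pipeline
# described above. `llm` is a stub so the structure is runnable
# on its own; in practice it would call a real model.
def llm(prompt: str) -> str:
    return f"<generated from: {prompt.splitlines()[0][:50]}>"

def write_story(premise: str, n_chapters: int = 3) -> list[str]:
    # Phase 1: a high-level storyline with twists and turns.
    storyline = llm(f"Write a storyline with twists for: {premise}")
    # Phase 2: turn the storyline into one prompt per chapter.
    chapter_prompts = [
        llm(f"Write a prompt for chapter {i} of this storyline:\n"
            f"{storyline}")
        for i in range(1, n_chapters + 1)
    ]
    # Phase 3: expand each chapter prompt into the chapter itself.
    return [llm(f"Write the full chapter:\n{p}") for p in chapter_prompts]

chapters = write_story("an epic fantasy with a slow-burn romance")
```

| Keeping each phase's output small also sidesteps the pacing
| problem mentioned upthread: the model never has to hold the
| whole book in one generation.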
| mclau156 wrote:
| Generated 1000 lines of turn-based combat with a shop, skills,
| stats, elements, enemy types, etc. with this one.
| nikcub wrote:
| Impressive model, but I'm confused by the knowledge cutoff. AI
| Studio says it is January 2025 (which would be impressive),
| but query it about anything from early 2025 or mid/late 2024
| and it self-reports that its cutoff is in 2023 (which can't be
| right).
|
| This is most evident when querying about fast-moving dev tools
| like uv or bun. It seems to only know the original uv commands
| like pip and tools; with bun, it is unfamiliar with bun
| outdated (from Aug 2024) and bun workspaces (from around that
| time?), but does know how to install bun on Windows (April
| 2024).
|
| You'll still need to provide this model with a lot of context
| to use it with any tooling or libraries that have breaking
| changes or new features from the past ~year, which seems to
| contradict the knowledge cutoff AI Studio reports.
|
| Were I developing models, I'd prioritise squeezing in the most
| recent knowledge of popular tools and libraries, since
| development is such a popular (and revenue-generating) use
| case.
| joshdavham wrote:
| When these companies release a model "2.5", are they using some
| form of semver? Where are these numbers coming from?
| zone411 wrote:
| Scores 54.1 on the Extended NYT Connections Benchmark, a large
| improvement over Gemini 2.0 Flash Thinking Experimental 01-21
| (23.1).
|
| 1 o1-pro (medium reasoning) 82.3
|
| 2 o1 (medium reasoning) 70.8
|
| 3 o3-mini-high 61.4
|
| 4 Gemini 2.5 Pro Exp 03-25 54.1
|
| 5 o3-mini (medium reasoning) 53.6
|
| 6 DeepSeek R1 38.6
|
| 7 GPT-4.5 Preview 34.2
|
| 8 Claude 3.7 Sonnet Thinking 16K 33.6
|
| 9 Qwen QwQ-32B 16K 31.4
|
| 10 o1-mini 27.0
|
| https://github.com/lechmazur/nyt-connections/
___________________________________________________________________
(page generated 2025-03-25 23:00 UTC)