[HN Gopher] Gemini 2.5
       ___________________________________________________________________
        
       Gemini 2.5
        
       Author : meetpateltech
       Score  : 523 points
       Date   : 2025-03-25 17:01 UTC (5 hours ago)
        
 (HTM) web link (blog.google)
 (TXT) w3m dump (blog.google)
        
       | jasonpeacock wrote:
       | Isn't every new AI model the "most <adjective>"?
       | 
       | Nobody is going to say "Announcing Foobar 7.1 - not our best!"
        
         | wigster wrote:
         | gobble 2.0 - a bit of a turkey
        
         | minimaxir wrote:
         | GPT-4.5's announcement was the equivalent of that.
         | 
         | "It beats all the benchmarks...but you really really don't want
         | to use it."
        
           | forbiddenvoid wrote:
            | They even priced it so people would avoid using it.
            | GPT-4.5's entire function was to serve as an anchor,
            | keeping OpenAI in the news and keeping up the perception
            | of releasing quickly.
        
             | glenstein wrote:
             | My assumption was that the pricing was because it really
             | was that expensive for whatever reason. I'm keeping fingers
             | crossed that they're going to do some kind of 4.5 mini at
             | some point that will be more affordable.
        
         | thethimble wrote:
         | Except for GPT 4.5 and Claude 3.7 :/
        
         | smilekzs wrote:
         | You're not wrong, but that just means the <adjective> is where
         | the bulk of information resides. The trade-off matters. Maybe
         | it's a model with good enough quality but really cheap to
         | serve. Maybe it's a model that only plays poker really well but
         | sucks at everything else because it bluffs too much. Etc. etc.
        
         | Maxatar wrote:
          | Sure, but that adjective matters. Could be cheapest,
          | "intelligent", fastest, etc... it's rarely all of them at
          | once.
        
         | zamadatix wrote:
         | Stable Diffusion definitely had a few of those.
        
         | SirMaster wrote:
          | Same with new phones. The new phone always has the fastest
          | CPU and GPU, and the best camera ever!
        
       | jharohit wrote:
        | Why not enable Canvas for this model on Gemini.google.com?
        | Arguably the weakest link of Canvas is the terrible code that
        | Gemini 2.0 Flash writes for Canvas to run...
        
         | aoeusnth1 wrote:
         | I'm guessing it should be enabled eventually. @logankilpatrick
         | thoughts?
        
       | cj wrote:
       | Slight tangent: Interesting that they use o3-mini as the
       | comparison rather than o1.
       | 
       | I've been using o1 almost exclusively for the past couple months
       | and have been impressed to the point where I don't feel the need
       | to "upgrade" for a better model.
       | 
       | Are there benchmarks showing o3-mini performing better than o1?
        
         | boldlybold wrote:
         | I find o3 at least faster to get to the response I care about,
         | anecdotally.
        
         | FloorEgg wrote:
          | I noticed this too. I have used both o1 and o3-mini
          | extensively, and I have run many tests on my own problems;
          | o1 solves one of my hardest prompts quite reliably, but
          | o3-mini is very inconsistent. So from my anecdotal
          | experience, o1 is the superior model in terms of
          | capability.
          | 
          | The fact they would exclude it from their benchmarks seems
          | biased/desperate and makes me trust them less. They
          | probably thought it was clever to leave o1 out, something
          | like "o3 is the newest model, let's just compare against
          | that", but I think for anyone paying attention that
          | decision will backfire.
        
         | jnd0 wrote:
          | Probably because it is more similar to o3-mini in terms of
          | size/parameters as well as price (although I would expect
          | this to be at least half the price)
        
         | logicchains wrote:
         | It's a reasonable comparison given it'll likely be priced
         | similarly to o3-mini. I find o1 to be strictly better than
         | o3-mini, but still use o3-mini for the majority of my agentic
         | workflow because o1 is so much more expensive.
        
         | PunchTornado wrote:
          | Why would you compare against all the models from a
          | competitor? You take their latest one that you can test.
          | OpenAI and Anthropic don't compare against the whole
          | Gemini family.
        
         | kmod wrote:
         | The benchmark numbers don't really mean anything -- Google says
         | that Gemini 2.5 Pro has an AIME score of 86.7 which beats
         | o3-mini's score of 86.5, but OpenAI's announcement post [1]
         | said that o3-mini-high has a score of 87.3 which Gemini 2.5
         | would lose to. The chart says "All numbers are sourced from
         | providers' self-reported numbers" but the only mention of
         | o3-mini having a score of 86.5 I could find was from this other
         | source [2]
         | 
         | [1] https://openai.com/index/openai-o3-mini/ [2]
         | https://www.vals.ai/benchmarks/aime-2025-03-24
         | 
         | You just have to use the models yourself and see. In my
         | experience o3-mini is much worse than o1.
        
       | vineyardmike wrote:
        | I wonder what about this one earns the +0.5 in the name. IIRC
        | the 2.0 model isn't particularly old yet. Is it purely
        | marketing, or does it represent a new model structure,
        | iteratively more training data over the base 2.0, new serving
        | infrastructure, etc.?
       | 
       | I've always found the use of the *.5 naming kinda silly when it
        | became a thing. When OpenAI released 3.5, they said they
        | already had 4 underway at the time; they were just tweaking 3
        | to be better for ChatGPT. It felt like a scrappy startup
        | name, and now it's spread across the industry. Anthropic
        | naming their models Sonnet
       | 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this
       | naming scheme.
       | 
        | I'm a much bigger fan of semver (not skipping to .5, though),
        | date-based names ("Gemini Pro 2025"), or number + meaningful
        | letter (e.g. 4o, "Omni") for model names.
        
         | forbiddenvoid wrote:
         | I would consider this a case of "expectation management"-based
         | versioning. This is a release designed to keep Gemini in the
         | news cycle, but it isn't a significant enough improvement to
         | justify calling it Gemini 3.0.
        
           | jstummbillig wrote:
            | I think it's reasonable. The development process is just
            | not really comparable to other software engineering: it's
            | fairly clear that currently nobody _really_ has a good
            | grasp on what a model will be while it is being trained.
            | But they do have expectations. So you do the training,
            | and then you assign the increment to align the two.
        
           | KoolKat23 wrote:
           | Funnily enough, from early indications (user feedback) this
           | new model would've been worthy of the 3.0 moniker, despite
           | what the benchmarks say.
        
         | falcor84 wrote:
         | Agreed, can't everyone just use semantic versioning, with 0.1
         | increments for regular updates?
        
           | laurentlb wrote:
           | Regarding semantic versioning: what would constitute a
           | breaking change?
           | 
           | I think it makes sense to increase the major / minor numbers
           | based on the importance of the release, but this is not
           | semver.
        
             | falcor84 wrote:
             | As I see it, if it uses a similar training approach and is
             | expected to be better in every regard, then it's a minor
             | release. Whereas when they have a new approach and where
             | there might be some tradeoffs (e.g. longer runtime), it
             | should be a major change. Or if it is very significantly
             | different, then it should be considered an entirely
             | differently named model.
        
           | morkalork wrote:
            | Or drop the pretense of version numbers entirely, since
            | they're meaningless here, and go back to classics like
            | Gemini Experience, Gemini: Millennium Edition, or Gemini
            | New Technology
        
           | joaogui1 wrote:
           | Would be confusing for non-tech people once you did x.9 ->
           | x.10
        
           | guelo wrote:
           | What would a major version bump look like for an llm?
        
         | Workaccount2 wrote:
         | At least for OpenAI, a .5 increment indicates a 10x increase in
         | training compute. This so far seems to track for 3.5, 4, 4.5.
        
         | aoeusnth1 wrote:
          | I think it's because of the big jump in coding benchmarks.
          | 74% on Aider is just much, much better than before and
          | worthy of a .5 upgrade.
        
         | alphabetting wrote:
         | The elo jump and big benchmark gains could be justification
        
       | ekojs wrote:
       | > This will mark the first experimental model with higher rate
       | limits + billing. Excited for this to land and for folks to
       | really put the model through the paces!
       | 
       | From https://x.com/OfficialLoganK/status/1904583353954882046
       | 
       | The low rate-limit really hampered my usage of 2.0 Pro and the
       | like. Interesting to see how this plays out.
        
         | chipgap98 wrote:
         | Any word on what that pricing is? I can't seem to find it
        
           | rvnx wrote:
           | Traditionally at Google experimental models are 100% free to
           | use on https://aistudio.google.com (this is also where you
           | can see the pricing) with a quite generous rate limit.
           | 
           | This time, the Googler says: "good news! you will be charged
           | for experimental models, though for now it's still free"
        
             | chipgap98 wrote:
             | Right but the tweet I was responding to says: "This will
             | mark the first experimental model with higher rate limits +
             | billing. Excited for this to land and for folks to really
             | put the model through the paces!"
             | 
             | I assumed that meant there was a paid version with a higher
             | rate limit coming out today
        
             | KoolKat23 wrote:
              | I expect this might be pricier. Hoping it's not
              | unusably expensive.
        
           | xnx wrote:
           | Currently free, but only 50 requests/day.
        
       | M4v3R wrote:
       | The Long Context benchmark numbers seem super impressive. 91% vs
       | 49% for GPT 4.5 at 128k context length.
        
         | Workaccount2 wrote:
          | Google has the upper hand here because they are not
          | dependent on Nvidia for hardware. They make and use their
          | own AI accelerators.
        
         | alexdzm wrote:
          | Keen to hear more about this benchmark. Is it representative
          | of chat-to-document style use cases with big docs?
        
           | sebzim4500 wrote:
           | Looks like it's this benchmark [1]. It's certainly less
           | artificial than most long context benchmarks (that are
           | basically just a big lookup table) but probably not as
           | representative as Fiction.LiveBench [2], which asks specific
           | questions about works of fanfiction (which are typically
           | excluded from training sets because they are basically porn).
           | 
           | [1] https://arxiv.org/pdf/2409.12640
           | 
           | [2] https://fiction.live/stories/Fiction-liveBench-
           | Feb-20-2025/o...
        
             | swyx wrote:
             | "MRCR (multi-round coreference resolution)" for those
             | looking for the link to Michaelangelo
        
       | falcor84 wrote:
       | I'm most impressed by the improvement on Aider Polyglot; I wasn't
       | expecting it to get saturated so quickly.
       | 
       | I'll be looking to see whether Google would be able to use this
       | model (or an adapted version) to tackle ARC-AGI 2.
        
       | serjester wrote:
       | I wish they'd mention pricing - it's hard to seriously benchmark
       | models when you have no idea what putting it in production would
       | actually cost.
        
         | impulser_ wrote:
         | It's experimental. You shouldn't be using it in production.
        
           | chipgap98 wrote:
           | > This will mark the first experimental model with higher
           | rate limits + billing. Excited for this to land and for folks
           | to really put the model through the paces! From
           | https://x.com/OfficialLoganK/status/1904583353954882046
        
       | Oras wrote:
       | These announcements have started to look like a template.
       | 
       | - Our state-of-the-art model.
       | 
       | - Benchmarks comparing to X,Y,Z.
       | 
       | - "Better" reasoning.
       | 
       | It might be an excellent model, but reading the exact text
       | repeatedly is taking the excitement away.
        
         | Mistletoe wrote:
         | I'm sure the AI helps write the announcements.
        
         | ototot wrote:
         | Maybe they just asked Gemini 2.5 to write the announcement.
        
           | cpeterso wrote:
           | And it was trained on the previous announcements.
        
             | xlbuttplug2 wrote:
             | ... which were also written by earlier Gemini versions.
        
               | belter wrote:
               | LLMs all the way down
        
               | cratermoon wrote:
               | Not all the way. At the bottom are a bunch of unpaid
               | writers and artists and a horde of low-paid mturk workers
               | in Nigeria.
        
               | rvnx wrote:
               | which was written by ChatGPT3.5
        
         | andai wrote:
         | Well hey, OpenAI did the exact opposite, and nobody liked that
         | either.
        
           | Oras wrote:
           | How did you measure "nobody" liked OpenAI announcements?
        
           | exitb wrote:
           | I think people were fine with OpenAI demos. They were less
           | fine with not actually ever releasing the demoed tech.
        
           | emptysongglass wrote:
           | I liked their announcements and demos and continue to like
           | them.
        
         | bhouston wrote:
         | > It might be an excellent model, but reading the exact text
         | repeatedly is taking the excitement away.
         | 
          | This is the commodification of models. There is nothing
          | special about the new models except that they perform
          | better on the benchmarks.
         | 
         | They are all interchangeable. This is great for users as it
         | adds to price pressure.
        
           | flir wrote:
           | Man, I hope those benchmarks actually measure something.
        
             | Legend2440 wrote:
             | I would say they are a fairly good measure of how well the
             | model has integrated information from pretraining.
             | 
             | They are not so good at measuring reasoning, out-of-domain
             | performance, or creativity.
        
           | Workaccount2 wrote:
           | Sooner or later someone is going to find "secret sauce" that
           | provides a step-up in capability, and it will be closely
           | guarded by whoever finds it.
           | 
           | As big players look to start monetizing, they are going to
           | desperately be searching for moats.
        
             | bangaladore wrote:
              | Reasoning was supposed to be that for "Open" AI;
              | that's why they go to such lengths to hide the
              | reasoning output. Look how that turned out.
              | 
              | Right now, in my opinion, OpenAI actually has a useful
              | deep research feature which I've found nobody else
              | matches. But there is no moat to be seen there.
        
               | CamperBob2 wrote:
               | If you've seen DeepSeek R1's <think> output, you'll
               | understand why OpenAI hides their own. It can be pretty
               | "unsafe" relative to their squeaky-clean public image.
        
             | cratermoon wrote:
             | Sooner or later someone is going to find the "secret sauce"
             | that allows building a stepladder tall enough to reach the
             | moon.
             | 
             | It's called the "first step fallacy", and AI hype believers
             | continue to fall for it.
        
         | diego_sandoval wrote:
         | I take this as a good thing, because they're beating each other
         | every few weeks and using benchmarks as evidence.
         | 
         | If these companies start failing to beat the competition, then
         | we should prepare ourselves for very creative writing in the
         | announcements.
        
           | gtirloni wrote:
           | The improvements have been marginal at best. I wouldn't call
           | that beating.
        
         | devsda wrote:
          | Reminds me of how nobody is too excited about flagship
          | mobile launches anymore. Most flagships for some time now
          | have been incremental updates over the previous gen, only
          | marginally better. Couple that with Chinese OEMs launching
          | better or good-enough devices at a lower price point, and
          | new launches from established players are not noteworthy
          | anymore.
         | 
         | It's interesting how the recent AI announcements are following
         | the same trend over a smaller timeframe.
        
           | flakiness wrote:
            | Phones are limited by hardware manufacturing, plus maybe
            | the annual shopping cycle peaking at Christmas. People
            | didn't buy multiple iPhones a year even in the iPhone's
            | heyday.
            | 
            | These LLM models were supposedly limited by the training
            | run, but these point-version models are mostly post-
            | training driven, which seems to take less time.
            | 
            | If models were tied to specific hardware (say, an "AI
            | PC" or whatever) the cycle would get slower and we'd get
            | a slower summer, which I'm secretly wishing for.
        
           | breppp wrote:
            | I think the greatest issue with buying a new phone today
            | is, ironically, the seamless migration.
            | 
            | Once you get all your apps, wallpaper, shortcut order,
            | and the same OS, you very quickly get the feeling you
            | spent $1,000 for the exact same thing.
        
             | atonse wrote:
             | 100% agree with you.
             | 
              | It needs to be seamless to remove any friction from
              | the purchase, but at the same time, if it feels the
              | same, then we feel like we wasted money.
             | 
             | So what I usually do is buy a different colored phone and
             | change the wallpaper.
             | 
             | My MacBook was the same. Seamless transition and 2 hours
             | later I was used to the new m4 speeds.
        
         | schainks wrote:
          | I wish I wish I wish Google put better marketing into
          | these releases. I've moved entire workflows to Gemini
          | because it's just _way_ better than what OpenAI has to
          | offer, especially for the money.
         | 
          | Also, I think Google's winning the race on actually integrating
         | the AI to do useful things. The agent demo from OpenAI is
         | interesting, but frankly, I don't care to watch the machine use
         | my computer. A real virtual assistant can browse the web
         | headless and pick flights or food for me. That's the real
         | workflow unlock, IMO.
        
         | belter wrote:
          | Cancelled my account a long time ago. Gemini models are
          | like a McDonald's croissant: you always give them an extra
          | chance, but they always fall apart in your hands...
        
         | tibbar wrote:
         | For me, the most exciting part is the improved long-context
         | performance. A lot of enterprise/RAG applications rely on
         | synthesizing a bunch of possibly relevant data. Let's just say
         | it's clearly a bottleneck in current models and I would expect
         | to see a meaningful % improvement in various internal
         | applications if long-context reasoning is up. Gemini was
         | already one of my favorite models for this usecase.
         | 
         | So, I think these results are very interesting, if you know
         | what features specifically you are using.
        
           | zwaps wrote:
            | But they score it on their own benchmark, on which,
            | coincidentally, Gemini models were always the only good
            | ones. On NoLiMa or BABILong we see that Gemini models
            | still can't do long context.
            | 
            | Excited to see if it works this time.
        
         | taf2 wrote:
          | We have incrementally improved: 1% better than we were
          | yesterday. Our competition is 1 day behind us now.
        
         | kh_hk wrote:
          | Was going to comment the same thing; it has been bugging
          | me lately that all these announcements start with "our"
          | followed by empty superlatives. Happy to not be alone on
          | this!
        
         | cratermoon wrote:
         | Glaringly missing from the announcements: concrete use cases
         | and products.
         | 
         | The Achilles heel of LLMs is the distinct lack of practical
         | real-world applications. Yes, Google and Microsoft have been
         | shoving the tech into everything they can fit, but that doesn't
         | a product make.
        
           | sebzim4500 wrote:
           | ChatGPT has like 500M weekly active users, what are you on
           | about?
        
       | jnd0 wrote:
       | > with Gemini 2.5, we've achieved a new level of performance by
       | combining a significantly enhanced base model with improved post-
       | training. Going forward, we're building these thinking
       | capabilities directly into all of our models, so they can handle
       | more complex problems and support even more capable, context-
       | aware agents.
       | 
        | Been playing around with it and it feels intelligent and up
        | to date. Plus, it is connected to the internet. It's a
        | reasoning model by default when it needs to be.
        | 
        | I hope they enable support for the recently released canvas
        | mode for this model soon; it would be a good match.
        
         | Workaccount2 wrote:
          | It is almost certainly the "nebula" model on LMArena that
          | has been generating buzz for the last few days. I didn't
          | test coding, but its reasoning is very strong.
        
       | joelthelion wrote:
       | Is this model going to be restricted to paying users?
        
       | vivzkestrel wrote:
        | "Hi, here is our new AI model; it performs task A x% better
        | than competitor 1 and task B y% better than competitor 2"
        | seems to be the new hot AI template in town.
        
       | andai wrote:
        | Can anyone share what they're doing with reasoning models?
        | They seem to only make a difference with novel programming
        | problems, like Advent of Code. So this model will help solve
        | slightly harder Advent of Code problems.
       | 
       | By extension it should also be slightly more helpful for
       | research, R&D?
        
         | bravura wrote:
         | We're using it to RCA infrastructure incidents.
        
           | fn-mote wrote:
           | Seriously? That doesn't require a human?! Are we talking
           | about some kind of "generic" incident? (Type 3: forgot to
           | manually update the xxxx file.) Or what's going on?
        
         | liuliu wrote:
          | I've found reasoning models are much more faithful at
          | text-related tasks too (e.g. 1. translating long key-value
          | files such as Localizable.strings; 2. fixing and verifying
          | long transcripts; 3. examining and fixing CSV/tabular
          | data), probably due to the reflection mechanism built into
          | these models. Prompts such as "check your output to make
          | sure it covers everything in the input" let the model
          | double-check its work, avoiding more manual checks on my
          | end.
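          | 
          | A minimal sketch of that pattern, assuming an OpenAI-
          | compatible client; the model name here is a placeholder,
          | not a real endpoint:
          | 
          |   # Sketch of the "self-check" prompt pattern described
          |   # above; the model name is a placeholder.
          |   from openai import OpenAI
          | 
          |   client = OpenAI()
          | 
          |   def fix_transcript(transcript: str) -> str:
          |       prompt = (
          |           "Fix punctuation and obvious transcription "
          |           "errors in this transcript. Check your output "
          |           "to make sure it covers everything in the "
          |           "input.\n\n" + transcript
          |       )
          |       resp = client.chat.completions.create(
          |           model="some-reasoning-model",  # placeholder
          |           messages=[{"role": "user", "content": prompt}],
          |       )
          |       return resp.choices[0].message.content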
        
         | mgens wrote:
         | Have been using them for non-interactive coding where latency
         | is not an issue. Specifically, turning a set of many free-text
         | requirements into SQL statements, so that later when an item's
         | data is entered into the system, we can efficiently find which
         | requirements it meets. The reasoning models' output quality is
          | much better than non-reasoning models like 3.5 Sonnet;
          | it's not a subtle difference.
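          | 
          | A sketch of that pipeline; the schema, model name, and
          | prompt below are made up for illustration, not the actual
          | setup:
          | 
          |   # Hypothetical schema and placeholder model name.
          |   from openai import OpenAI
          | 
          |   client = OpenAI()
          | 
          |   SCHEMA = "items(id, category, weight_kg, price_usd)"
          | 
          |   def requirement_to_sql(req: str) -> str:
          |       prompt = (
          |           f"Schema: {SCHEMA}\n"
          |           "Rewrite this free-text requirement as a SQL "
          |           "WHERE clause over that schema. Return only "
          |           f"the SQL.\nRequirement: {req}"
          |       )
          |       resp = client.chat.completions.create(
          |           model="some-reasoning-model",  # placeholder
          |           messages=[{"role": "user", "content": prompt}],
          |       )
          |       return resp.choices[0].message.content
          | 
          |   # Generated once, offline; each incoming item is then
          |   # matched against the stored clauses with normal queries.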
        
       | throwaway13337 wrote:
       | Google has this habit of 'releasing' without releasing AI models.
       | This looks to be the same?
       | 
       | I don't see it on the API price list:
       | 
       | https://ai.google.dev/gemini-api/docs/pricing
       | 
       | I can imagine that it's not so interesting to most of us until we
       | can try it with cursor.
       | 
        | I look forward to doing so when it's out. The Aider benchmark
        | result, combined with the speed and long context window their
        | other models are known for, could be a great mix. But we'll
        | have to wait and see.
       | 
        | More generally, it would be nice for these kinds of releases
        | to also add speed and context window as a separate benchmark,
        | or somehow include them in the score. A model that is 90% as
        | good as the best but 10x faster is quite a bit more useful.
       | 
       | These might be hard to mix to an overall score but they're
       | critical for understanding usefulness.
        
         | minimaxir wrote:
         | It's available now as an option in Google AI Studio and Google
         | Gemini.
        
           | ototot wrote:
           | and the price is 0.0 usd, lol
        
           | throwaway13337 wrote:
            | Thanks. I think my post lacked clarity about what I was
            | talking about. I meant that most people care about API
            | access to use with their favorite editor. It's a big
            | limiter with Grok, for example.
           | 
            | But I did conflate that with my knowledge of Google's
            | history of releasing without releasing these models
            | which, as you point out, isn't true with this release.
        
           | kmod wrote:
           | It's "experimental", which means that it is not fully
           | released. In particular, the "experimental" tag means that it
           | is subject to a different privacy policy and that they
           | reserve the right to train on your prompts.
           | 
           | 2.0 Pro is also still "experimental" so I agree with GP that
           | it's pretty odd that they are "releasing" the next version
           | despite never having gotten to fully releasing the previous
           | version.
        
       | barrenko wrote:
       | The incumbent has awoken.
        
       | andai wrote:
       | How does Gemini have such a big context window?
       | 
       | I thought memory requirement grows exponentially with context
       | size?
        
         | mmoskal wrote:
          | Memory grows linearly; compute grows quadratically (but
          | with a small constant: until ~100k context, inference will
          | still be dominated by non-quadratic factors).
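          | 
          | A back-of-the-envelope illustration (the layer/head sizes
          | below are made up, not any particular model's):
          | 
          |   # KV cache is linear in context; attention score
          |   # matmul is quadratic. Dimensions are invented.
          |   layers, heads, head_dim = 48, 8, 128
          |   bytes_per_val = 2  # fp16
          | 
          |   def kv_cache_bytes(ctx):
          |       # K and V per layer per token
          |       return (2 * layers * heads * head_dim
          |               * bytes_per_val * ctx)
          | 
          |   def attn_score_flops(ctx):
          |       # QK^T alone, constants omitted
          |       return 2 * layers * heads * head_dim * ctx * ctx
          | 
          |   for ctx in (8_000, 128_000, 1_000_000):
          |       print(ctx, kv_cache_bytes(ctx) / 1e9, "GB,",
          |             attn_score_flops(ctx) / 1e12, "TFLOPs")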
        
           | xiphias2 wrote:
            | Also, reusing keys/values for different queries can
            | compress the KV cache; it can be a 1000x or 10000x
            | improvement in bandwidth if the model is trained for it.
        
             | mmoskal wrote:
             | Just to clarify: simple prefix KV cache doesn't require any
             | special model training. It does require the inference
             | framework to support it, but most do by now.
             | 
             | You can see dramatic improvements in latency and throughput
             | if there is a large shared prefix of the queries.
        
         | ototot wrote:
          | The Gemini 1.5 tech report does reference some papers
          | about supporting a large context window.
        
         | throitallaway wrote:
         | Funnyish story: the other night I asked my Pixel 9 to generate
         | an image via Gemini, then I asked it to make a change. It
         | didn't consider the previous context, so I asked it "Are you
         | capable of keeping context?" No matter how clearly I enunciated
         | "context", it always interpreted what I was saying as
         | "contacts." After the 4th try, I said "context, spelled "c-o-n-
         | t-e-x-t" and it replied with "Ah, you meant context! Yes..."
         | 
         | This stuff has a long way to go.
        
           | andai wrote:
           | I noticed Gemini Flash 2.0 making a lot of phonetic typos
           | like that, yeah. Like instead of Basal Ganglia it said Basil
           | Ganglia.
           | 
           | I've also had it switch languages in the middle of output...
           | like one word in the middle of a sentence was randomly output
           | in some strange hieroglyphs, but when I translated them, it
           | was the right word and the sentence made sense.
        
           | dcchambers wrote:
           | I was using the conversational feature of Gemini on my phone
           | the other night and was trying to get it to read a blog post
           | to me. The AI proceeded to tell me (out loud, via voice
           | mode/speech synthesis) that it was a text based model and
           | couldn't read text out loud.
           | 
           | For as amazing as these things are, AGI they are not.
        
             | vanviegen wrote:
            | In its defense: it probably _is_ just a text model that
            | hasn't been told that its output is being read to the
            | user.
        
           | Workaccount2 wrote:
            | I think Google is digging a hole for themselves by
            | making their lightweight models the most used. Regardless
            | of what their heavyweight models can do, people will
            | naturally associate them with the search model or
            | assistant model.
        
         | semiinfinitely wrote:
         | TPUs have a network topology better suited for long context
         | than gpus: https://jax-ml.github.io/scaling-book/tpus/#tpu-
         | networking
         | 
         | > This nearest-neighbor connectivity is a key difference
         | between TPUs and GPUs. GPUs connect up to 256 H100s in an all-
         | to-all configuration (called a node), rather than using local
         | connections. On the one hand, that means GPUs can send
         | arbitrary data within a node in a single low-latency hop. On
         | the other hand, TPUs are dramatically cheaper and simpler to
         | wire together, and can scale to much larger topologies because
         | the number of links per device is constant.
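          | 
          | A rough illustration of the constant-links-per-device
          | point (wiring counts only, ignoring bandwidth and
          | latency):
          | 
          |   # All-to-all needs O(N^2) links; a 3D torus has 6
          |   # neighbors per device, each link shared by two.
          |   def all_to_all_links(n):
          |       return n * (n - 1) // 2
          | 
          |   def torus_3d_links(n):
          |       return 3 * n
          | 
          |   for n in (256, 4096, 65536):
          |       print(n, all_to_all_links(n), torus_3d_links(n))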
        
         | swyx wrote:
         | https://supaiku.com/attention-is-logarithmic
        
       | og_kalu wrote:
        | From the 2.0 line on, the Gemini models have been far better
        | than GPT and Claude at engineering-type questions (fluids,
        | etc.), especially questions with images that require more
        | than just grabbing text. This one is even better.
        
       | andrewinardeer wrote:
       | Google is overly cautious with their guardrails.
       | 
       | Granted, Gemini answers it now, however, this one left me shaking
       | my head.
       | 
       | https://cdn.horizon.pics/PzkqfxGLqU.jpg
        
         | slongfield wrote:
         | For better or worse, Google gets more bad press when their
         | models get things wrong compared to smaller AI labs.
        
         | rfoo wrote:
         | Ha, I still remember that super hilarious "You are under 18, so
         | you should not write C++, as it is unsafe..." log from ... a
         | year ago?
        
         | summerlight wrote:
         | Looks like they're gradually removing guardrails, it returns
         | Nixon for me.
        
       | daquisu wrote:
       | Weird, they released Gemini 2.5 but I still can't use 2.0 pro
       | with a reasonable rate limit (5 RPM currently).
        
       | noisy_boy wrote:
        | Are Gemini and Bard the same? I asked it a question and it
        | said "... areas where I, as Bard, have..."
        
         | zxilly wrote:
          | There is no point in asking such questions; the model
          | doesn't know what it is on its own, and you could get many
          | different answers if you repeat the question a few more
          | times.
        
       | noisy_boy wrote:
        | Are Gemini and Bard the same? I asked it a question and it
        | said "... areas where I, as Bard, have...."
        
         | guyzero wrote:
         | Normal Google rollout process: Bard is deprecated, Gemini is
         | not ready yet.
        
       | comex wrote:
       | I was recently trying to replicate ClaudePlaysPokemon (which uses
       | Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly
       | getting confused and hallucinating significantly more than
       | Claude, making it unviable (although some of that might be caused
       | by my different setup). I wonder if this new model will do
       | better. But I can't easily test it: for now, even paid users are
       | apparently limited to 50 requests per day [1], which is not
       | really enough when every step in the game is a request. Maybe
       | I'll try it anyway, but really I need to wait for them to
       | "introduce pricing in the coming weeks".
       | 
       | Edit: I did try it anyway and so far the new model is having
        | similar hallucinations. I really need to test my code with
        | Claude 3.7 as a control, to see if it approaches the real
        | ClaudePlaysPokemon's semi-competence.
       | 
       | Edit 2: Here's the log if anyone is curious. For some reason it's
       | letting me make more requests than the stated rate limit. Note
       | how at 11:27:11 it hallucinates on-screen text, and earlier it
       | thinks some random offscreen tile is the stairs. Yes, I'm sure
       | this is the right model: gemini-2.5-pro-exp-03-25.
       | 
       | https://a.qoid.us/20250325/
       | 
       | [1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1
        
         | comex wrote:
         | Update: I tried a different version of the prompt and it's
         | doing really well! Well, so far it's gotten out of its house
         | and into Professor Oak's lab, which is not so impressive
         | compared to ClaudePlaysPokemon, but it's a lot more than Gemini
         | 2.0 was able to do with the same prompt.
        
       | ototot wrote:
       | And OpenAI is announcing their ImageGen in 4o
       | 
       | https://news.ycombinator.com/item?id=43474112
        
       | malisper wrote:
       | I've been using a math puzzle as a way to benchmark the different
       | models. The math puzzle took me ~3 days to solve with a computer.
       | A math major I know took about a day to solve it by hand.
       | 
       | Gemini 2.5 is the first model I tested that was able to solve it
       | and it one-shotted it. I think it's not an exaggeration to say
       | LLMs are now better than 95+% of the population at mathematical
       | reasoning.
       | 
        | For those curious, the riddle is: there are three people in a
        | circle. Each person has a positive integer floating above their
       | heads, such that each person can see the other two numbers but
       | not his own. The sum of two of the numbers is equal to the third.
       | The first person is asked for his number, and he says that he
       | doesn't know. The second person is asked for his number, and he
       | says that he doesn't know. The third person is asked for his
       | number, and he says that he doesn't know. Then, the first person
       | is asked for his number again, and he says: 65. What is the
       | product of the three numbers?
        
         | bbstats wrote:
         | Could you explain "The sum of two of the numbers is equal to
         | the third"??
        
           | malisper wrote:
           | if the three numbers are a, b, and c, then either a+b=c,
           | a+c=b, or b+c=a
        
             | bena wrote:
             | And they must all be positive integers.
             | 
             | So A + B = C and A + C = B. But we know that A + B = C, so
             | we can replace C with (A + B). So we know that A + A + B =
             | B.
             | 
             | So 2A + B = B. Or 2A = 0.
             | 
             | And this holds any way you slice it.
             | 
             | Even if you were to try and brute force it.
             | 
             | A = 1
             | 
             | B = 2
             | 
             | Then C = 3. But A + C has to equal B. That's 1 + 3 = 2?
             | That's not true.
             | 
             | I don't see a case where you can add to the sum of two
             | numbers one of the numbers and get the other number.
             | 
             | I'm guessing that's a misreading of the problem. Because it
             | looks like the third number is the sum of the first two.
        
               | refulgentis wrote:
                | _One of_ the cases has to be true, not all 3 (as
                | you show, they're mutually exclusive for positive
                | integers); i.e. "either" is important in the parent
                | comment.
        
               | bena wrote:
               | Which is why I indicated that it would be a misreading of
               | the problem.
               | 
               | The original problem is a little ambiguously worded. You
               | could say "one of their numbers is the sum of the other
               | two" and it would be a little clearer.
        
               | thaumasiotes wrote:
               | > The original problem is a little ambiguously worded.
               | 
               | No it isn't. If it said "the sum of any two of the
               | numbers is equal to the third", that would be a
               | contradiction. What it says is "the sum of two of the
               | numbers is equal to the third".
        
           | refulgentis wrote:
           | Given #s x,y, and z, either x + y = z, x + z = y, or y + z =
           | x.
        
           | rappatic wrote:
           | I think:
           | 
           | Call the three numbers a, b, and c. This means c = a + b, but
           | we still don't know to which person each number belongs.
           | 
           | When person 1 (p1) is asked what his number is, he has no way
           | to know whether he has a, b, or c, so he says he doesn't
           | know. Same goes for p2 and p3. Clearly p1 somehow gains
           | information by p2 and p3 passing. Either he realizes that he
           | must be either a or b, and such his number is the difference
           | between p2 and p3's numbers, or he realizes that he must be c
           | and so his number is the sum of p2 and p3's numbers.
           | 
           | That's all I have so far. Anyone have other ideas?
        
             | aardvarkr wrote:
             | I think it has something to do with applying the lower
             | bound of 1.
             | 
             | If p1 KNOWS that he's the largest then he has to have
             | gained some other piece of information. Say the numbers he
             | sees are 32 and 33. His number would have to be either 1 or
             | 65. If p1 was 1 then the other two would have known p1
              | couldn't be the sum of the other two.
        
               | oezi wrote:
               | But p2 and p3 don't yet know what they are themselves
               | just because they see a 1:
               | 
               | If p2 sees 1 and 33, s/he would wonder if s/he is 32 or
               | 34.
               | 
               | P3 would consider 31 or 33.
        
             | bena wrote:
             | The answer is online and it's clever.
             | 
             | P1 knows that P2 and P3 are not equal. So they know that
             | the set isn't [2A, A, A].
             | 
             | P2 knows that P1 and P3 are not equal. So they know that
             | the set isn't [A, 2A, A]. They also know that if P1 doesn't
             | know, then they were able to make the same deduction. So
             | they now know that both [2A, A, A] and [A, 2A, A] aren't
             | correct. Since they know that [2A, A, A] isn't correct,
             | they can also know that [2A, 3A, A] isn't correct either.
             | Because they'd be able to see if P1 = 2A and P3 = A, and if
             | that were true and P1 doesn't know their number, it would
             | have to be because P2 isn't A. And if P2 isn't A, they'd
             | have to be 3A.
             | 
             | P3 knows that P1 and P2 aren't equal. Eliminates [A, A,
             | 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A],
             | are eliminated. Using the same process as P2, they can
             | eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A].
             | Because they can see the numbers and they know if P1 is 2A
             | and P2 is 3A.
             | 
             | Now we're back at P1. Who now knows.
             | 
             | So P2 and P3 are in the eliminated sets. Which means we're
             | one of these
             | 
             | [2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A,
             | 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
             | 
             | We know his number is 65. To find the set, we can factor
             | 65: (5 * 13). We can check the other numbers 2(13) = 26.
             | 3(13) = 39. And technically, you don't need to find the
             | other numbers. The final answer is 5A * 2A * 3A or (A^3) *
             | 30.
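              | 
              | This case analysis can also be checked by brute force.
              | A minimal sketch (the turn order and the search bound
              | are my assumptions from the puzzle statement):
              | 
              |   from functools import lru_cache
              | 
              |   def candidates(t, i):
              |       # values person i considers possible
              |       a, b = (t[j] for j in range(3) if j != i)
              |       c = {a + b}
              |       if a != b:
              |           c.add(abs(a - b))
              |       return c
              | 
              |   @lru_cache(maxsize=None)
              |   def consistent(t, turns):
              |       # t yields "don't know" on the first `turns`
              |       # turns; speakers go 0, 1, 2, 0, ...
              |       return all(not knows(t, k % 3, k)
              |                  for k in range(turns))
              | 
              |   @lru_cache(maxsize=None)
              |   def knows(t, p, turn):
              |       # p knows iff exactly one candidate value
              |       # survives the history of "don't know"s
              |       ok = [v for v in candidates(t, p)
              |             if consistent(tuple(v if j == p else t[j]
              |                                 for j in range(3)),
              |                           turn)]
              |       return len(ok) == 1
              | 
              |   for b in range(1, 200):
              |       for c in range(1, 200):
              |           t = (65, b, c)
              |           if ((65 == b + c or b == 65 + c
              |                or c == 65 + b)
              |                   and consistent(t, 3)
              |                   and knows(t, 0, 3)):
              |               # expect 26 and 39 (either order),
              |               # product 65910
              |               print(t, 65 * b * c)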
        
               | byearthithatius wrote:
               | "Which means we're one of these [2A, A, A]; [3A, 2A, A];
               | [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A,
               | 3A, 5A]"
               | 
               | Why? Couldn't it be an infinite number of 3 size arrays
               | comprised of A where two elements sum to the third? [24A,
               | 13A, 11A]? How did we deduce this set of arrays?
               | 
               | EDIT: Solved from another reddit comment. Tuples without
               | a common factor like the one above are considered as a=1.
               | 
               | "They're not eliminated; they correspond to a = 1."
        
         | g105b wrote:
        | Please can you enlighten me? I'm a mathematical plebeian.
        
         | sebzim4500 wrote:
         | This is a great riddle. Unfortunately, I was easily able to
         | find the exact question with a solution (albeit with a
         | different number) online, thus it will have been in the
         | training set.
        
           | varispeed wrote:
           | Seems like we might need a section of internet that is off
           | limits to robots.
        
             | Centigonal wrote:
             | everyone with limited bandwidth has been trying to limit
             | site access to robots. the latest generation of AI web
             | scrapers are _brutal_ and do not respect robots.txt
        
               | varispeed wrote:
                | There are websites where you can only register in
                | person and have two existing members vouch for you.
                | It can probably still be gamed, but it sounds like a
                | great barrier to entry for robots (for now).
        
               | tmpz22 wrote:
               | What prevents someone from getting access and then
               | running an authenticated headless browser to scoop the
               | data?
        
               | varispeed wrote:
                | Admins will see unusual traffic from that account
                | and then take action. Of course it will not be
                | perfect, as there could be a way to mimic human
                | traffic and slowly scrape the data anyway; that's
                | why there is an element of trust (two existing
                | members to vouch).
        
             | baq wrote:
                | It's here and it's called Discord.
        
               | Zandikar wrote:
                | I have bad news for you if you think non-paywalled /
                | non-phone#-required Discord communities are immune
                | to AI scraping, especially as it costs less than
                | hammering traditional websites: the push-on-change
                | event is done for you in real-time chat contexts.
                | 
                | Especially as the company archives all those chats
                | (not sure for how long) and is small enough that a
                | billion-dollar "data sharing" agreement would be a
                | very enticing offer.
                | 
                | If there isn't a significant barrier to access, it's
                | being scraped. And if that barrier is money, it's
                | being scraped, but less often.
        
               | Davidzheng wrote:
                | Honestly, someone should scrape the algebraic
                | topology Discord for AI; it'll be a nice training
                | set.
        
             | kylebenzle wrote:
              | Or we could just accept that LLMs can only output
              | what we have put in, and that calling them "AI" was a
              | misnomer from day one.
        
           | Workaccount2 wrote:
            | What makes this interesting is that while the question
            | is online (on Reddit, from 10 years ago), other models
            | don't get the answer right. Gemini also shows its work,
            | and it seems to do a few orders of magnitude more
            | calculation than the elegant answer given on Reddit.
            | 
            | Granted this is all way over my head, but the solution
            | Gemini comes to matches the one given on Reddit (and now
            | here, in future training runs):
            | 
            | 65 x 26 x 39 = 65910
        
             | sebzim4500 wrote:
              | > Gemini also shows its work and it seems to do a few
              | orders of magnitude more calculation than the elegant
              | answer given on Reddit.
              | 
              | I don't think Gemini does an unnecessary amount of
              | computation; it's just more verbose. This is typical
              | of reasoning models: almost every step is necessary,
              | but many would not be written down by a human.
        
         | hmottestad wrote:
         | Is the answer somehow {65, 20, 45} with the product 58,500?
         | 
         | That's one-shot for o1 pro.
        
           | sebzim4500 wrote:
            | That's wrong. From player 1's perspective, {25, 20, 45}
            | is entirely consistent with the calls made in the first
            | three rounds.
        
         | refulgentis wrote:
          | In general I find commentary here _too_ negative on AI,
          | but I'm a bit squeamish about maximalist claims re: AI
          | mathematical reasoning vs. the human population based off
          | this, even setting aside lottery-ticket-hypothesis-like
          | concerns.
          | 
          | It's a common logic puzzle; Google can't turn up an
          | _exact_ match to the wording you have, but see e.g.:
          | https://www.futilitycloset.com/2018/03/03/three-hat-problem/
        
         | TrackerFF wrote:
         | The riddle certainly nerd-sniped GPT 4.5
         | 
          | After a couple of minutes it decided on the answer being
          | 65000. (S = {65, 40, 25})
        
         | semiinfinitely wrote:
         | I love how the entire comment section is getting one-shotted by
         | your math riddle instead of the original post topic.
        
         | hmottestad wrote:
         | This looks like it's been posted on Reddit 10 years ago:
         | 
         | https://www.reddit.com/r/math/comments/32m611/logic_question...
         | 
         | So it's likely that it's part of the training data by now.
        
           | canucker2016 wrote:
           | You'd think so, but both Google's AI Overview and Bing's
           | CoPilot output wrong answers.
           | 
           | Google spits out: "The product of the three numbers is 10,225
           | (65 * 20 * 8). The three numbers are 65, 20, and 8."
           | 
           | Whoa. Math is not AI's strong suit...
           | 
           | Bing spits out: "The solution to the three people in a circle
           | puzzle is that all three people are wearing red hats."
           | 
           | Hats???
           | 
           | Same text was used for both prompts (all the text after 'For
           | those curious the riddle is:' in the GP comment), so Bing
           | just goes off the rails.
        
             | moritzwarhier wrote:
              | That's a non-sequitur; they would be stupid to run an
              | expensive _L_LM for every search query. This post is
              | not about Google Search being replaced by Gemini 2.5
              | and/or a chatbot.
        
               | canucker2016 wrote:
               | Google placed its "AI overview" answer at the top of the
               | page.
               | 
               | The second result is this reddit.com answer, https://www.
               | reddit.com/r/math/comments/32m611/logic_question...,
               | where at least the numbers make sense. I haven't examined
               | the logic portion of the answer.
               | 
               | Bing doesn't list any reddit posts (that Google-exclusive
               | deal) so I'll assume no stackexchange-related sites have
               | an appropriate answer (or bing is only looking for hat-
               | related answers for some reason).
        
               | moritzwarhier wrote:
                | I might have been phrasing poorly. With _L_ (or _L_
                | as intended), I meant their state-of-the-art model,
                | which I presume Gemini 2.5 is (didn't come around to
                | TFA yet). Not sure if this question is just about
                | model size.
               | 
               | I'm eagerly awaiting an article about RAG caching
               | strategies though!
        
               | michaelt wrote:
               | Yes, putting an expensive LLM response atop each search
               | query would be quite stupid.
               | 
               | You know what would be even stupider? Putting a _cheap,
               | wrong_ LLM response atop each search query.
        
             | vicek22 wrote:
              | The riddle has different variants with hats:
              | https://erdos.sdslabs.co/problems/5
        
           | malisper wrote:
           | Other models aren't able to solve it so there's something
           | else happening besides it being in the training data. You can
           | also vary the problem and give it a number like 85 instead of
           | 65 and Gemini is still able to properly reason through the
           | problem
        
             | lolinder wrote:
             | I'm sure you're right that it's more than just it being in
             | the training data, but that it's in the training data means
             | that you can't draw any conclusions about general
             | mathematical ability using just this as a benchmark, even
             | if you substitute numbers.
             | 
             | There are lots of possible mechanisms by which this
             | particular problem would become more prominent in the
             | weights in a given round of training even if the model
             | itself hasn't actually gotten any better at general
             | reasoning. Here are a few:
             | 
             | * Random chance (these are still statistical machines after
             | all)
             | 
             | * The problem resurfaced recently and shows up more often
             | than it used to.
             | 
             | * The particular set of RLHF data chosen for this model
             | draws out the weights associated with this problem in a way
             | that wasn't true previously.
        
             | 7e wrote:
             | Sure, but you can't cite this puzzle as proof that this
             | model is "better than 95+% of the population at
             | mathematical reasoning" when the method of solving (the
             | "answer") it is online, and the model has surely seen it.
        
             | mattkevan wrote:
             | I think there's a big push to train LLMs on maths problems
             | - I used to get spammed on Reddit with ads for data tagging
             | and annotation jobs.
             | 
              | Recently these have stopped, and now the ads are
              | about becoming a maths tutor for AI.
             | 
             | Doesn't seem like a role with long-term prospects.
        
             | stabbles wrote:
             | It gets it wrong when you give it 728. It claims (728, 182,
             | 546). I won't share the answer so it won't appear in the
             | next training set.
        
           | thaumasiotes wrote:
           | > So it's likely that it's part of the training data by now.
           | 
           | I don't think this means what you think it means.
           | 
           | I did some interacting with the Tencent model that showed up
           | here a couple days ago (
           | https://news.ycombinator.com/item?id=43447254 ):
           | 
            | > Do you know the British band Steeleye Span?
            | 
            | >> Yes, I know the legendary British folk-rock band
            | Steeleye Span! They were one of the key representatives
            | of the 1970s British folk revival movement, famous for a
            | distinctive style fusing traditional English folk with
            | rock instrumentation (electric guitar, drums). Core
            | members include Maddy Prior (the iconic female vocalist)
            | and Tim Hart (multi-instrumentalist); the later addition
            | of Nigel Pegrum (legendary drummer) also added weight to
            | their sound.
            | 
            | >> Their classic works include:
            | 
            | >> - <<Gaudete>>: a stunning adaptation of a medieval
            | Latin hymn that once climbed to No. 14 on the UK singles
            | chart;
            | 
            | >> - <<All Around My Hat>>: the lead single of the album
            | of the same name, which became a folk-rock benchmark;
            | 
            | >> - <<Cam Ye O'er Frae France>>: a playful
            | interpretation of British history.
            | 
            | >> The band's lineup has changed frequently, but they
            | have always kept up a creative reworking of traditional
            | music. In recent years Maddy Prior has remained active
            | in the music world. If you want to feel their wild stage
            | energy, I recommend starting with the golden-era work of
            | the 1970s! Would you like specific playlist
            | recommendations or background stories? [happy blushing
            | emoji]
            | 
            | > I like their song The Victory. How would you describe
            | it?
            | 
            | >> The <<The Victory>> you mention may refer to one of
            | Steeleye Span's works, but there is no record of a song
            | with that title on their official albums.
            | 
            | >> [further suggestions omitted]
            | 
            | > Huh? Doesn't Steeleye Span's Storm Force Ten include
            | that song?
            | 
            | >> The <<Storm Force Ten>> you mention is Steeleye
            | Span's tenth studio album, released in 1975. The album
            | continues the band's style of fusing traditional English
            | folk with hard rock, while working in Celtic, medieval
            | and narrative elements; the whole is full of vitality
            | and dramatic tension. Below is the complete track list
            | with an analysis of highlights:
            | 
            | >> Album tracks
            | 
            | >> 1. <<Roll Right Stones>> - inspired by the legend of
            | the stone circle; the opening flute and electric guitar
            | weave a mysterious atmosphere, and Maddy Prior's voice
            | chants like an epic.
            | 
            | >> 2. <<Let Her Go Down>> - a folk foundation over a
            | rock pulse; the lyrics hint at seafaring peril and the
            | struggle against fate, with drumming as dense as surging
            | waves.
            | 
            | >> 3. <<Hark! The Heron Voiced>> - a medieval-style
            | narrative song depicting a dialogue between nature and
            | spirit; the sense of dialogue between flute and fiddle
            | is extremely strong.
            | 
            | >> 4. <<The Brisk Young Widow>> - a brisk dance rhythm
            | telling the humorous tale of a widow and a rake, showing
            | the band's comic side.
            | 
            | >> 5. <<The Storm Force Ten>> - the title track; a storm
            | as a metaphor for conflict in love; the guitar solo
            | tears like lightning, and the climax is explosively
            | powerful.
            | 
            | >> 6. <<The Green Knight>> - drawn from Arthurian
            | legend, on the theme of knight versus nature; organ and
            | guitar timbres conjure an ancient battlefield.
            | 
            | >> 7. <<The Blacksmith>> - adapted from a traditional
            | folk song; the heavy-rock arrangement turns the
            | blacksmith's hammer blows into the rhythmic bedrock, and
            | Maddy's improvised high notes dazzle.
            | 
            | >> 8. <<Cam Ye O'er Frae France>> - a political satire
            | mocking British royal history; the dueling twin guitars
            | clash like blades.
            | 
            | >> 9. <<The Maid of Llanwellyn>> - a Celtic-style
            | lyrical piece; the string arrangement weeps and sighs,
            | showing the band's tender side.
            | 
            | >> 10. <<The Whisky Jacobites>> - a carnival number
            | celebrating the Scottish spirit of independence; the
            | collision of accordion and rock instruments brims with
            | festive atmosphere.
            | 
            | >> Album highlights
            | 
            | >> - Immersion in history and legend: from King Arthur
            | to the Scottish uprisings, the lyrics read like a
            | miniature epic.
            | 
            | >> - Timbral experimentation: organ, accordion and
            | electric guitar blend into a distinctive "folk heavy
            | metal" feel.
            | 
            | >> - Dynamic contrast: storm-fast rock passages
            | alternate with hushed folk narrative; the tension never
            | lets up.
            | 
            | >> If you love the grandeur of <<The Victory>>, I
            | recommend listening first to <<The Storm Force Ten>> and
            | <<The Green Knight>>; the epic-scale arrangements of
            | these two count as the band's peak work! Would you like
            | listening links or lyric analysis? [sunglasses emoji]
           | 
           | This is a question that obviously was in the training data.
           | How do you get the answer back out of the training data?
        
           | drexlspivey wrote:
           | And if it wasn't, it is now
        
         | ototot wrote:
         | I also tried one-shot.
         | 
         | https://g.co/gemini/share/badd00a824d2
        
         | yifanl wrote:
         | You'd have better results if you had prompted it with the
         | actual answer and asked how the first person came to the
         | conclusion. Giving a number in the training set is very easy.
         | 
         | i.e. You observe three people in a magical room. The first
         | person is standing underneath a 65, the second person is
         | standing underneath a 26 and the third person is standing
         | underneath a 39. They can see the others numbers but not the
         | one they are directly under. You tell them one of the three
         | numbers is the sum of the other two and all numbers are
         | positive integers. You ask the first person for their number,
         | they respond that they don't know. You ask the second person
         | for their number, they respond that they don't know. You ask
         | the third person, they respond that they don't know. You ask
         | the first person again and they respond with the correct value,
         | how did they know?
         | 
         | And of course, if it responds with a verbatim answer in the
         | line of https://www.reddit.com/r/math/comments/32m611/logic_que
         | stion..., we can be pretty confident what's happening under the
         | hood.
        
         | SwayStar123 wrote:
         | You can also put the AI in the first person's shoes. Prompt:
         | You are standing in a circle, there are 2 other people in the
         | circle with you, everyone in the circle, has a positive integer
         | above their head, no one knows what the number above their own
         | head is but can see the numbers above the heads of the other
         | people. You see that the person infront of you on the left has
         | 26 above their head. The person on the right has 39 above their
         | head. You are told that the sum of two of the numbers is the
         | third number. You are asked what the number above your head is,
         | the option is the sum, 65, or 13, as 26 + 13 = 39. You don't
         | know which one it is, and you say so. The second person is
         | asked the number above their head. They also say they dont
         | know, the third person also says they dont know. What is your
         | number?
         | 
         | Gemini 2.5 and claude 3.7 thinking get it right, o3 mini and 4o
         | get it wrong
        
         | adpirz wrote:
         | Interactive playground for the puzzle:
         | https://claude.site/artifacts/832e77d7-5f46-477c-a411-bdad10...
         | 
         | (All state is stored in localStorage so you can come back to it
         | :) ).
        
         | doener wrote:
         | ChatGPT 4.5:
         | 
         | Let's carefully analyze this puzzle step-by-step:
         | 
         | [?]
         | 
         | Step 1: Define the Problem Clearly
         | 
         | We have three people in a circle. Let's call them: * Person A
         | (first person) * Person B (second person) * Person C (third
         | person)
         | 
         | Each has a positive integer above their head. Each sees the
         | other two numbers but not their own. The numbers satisfy: A + B
         | = C \quad \text{or} \quad B + C = A \quad \text{or} \quad C + A
         | = B
         | 
         | Given the puzzle scenario, we need to find out the exact
         | numbers given the clues, and ultimately find the product of the
         | three numbers.
         | 
         | [?]
         | 
         | Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
         | 
         | Initial conditions: The numbers are positive integers. The
         | puzzle implies a situation of ambiguity at first, but after
         | multiple rounds of "I don't know," Person A finally knows their
         | number is 65.
         | 
         | Let's denote clearly the conditions again: * Exactly one of the
         | following three equations is true: * A + B = C, or * B + C = A,
         | or * C + A = B.
         | 
         | Person A sees B and C, but not A. Similarly for Persons B and
         | C.
         | 
         | [?]
         | 
         | Step 3: Logical deductions from the repeated "I don't know"
         | 
         | First round deductions: * Person A initially doesn't know their
         | number. This means, from Person A's perspective, both scenarios
         | A = B + C and A + B = C (or A + C = B) are still possible. If
         | Person A had seen identical numbers, for instance, Person A
         | could have instantly known their number. Thus, Person A sees
         | two different numbers. * Similarly, Person B doesn't know
         | either, implying that Person B also sees ambiguity in their own
         | number. Thus, Person B also sees two numbers that could result
         | in at least two possible solutions. * Person C also doesn't
         | know, same logic as above.
         | 
         | Thus, after one full round, we understand that initially, all
         | three numbers must create ambiguity. No trivial case (like two
         | identical numbers) is present.
         | 
         | Second round deductions:
         | 
         | After hearing that no one knows their number, each person gains
         | information: they now know the scenario is one in which none of
         | the other two people could have immediately solved it.
         | 
         | Crucially, in the second round: * Person A now knows their
         | number is 65 after considering the previous statements. This
         | implies that Person A sees numbers B and C, and from these
         | numbers plus the statements of uncertainty, Person A deduces
         | that their own number can only be 65.
         | 
         | [?]
         | 
         | Step 4: Deriving the Numbers (Critical deduction)
         | 
         | Let's denote clearly again: Person A's number is now known to
         | be 65.
         | 
         | We have three possible equations. Let's test each carefully
         | given A = 65: 1. Case 1: A + B = C Then we have: * 65 + B = C.
         | Person A sees numbers B and C, so they would see B and 65 + B.
         | Person A sees numbers B and 65+B. If Person A sees these two
         | numbers, they might initially consider their number could be
         | either the sum or difference of these two. * However, if Person
         | A sees two numbers, and one is exactly 2x the other, Person A
         | could immediately guess their number is the difference (since
         | if you see numbers X and 2X, your number is clearly X). So this
         | scenario of B and 65+B is okay only if these two numbers don't
         | exhibit an immediate trivial scenario. * Let's hold this
         | scenario and continue examining others first. 2. Case 2: B + C
         | = A (So, B + C = 65) * Then Person A sees numbers B and C,
         | adding exactly to 65. * Initially, Person A sees two numbers
         | summing exactly to 65. Could Person A know their number
         | immediately at first glance? No, since their number could
         | either be the sum (65) or the difference of the two numbers.
         | Ambiguity here initially makes sense. * Initially, Person B and
         | Person C also see ambiguous scenarios. They both don't know
         | their numbers either at first round. This scenario is
         | plausible. 3. Case 3: C + A = B (So, C + 65 = B) * Then Person
         | A sees numbers B = C + 65 and C. * Person A sees C and C+65.
         | Similar reasoning as Case 1.
         | 
         | [?]
         | 
         | Step 5: Identifying the Correct Scenario
         | 
         | The puzzle's logic typically aligns best with Case 2: B + C = A
         | = 65.
         | 
         | Here's why clearly stated: * Initially, Person A sees two
         | numbers adding to exactly 65. At first glance, Person A's
         | number could either be 65 (sum) or the difference between the
         | two numbers. * Person B also initially sees two numbers and
         | cannot distinguish immediately their own number, nor can Person
         | C. * After everyone states "I don't know," Person A gains
         | critical insight: if the difference between the two numbers
         | (that A sees) allowed Person B or Person C to know their number
         | immediately, they'd have known right away. The fact that B and
         | C didn't know means the difference is not immediately obvious
         | to B or C. Thus, the difference cannot be trivial (such as
         | identical numbers or easy multiples).
         | 
         | Therefore, the two numbers must be such that their difference
         | doesn't immediately clue in the other persons. This occurs most
         | clearly with pairs of numbers which have multiple factorization
         | possibilities or aren't trivial multiples.
         | 
         | But we must carefully test quickly to find a suitable,
         | plausible integer pair whose sum is 65 and whose ambiguity
         | would logically survive through multiple rounds. The puzzle is
         | a classic logic puzzle with a known structure:
         | 
         | A common scenario in these classic puzzles is the use of
         | Fibonacci-like or similar "golden" sets. Let's quickly check
         | the following well-known ambiguous triple: * Numbers: 65, 39,
         | 26 * Check quickly: indeed, 39 + 26 = 65.
         | 
         | Checking logic quickly: * Person A sees 39 and 26. Initially
         | ambiguous (number could be 65 or 13). * Person B sees 65 and
         | 26. Could be either 39 or 91 at first glance. Initially
         | ambiguous. * Person C sees 65 and 39. Could be either 26 or 104
         | initially. Initially ambiguous.
         | 
         | After each person says they don't know: * Now, Person A
         | considers: If my number was 13, then Person B would see numbers
         | 13 and 26. Person B would then immediately know their number is
         | 39 (13+26=39), no ambiguity. But Person B didn't know! So
         | Person A's number cannot be 13. Hence, Person A confidently
         | concludes their number is 65.
         | 
         | Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions
         | perfectly!
         | 
         | [?]
         | 
         | Step 6: Final Answer (Product of the Numbers)
         | 
         | The three numbers identified clearly are: * A = 65 * B = 39 * C
         | = 26
         | 
         | Then, the product is: 65 \times 39 \times 26 = 65 \times 1014 =
         | 65,910
         | 
         | [?]
         | 
         | Final Answer:
         | 
         | \boxed{65,910}
        
         | _cs2017_ wrote:
          | This is solvable in roughly half an hour with pen and paper
          | by a random person I picked with no special math skills
          | (beyond a university education). This is far from a
          | difficult problem. The "95%+" in math reasoning is a
          | meaningless standard; it's like saying a model is better
          | than 99.9% of the world population at the Albanian
          | language, since less than 0.1% bother to learn Albanian.
          | 
          | Even ignoring the fact that this or a similar problem may
          | have appeared in the training data, it's something careful
          | brute-force logic should solve. It's neither difficult, nor
          | interesting, nor useful. Yes, it may suggest a slight
          | improvement in basic logic, but no more so than a million
          | other benchmarks people quote.
          | 
          | This goes to show that evaluating models is not a trivial
          | problem. In fact, it's a hard problem (in particular, far,
          | far harder than this math puzzle).
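          | 
          | To make "brute force" concrete, here's a rough sketch in
          | Python (modelling each "I don't know" as a consistency
          | filter over hypothetical worlds is my own framing, not
          | code from anywhere):
          | 
          |     from functools import lru_cache
          | 
          |     def candidates(x, y):
          |         # one number is the sum of the other two, so your
          |         # own number is their sum or positive difference
          |         c = {x + y}
          |         if x != y:
          |             c.add(abs(x - y))
          |         return c
          | 
          |     @lru_cache(maxsize=None)
          |     def consistent(w, k):
          |         # is world w consistent with the first k
          |         # announcements? announcement j is person j
          |         # saying "I don't know my number"
          |         if k == 0:
          |             return True
          |         if not consistent(w, k - 1):
          |             return False
          |         s = k - 1  # the speaker
          |         others = tuple(w[j] for j in range(3) if j != s)
          |         live = [v for v in candidates(*others)
          |                 if consistent(tuple(v if j == s else w[j]
          |                                     for j in range(3)),
          |                               k - 1)]
          |         return len(live) > 1  # they truly didn't know
          | 
          |     # person 1 sees 26 and 39; after three "don't know"s
          |     # only one value survives
          |     print([v for v in candidates(26, 39)
          |            if consistent((v, 26, 39), 3)])  # -> [65]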
        
         | highfrequency wrote:
         | Fun puzzle! I'm curious how you managed to structure the
         | problem such that a computer could solve it but it took 3 days
         | of computation?
        
         | dkjaudyeqooe wrote:
         | > I think it's not an exaggeration to say LLMs are now better
         | than 95+% of the population at mathematical reasoning.
         | 
          | It's not an exaggeration, it's a non sequitur: you first
          | have to show that LLMs reason in the same way humans do.
        
         | r0fl wrote:
         | Wow
         | 
          | Tried this in DeepSeek and Grok, and it kept thinking in
          | loops for so long that I just turned it off.
          | 
          | I haven't ever seen a question loop this long.
         | 
         | Very impressed
        
       | honeybadger1 wrote:
       | Claude is still the king right now for me. Grok is 2nd in line,
       | but sometimes it's better.
        
       | nickandbro wrote:
       | Wow, was able to nail the pelican riding on a bicycle test:
       | 
       | https://www.svgviewer.dev/s/FImn7kAo
        
         | anon373839 wrote:
         | That's actually too good to believe. I have a feeling simonw's
         | favorite test has been special-cased...
        
           | Workaccount2 wrote:
           | It seems pretty good at it. The hair on the boy is messed up,
           | but still decent.
           | 
           | "A boy eating a sandwhich"
           | 
           | https://www.svgviewer.dev/s/VhcGxnIR
           | 
           | "A multimeter"
           | 
           | https://www.svgviewer.dev/s/N5Dzrmyt
        
           | sebzim4500 wrote:
           | I doubt it is explicitly special cased, but now that it's all
           | over twitter etc. it will have ended up many times in the
           | training data.
        
           | KTibow wrote:
           | They could've RLed on SVGs - wouldn't be hard to render them,
           | test adherence through Gemini or CLIP, and reward fittingly
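            | 
            | Roughly like this (cairosvg and open_clip here are just
            | my guesses at the plumbing; nothing is confirmed about
            | how Google actually trained it):
            | 
            |     import io
            |     import cairosvg, torch, open_clip
            |     from PIL import Image
            | 
            |     model, _, preprocess = (
            |         open_clip.create_model_and_transforms(
            |             "ViT-B-32", pretrained="laion2b_s34b_b79k"))
            |     tokenizer = open_clip.get_tokenizer("ViT-B-32")
            | 
            |     def svg_reward(svg_text, prompt):
            |         # unrenderable SVG gets a hard penalty
            |         try:
            |             png = cairosvg.svg2png(
            |                 bytestring=svg_text.encode(),
            |                 output_width=224, output_height=224)
            |         except Exception:
            |             return -1.0
            |         img = preprocess(Image.open(io.BytesIO(png))
            |                          .convert("RGB")).unsqueeze(0)
            |         txt = tokenizer([prompt])
            |         with torch.no_grad():
            |             # reward = image/text cosine similarity
            |             a = model.encode_image(img)
            |             b = model.encode_text(txt)
            |         return torch.cosine_similarity(a, b).item()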
        
       | xnx wrote:
       | It will be huge achievement if models can get to the point where
       | so much selection effort isn't required: gemini.google.com
       | currently lists 2.0 Flash, 2.0 Flash Thinking (experimental),
       | Deep Research, Personalization (experimental), and 2.5 Pro
       | (experimental) for me.
        
         | lxgr wrote:
         | There's probably a sweet spot here. On the flip side, ChatGPT
         | currently doesn't indicate whether a given image generation
         | request was serviced by multimodal GPT-4o [1] or Dall-E.
         | 
         | Personally, I do like the "use web search" and "extended
         | thinking" buttons, but ultimately, the models should probably
         | be able to figure out whether doing so would be useful
         | themselves too.
         | 
         | [1] https://news.ycombinator.com/item?id=43474112
        
       | resource_waste wrote:
        | I'll try it tonight, but I'm not excited; it's just work.
        | 
        | With ChatGPT 4.5, I was excited.
        | 
        | With DeepSeek, I was excited (then later disappointed).
        | 
        | I know Gemini probably won't answer any medical question,
        | even if you are a doctor. ChatGPT will.
       | 
       | I know I've been disappointed at the quality of Google's AI
       | products. They are backup at best.
        
         | FergusArgyll wrote:
         | It interpreted blood work for me
         | 
         | (Everything's ok, I'm just testing it ;)
        
       | fourseventy wrote:
       | Does it think the founding fathers were a diverse group of mixed
       | races and genders like the last model did?
        
       | eenchev wrote:
       | "My info, the stuff I was trained on, cuts off around early
       | 2023." - Gemini 2.5 to me. Appears that they did a not-so-recent
       | knowledge cutoff in order to use the best possible base model.
        
         | staticman2 wrote:
          | It's unlikely the model knows its actual cutoff date. Try
          | asking about 2024 news - for example, in my test it knows
          | the January 2024 Oscar nominees.
         | 
         | On AI studio the model told me today is June 13 2024.
        
       | afro88 wrote:
       | Is this the first model announcement where they show Aider's
       | Polyglot benchmark in the performance comparison table? That's
       | huge for Aider and anotherpaulg!
        
       | Dowwie wrote:
       | This model is a fucking beast. I am so excited about the
       | opportunities this presents.
        
       | arjun_krishna1 wrote:
       | I've been using Gemini Pro for my University of Waterloo capstone
        | engineering project. Really good understanding of PDF
        | documents and good reasoning, as well as structured output.
        | Recommend trying it out at aistudio dot google dot com
        
       | summerlight wrote:
        | This looks like the first model where Google seriously comes
        | back into the frontier competition. 2.0 Flash was nice for
        | the price, but it's focused more on efficiency than on
        | performance.
        
       | dcchambers wrote:
       | > Developers and enterprises can start experimenting with Gemini
       | 2.5 Pro in Google AI Studio now, and Gemini Advanced users can
       | select it in the model dropdown on desktop and mobile. It will be
       | available on Vertex AI in the coming weeks.
       | 
       | I'm a Gemini Advanced subscriber, still don't have this in the
       | drop-down model selection in the phone app, though I do see it on
       | the desktop webapp.
        
         | ehsankia wrote:
         | I see it in both, probably just some gradual rollout delays.
        
       | cp9 wrote:
       | does it still suggest glue on pizza
        
       | lvl155 wrote:
        | With the recent pace of model updates, I wonder which factor
        | is more important: hardware assets, software/talent, or data
        | access. Google is clearly in the lead in terms of data
        | access, in my view. If I were a top talent in AI, I'd go
        | where I could work with the best data, no?
        
         | CuriouslyC wrote:
          | The problem Goog has is its insane bureaucracy and the
          | lack of vision from Sundar, which isn't very attractive
          | from an employee's perspective. If you're working close to
          | Demis, I imagine the situation is better, though.
        
           | rfoo wrote:
           | Now that Noam is back I'm a little bit more optimistic.
        
         | danpalmer wrote:
         | I think an argument could be made for hardware too. Perhaps in
         | absolute terms Nvidia is ahead, but in terms of knowing how to
         | get the most out of the hardware, Google making its own chips,
         | building on their networking, etc, is a pretty big advantage.
         | 
         | (Disclaimer, Googler, but I don't work on any of this, I only
         | have an external layperson's understanding of it)
        
       | d3nj4l wrote:
        | A model that is better on Aider than Sonnet 3.7? For _free_,
       | right now? I think I'll give it a spin this weekend on a couple
       | of projects, seems too good to be true.
        
         | istjohn wrote:
         | With a rate limit of 50 requests per day
        
       | pachico wrote:
       | It really surprises me that Google and Amazon, considering their
       | infrastructure and the urge to excel at this, aren't leading the
       | industry.
        
       | WasimBhai wrote:
        | I do not intend to take anything away from the team's
        | technical achievement. However, as Satya opined some weeks
        | back, these benchmarks do not mean a lot if we do not see a
        | comparable increase in productivity.
        | 
        | But then there are two questions. First, are white-collar
        | workers, specifically consultants and engineers, responsible
        | for the increase in productivity? Or is it the white-collar
        | workers at the very right tail, e.g., scientists?
        | 
        | I think consultants and engineers are using these
        | technologies a lot. I think biologists, at least, are using
        | these models a lot.
        | 
        | But then where are the productivity increases?
        
         | ZeroTalent wrote:
         | Also, why not compare to GPT-o3 in the benchmarks?
        
           | KoolKat23 wrote:
            | The model's not really available.
        
         | KoolKat23 wrote:
         | As it stands currently I'd say this is difficult to measure.
         | 
          | They're not baked into workflows where the measurable
          | output is easily attributed to model use. Productivity in
          | its current form is transformative in the sense that the
          | use case and gains differ for each individual (who even
          | provides different prompts). So some are keeping the gains
          | for themselves; others are using it to improve quality
          | rather than quantity.
          | 
          | It'll come in time; it's important to remember GPT-4 was
          | released two years ago this month. The newer models are
          | more reliable and could probably be introduced into
          | workflows more frequently. Today I spoke to a company that
          | is looking to use it to reduce costs in the next year.
        
         | danpalmer wrote:
         | That's true, but productivity has many factors and takes a long
         | time to get confidence on. Any productivity value that could be
         | stated clearly would have similar downsides to a benchmark, and
         | take far longer.
         | 
         | Benchmarks are useful as leading indicators. Early warning
         | signs. If there's no relation to the eventual productivity then
         | hopefully that benchmark will disappear as it's not useful.
         | 
         | In a fast moving space like this it's reasonable to make use of
         | leading indicators.
        
       | jorl17 wrote:
       | Just a couple of days ago I wrote on reddit about how long
       | context models are mostly useless to me, because they start
       | making too many mistakes very fast. They are vaguely helpful for
       | "needle in a haystack" problems, not much more.
       | 
       | I have a "test" which consists in sending it a collection of
       | almost 1000 poems, which currently sit at around ~230k tokens,
       | and then asking a bunch of stuff which requires reasoning over
       | them. Sometimes, it's something as simple as "identify key
       | writing periods and their differences" (the poems are ordered
       | chronologically). Previous models don't usually "see" the final
       | poems -- they get lost, hallucinate and are pretty much
       | worthless. I have tried several workaround techniques with
       | varying degrees of success (e.g. randomizing the poems).
       | 
       | Having just tried this model (I have spent the last 3 hours
       | probing it), I can say that, to me, this is a breakthrough
       | moment. Truly a leap. This is the first model that can
       | consistently comb through these poems (200k+ tokens) and analyse
       | them as a whole, without significant issues or problems. I have
       | no idea how they did it, but they did it.
       | 
       | The analysis of this poetic corpus has few mistakes and is very,
       | very, very good. Certainly very good in terms of how quickly it
       | produces an answer -- it would take someone days or weeks of
       | thorough analysis.
       | 
       | Of course, this isn't about poetry -- it's about passing in huge
       | amounts of information, without RAG, and having a high degree of
       | confidence in whatever reasoning tasks this model performs. It is
       | the first time that I feel confident that I could offload the
       | task of "reasoning" over large corpus of data to an LLM. The
       | mistakes it makes are minute, it hasn't hallucinated, and the
       | analysis is, frankly, better than what I would expect of most
       | people.
       | 
       | Breakthrough moment.
        
         | Alifatisk wrote:
         | Two years ago, Claude was known for having the largest context
         | window and being able to remember tokens throughout the whole
         | conversation.
         | 
          | Today, it seems like Google has beaten them: it supports a
          | way larger context window and is way better at keeping
          | track of what has been said and remembering older tokens.
        
       | greatgib wrote:
       | If you plan to use Gemini, be warned, here are the usual Big Tech
       | dragons:                  Please don't enter ...confidential info
       | or any data... you wouldn't want a reviewer to see or Google to
       | use ...
       | 
       | The full extract of the terms of usage:                  How
       | human reviewers improve Google AI             To help with
       | quality and improve our products (such as the generative machine-
       | learning models that power Gemini Apps), human reviewers
       | (including third parties) read, annotate, and process your Gemini
       | Apps conversations. We take steps to protect your privacy as part
       | of this process. This includes disconnecting your conversations
       | with Gemini Apps from your Google Account before reviewers see or
       | annotate them. Please don't enter confidential information in
       | your conversations or any data you wouldn't want a reviewer to
       | see or Google to use to improve our products, services, and
       | machine-learning technologies.
        
         | greatgib wrote:
         | And even worse:                  Conversations that have been
         | reviewed or annotated by human reviewers (and related data like
         | your language, device type, location info, or feedback) are not
         | deleted when you delete your Gemini Apps activity because they
         | are kept separately and are not connected to your Google
         | Account. Instead, they are retained for up to three years.
         | 
         | Emphasis on "retained for up to three years" even if you delete
         | it!!
        
           | mastodon_acc wrote:
           | How does it compare to OpenAI and anthropic's user data
           | retention policy?
        
             | greatgib wrote:
              | If I'm not wrong, ChatGPT states clearly that they
              | don't use user data by default anymore.
              | 
              | Also, maybe some services are doing "machine learning"
              | training with user data, but it is the first time I
              | have seen a recent LLM service saying that it can feed
              | your data to human reviewers at will.
        
               | cubefox wrote:
               | They seem to use it as long as the chat history is
               | enabled, similar to Gemini.
               | https://help.openai.com/en/articles/7792795-how-do-i-
               | turn-of...
        
           | kccqzy wrote:
           | Well they can't delete a user's Gemini conversations because
           | they don't know which user a particular conversation comes
           | from.
           | 
            | This seems better, not _worse_, than keeping the user-
           | conversation mapping so that the user may delete their
           | conversations.
        
         | mastodon_acc wrote:
          | I mean, this is pretty standard for online LLMs. What is
          | Gemini doing here that OpenAI or Anthropic aren't already
          | doing?
        
         | KoolKat23 wrote:
         | I don't think this is the same as the AI studio and API terms.
         | This looks like your consumer facing Gemini T&C's.
        
         | sauwan wrote:
         | I'm assuming this is true of all experimental models? That's
         | not true with their models if you're on a paid tier though,
         | correct?
        
         | summerlight wrote:
          | You can use a paid tier to avoid such issues. Not sure what
          | you're expecting from these "experimental" models, which
          | are in development and need user feedback.
        
         | suyash wrote:
          | All the more reason for new privacy guidelines, especially
          | for big tech and AI.
        
         | cavisne wrote:
         | Google is the best of these. You either pay per token and there
         | is no training on your inputs, or it's free/a small monthly fee
         | and there is training.
        
       | TheMagicHorsey wrote:
       | I tested out Gemini 2.5 and it failed miserably at calling into
       | tools that we had defined for it. Also, it got into an infinite
       | loop a number of times where it would just spit out the exact
       | same line of text continuously until we hard killed the process.
       | I really don't know how others are getting these amazing results.
       | We had no problems using Claude or OpenAI models in the same
       | scenario. Even Deepseek R1 works just fine.
        
       | batata_frita wrote:
        | Why do I have the feeling that nobody is as excited about
        | Google's models as about other companies'?
        
         | Mond_ wrote:
         | Google is worse at marketing and hyping people up.
        
         | bklyn11201 wrote:
         | Most of us care only about coding performance, and Sonnet 3.5
         | has been such a giant winner that we don't get too excited
         | about the latest model from Google.
        
         | SamuelAdams wrote:
         | Because it's more likely to be sunsetted.
         | 
         | https://killedbygoogle.com/
        
         | owlninja wrote:
          | Yeah, I get a little bummed, but I guess a lot of HNers
          | have reasons not to like Google. I've had a Google One
          | membership forever, so I opted for the higher subscription
          | with Gemini access since the beginning (plus a free year
          | with a new Pixel phone), and I think it is awesome.
        
         | ipsum2 wrote:
          | They're not good models. They overfit to the LMArena
          | leaderboard, but perform worse in real-life scenarios
          | compared to their competitors.
          | 
          | The exceptions are autoregressive image generation and
          | audio models.
        
         | Davidzheng wrote:
          | I feel like Google intentionally doesn't want people to be
          | as excited. This is a very good model. Definitely the best
          | available model today.
        
         | crimsoneer wrote:
         | Because most of the LLM hype is still generated by people who
         | don't use them in production, and those people don't use GCP
        
       | asah wrote:
        | It nailed my two hard reasoning+linguistic+math questions in
        | one shot, both the kinds of things that LLMs struggle with
        | but humans do well.
       | 
       | (DM me for the questions)
        
       | rodolphoarruda wrote:
       | I've been trying to use Gemini 2.0 Flash, but I don't think it's
       | possible. The model still thinks it's running the 1.5 Pro model.
       | 
       | Reference: https://rodolphoarruda.pro.br/wp-
       | content/uploads/image-14.pn...
        
       | strstr wrote:
       | It's a lot better at my standard benchmark "Magic: The Gathering"
       | rules puzzles. Gets the answers right (both the outcome and
       | rationale).
        
         | strstr wrote:
         | Ooof, it failed my "Wheel of Potential" bug finding question,
         | and got aggressive about asserting it was correct.
        
       | Davidzheng wrote:
       | On initial thoughts, I think this might be the first AI model to
       | be reliably helpful as a research assistant in pure mathematics
        | (o3-mini-high can be helpful but is more prone to
        | hallucinations).
        
       | simonw wrote:
       | I'm impressed by this one. I tried it on audio transcription with
       | timestamps and speaker identification (over a 10 minute MP3) and
       | drawing bounding boxes around creatures in a complex photograph
       | and it did extremely well on both of those.
       | 
       | Plus it drew me a very decent pelican riding a bicycle.
       | 
       | Notes here: https://simonwillison.net/2025/Mar/25/gemini/
        
         | kridsdale3 wrote:
         | I just tried your trademark benchmark on the new 4o Image
         | Output, though it's not the same test:
         | 
         | https://imgur.com/a/xuPn8Yq
        
           | Unroasted6154 wrote:
           | Was that an actual SVG?
        
             | simonw wrote:
             | No that's GPT-4o native image output.
        
               | sebzim4500 wrote:
               | I wonder how far away we are from models which, given
               | this prompt, generate that image in the first step in
               | their chain-of-thought and then use it as a reference to
               | generate SVG code.
               | 
               | It could be useful for much more than just silly
               | benchmarks, there's a reason why physics students are
               | taught to draw a diagram before attempting a problem.
        
               | simonw wrote:
               | Someone managed to get ChatGPT to render the image using
               | GPT-4o, then save that image to a Code Interpreter
               | container and run Python code with OpenCV to trace the
               | edges and produce an SVG:
               | https://bsky.app/profile/btucker.net/post/3lla7extk5c2u
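                | 
                | The tracing step is a few lines of OpenCV; a
                | generic sketch of the trick (not the code from
                | that post):
                | 
                |     import cv2
                | 
                |     img = cv2.imread("gen.png",
                |                      cv2.IMREAD_GRAYSCALE)
                |     _, bw = cv2.threshold(img, 128, 255,
                |                           cv2.THRESH_BINARY)
                |     cs, _ = cv2.findContours(
                |         bw, cv2.RETR_LIST,
                |         cv2.CHAIN_APPROX_SIMPLE)
                |     h, w = img.shape
                |     # each traced contour becomes one SVG path
                |     paths = "".join(
                |         '<path d="M ' + " L ".join(
                |             f"{x} {y}"
                |             for x, y in c.reshape(-1, 2))
                |         + ' Z" fill="none" stroke="black"/>'
                |         for c in cs)
                |     svg = ('<svg xmlns="http://www.w3.org/'
                |            '2000/svg"'
                |            f' viewBox="0 0 {w} {h}">'
                |            f'{paths}</svg>')
                |     open("traced.svg", "w").write(svg)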
        
       | og_kalu wrote:
       | One of the biggest problems with hands off LLM writing (for long
       | horizon stuff like novels) is that you can't really give them any
       | details of your story because they get absolutely neurotic with
       | it.
       | 
       | Imagine for instance you give the LLM the profile of the love
       | interest for your epic fantasy, it will almost always have the
       | main character meeting them within 3 pages (usually page 1) which
       | is of course absolutely nonsensical pacing. No attempt to tell it
       | otherwise changes anything.
       | 
        | This is the first model that, 19 pages in, produces anything
        | resembling normal pacing, even with a TON of details. I've
        | never felt the need to generate anywhere near this much.
        | Extremely impressed.
       | 
       | Edit: Sharing it -
       | https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
       | 
       | with pastebin - https://pastebin.com/aiWuYcrF
        
         | fshr wrote:
         | I think you would be better off having the LLM help you build
         | up the plot with high level chapter descriptions and then have
         | it dig into each chapter or arc. Or start by giving it the
         | beats before you ask it for help with specifics. That'd be
         | better at keeping it on rails.
        
           | og_kalu wrote:
            | I don't disagree. Like with almost anything else
            | involving LLMs, getting hands-on produces better
            | results. But because, in this instance, I much prefer to
            | be the reader rather than the author or editor, it's
            | really important to me that an LLM is capable of pacing
            | long-form writing properly on its own.
        
         | Workaccount2 wrote:
         | Can you share it on a text sharing site? It seems you hit your
         | share quota
        
           | og_kalu wrote:
           | Sure - https://pastebin.com/aiWuYcrF
        
         | nout wrote:
          | Using the AI in multiple phases can handle this. Similar
          | to the "Deep Research" approach, you can tell it to first
          | generate a storyline with multiple twists and turns, then
          | ask the model to take this storyline and generate prompts
          | for individual chapters, then ask it to generate the
          | individual chapters based on those prompts, and so on.
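          | 
          | A minimal sketch of that phased flow with the google-genai
          | SDK (the model ID and prompts are placeholders, not a
          | tested recipe):
          | 
          |     from google import genai
          | 
          |     client = genai.Client()  # GEMINI_API_KEY in env
          |     M = "gemini-2.5-pro-exp-03-25"
          | 
          |     def ask(prompt):
          |         r = client.models.generate_content(
          |             model=M, contents=prompt)
          |         return r.text
          | 
          |     premise = "An epic fantasy where ..."
          | 
          |     # phase 1: storyline with twists, no prose yet
          |     outline = ask("Write a chapter-by-chapter outline,"
          |                   " with twists, for: " + premise)
          | 
          |     # phase 2: a focused prompt per chapter
          |     prompts = ask("For each chapter below, write a"
          |                   " self-contained prompt saying what"
          |                   " must (and must not yet) happen,"
          |                   " separated by blank lines:\n"
          |                   + outline).split("\n\n")
          | 
          |     # phase 3: generate each chapter independently
          |     chapters = [ask("Write this chapter in full:\n" + p)
          |                 for p in prompts]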
        
       | mclau156 wrote:
        | Generated 1000 lines of turn-based combat with shop, skills,
        | stats, elements, enemy types, etc. with this one
        
       | nikcub wrote:
        | Impressive model - but I'm confused by the knowledge cutoff.
        | AI Studio says it is January 2025 (which would be
        | impressive), but when I query it about anything from early
        | 2025 or mid/late 2024, it self-reports that its cutoff is in
        | 2023 (which can't be right).
       | 
       | This is most evident when querying about fast-moving dev tools
       | like uv or bun. It seems to only know the original uv options
       | like pip and tools, while with bun it is unfamiliar with bun
       | outdated (from Aug 2024), bun workspaces (from around that time?)
       | but does know how to install bun on windows (April 2024).
       | 
       | You'll still need to provide this model with a lot of context to
       | use it with any tooling or libraries with breaking changes or new
       | features from the past ~year - which seems to contradict the AI
       | Studio reported knowledge cutoff.
       | 
       | Were I developing models - I'd prioritise squeezing in the most
       | recent knowledge of popular tools and libraries since development
       | is such a popular (and revenue generating) use case.
        
       | joshdavham wrote:
       | When these companies release a model "2.5", are they using some
       | form of semver? Where are these numbers coming from?
        
       | zone411 wrote:
       | Scores 54.1 on the Extended NYT Connections Benchmark, a large
       | improvement over Gemini 2.0 Flash Thinking Experimental 01-21
       | (23.1).
       | 
       | 1 o1-pro (medium reasoning) 82.3
       | 
       | 2 o1 (medium reasoning) 70.8
       | 
       | 3 o3-mini-high 61.4
       | 
       | 4 Gemini 2.5 Pro Exp 03-25 54.1
       | 
       | 5 o3-mini (medium reasoning) 53.6
       | 
       | 6 DeepSeek R1 38.6
       | 
       | 7 GPT-4.5 Preview 34.2
       | 
       | 8 Claude 3.7 Sonnet Thinking 16K 33.6
       | 
       | 9 Qwen QwQ-32B 16K 31.4
       | 
       | 10 o1-mini 27.0
       | 
       | https://github.com/lechmazur/nyt-connections/
        
       ___________________________________________________________________
       (page generated 2025-03-25 23:00 UTC)