[HN Gopher] OpenAI o3-pro
       ___________________________________________________________________
        
       OpenAI o3-pro
        
       Author : mfiguiere
       Score  : 122 points
       Date   : 2025-06-10 20:15 UTC (2 hours ago)
        
 (HTM) web link (help.openai.com)
 (TXT) w3m dump (help.openai.com)
        
       | mmsc wrote:
        | I understand that things are moving fast and all, but surely
        | the... 8? models currently available are a bit overwhelming
        | for users who just want to get answers to their questions of
        | life? What's the end goal with having so many models
        | available?
        
         | Osyris wrote:
         | This is a much more expensive model to run and is only
         | available to users who pay the most. I don't see an issue.
         | 
         | However, the "plus" plan absolutely could use some trimming.
        
         | bachittle wrote:
          | free users don't have this model selector, and probably don't
          | care which model they get, so 4o is good enough. paid users
          | at $20/month get more models which are better, like o3. paid
          | users at $200/month get the best models that also cost OpenAI
          | the most money, like o3-pro. I think they plan to unify them
          | with GPT-5.
        
           | stavros wrote:
           | That doesn't help much when we're asymptotically approaching
           | GPT-5. We're probably going to be at GPT-4.9999 soon.
        
           | nikcub wrote:
           | I'd be curious what proportion of paid users ever switch
           | models. I'd guess < 10%
        
             | CamperBob2 wrote:
             | I switch to o1-pro on occasion, but it is slow enough that
             | I don't use it as much as some of the others. It is a
             | reasonably-effective last resort when I'm not getting the
             | answer quality that I think should be achievable. It's the
             | best available reasoning model from any provider by a
             | noticeable margin.
             | 
             | Sounds like o3-pro is even slower, which is fine as long as
             | it's better.
             | 
             | o4-mini-high is my usual go-to model if I need something
              | better than the default GPT-4 du jour. I don't see much
             | point in the others and don't understand why they remain
             | available. If o3-pro really is consistently better, it will
             | move o1-pro into that category for me.
        
             | CuriouslyC wrote:
             | If you're not at least switching from 4o to 4.1 you're
             | doing it wrong.
        
         | macawfish wrote:
         | Overwhelming yet pretty underwhelming
        
         | nickysielicki wrote:
         | I just can't believe nobody at the company has enough courage
         | to tell their leadership that their naming scheme is completely
         | stupid and insane. Four is greater than three, and so four
         | should be better than three. The point of a name is to describe
         | something so that you don't confuse your users, not to be cute.
        
           | browningstreet wrote:
           | At Techcrunch AI last week, the OpenAI guy started his
           | presentation by acknowledging that OpenAI knows their naming
           | is a problem and they're working on it, but it won't be fixed
           | immediately.
        
             | moomin wrote:
             | I know they have a deep relationship with Microsoft, but
             | perhaps they shouldn't have used Microsoft's product naming
             | department.
        
               | orra wrote:
               | Zune .NET O3... _shudders_
        
             | simonw wrote:
             | Sam Altman has said the same thing on Twitter a few times.
             | https://x.com/sama/status/1911906570835022319
             | 
             | > how about we fix our model naming by this summer and
             | everyone gets a few more months to make fun of us (which we
             | very much deserve) until then?
        
               | nickysielicki wrote:
               | I'd prefer for them to just fix it asap instead and then
               | keep the existing endpoints around for a year as aliases.
        
           | MallocVoidstar wrote:
            | The reason their naming scheme is so bad is that their
           | initial attempts at GPT-5 failed in training. It was supposed
           | to be done ~1 year ago. Because they'd promised that GPT-5
           | would be vastly more intelligent than GPT-4, they couldn't
           | just name any random model "GPT-5", so they suddenly had to
           | start naming things differently. So now there's GPT-4.5,
           | GPT-4.1, the o-series, ...
        
           | transcriptase wrote:
           | What's worse is that the app doesn't even have descriptions.
           | As if I'm supposed to memorize the use case for each based
           | on:
           | 
           | GPT-4o
           | 
           | o3
           | 
           | o4-mini
           | 
           | o4-mini-high
           | 
           | GPT-4.5
           | 
           | GPT-4.1
           | 
           | GPT-4.1-mini
        
             | koakuma-chan wrote:
             | Just use o4-mini for everything
        
           | aetherspawn wrote:
            | Came here to say this: the naming scheme is ridiculous and
            | is getting harder to follow by the day.
           | 
           | For example the other day they released a supposedly better
           | model with a lower number..
        
             | aetherspawn wrote:
             | I'd honestly prefer they just have 3 personas of varying
             | cost/intelligence: Sam, Elmo and Einstein or something, and
             | then tack on the date, elmo-2025-1 and silently delete the
             | old ones.
        
         | levocardia wrote:
         | There's a humorous version of Poe's law that says "any
         | sufficiently genuine attempt to explain the differences between
         | OpenAI's models is indistinguishable from parody"
        
         | paxys wrote:
         | > users that just want to get answers to their questions of
         | life
         | 
         | Those users go to chat.openai.com (or download the app), type
         | text in the box and click send.
        
         | AtlasBarfed wrote:
         | I'd like one to do my test use case:
         | 
         | Port unix-sed from c to java with a full test suite and all
         | options supported.
         | 
         | Somewhere between "it answers questions of life" and "it beats
         | PhDs at math questions", I'd like to see one LLM take this,
          | IMO, rather "pure" language task and succeed.
         | 
         | It is complicated, but it isn't complex. It's string operations
         | with a deep but not that deep expression system and flag set.
         | 
         | It is well-described and documented on the internet, and
         | presumably training sets. It is succinctly described as a
         | problem that virtually all computer coders would understand
         | what it entailed if it were assigned to them. It is drudgerous,
         | showing the opportunity for LLMs to show how they would improve
         | true productivity.
         | 
         | GPT fails to do anything other than the most basic substitute
         | operations. Claude was only slightly better, but to its
         | detriment hallucinated massive amounts and made fake passing
         | test cases that didn't even test the code.
         | 
         | The reaction I get to this test is ambivalence, but IMO if LLMs
         | could help port entire software packages between languages with
         | similar feature sets (aside from Turing Completeness), then
         | software cross-use would explode, and maybe we could port
         | "vulnerable" code to "safe" Rust en masse.
         | 
         | I get it, it's not what they are chasing customer-wise. They
         | want to write (in n-gate terms) webcrap.
        
           | CamperBob2 wrote:
           | How does the latest Gemini 2.5 Pro Ultra Flash Max Hemi XLT
           | release do on that task? It obviously demands a massive
           | context window.
        
         | resters wrote:
         | Models are used for actual tasks where predictable behavior is
         | a benefit. Models are also used on cutting-edge tasks where
         | smarter/better outputs are highly valued. Some applications
         | value speed and so a new, smaller/cheaper model can be just
         | right.
         | 
         | I think the naming scheme is just fine and is very
         | straightforward to anyone who pays the slightest bit of
         | attention.
        
       | manmal wrote:
       | The benchmarks don't look _that_ much better than o3. Does that
       | mean Pro models are just incrementally better than base models,
       | or are we approaching the higher end of a sigmoid function, with
       | performance gains leveling off?
        
         | bachittle wrote:
         | it's the same model as o3, just with thinking tokens turned up
         | to the max.
        
           | Tiberium wrote:
           | That's simply not true, it's not just "max thinking budget
           | o3" just like o1-pro wasn't "max thinking budget o1". The
           | specifics are unknown, but they might be doing multiple model
           | generations and then somehow picking the best answer each
           | time? Of course that's a gross simplification, but some
           | assume that they do it this way.
        
             | cdblades wrote:
             | > That's simply not true, it's not just "max thinking
             | budget o3"
             | 
             | > The specifics are unknown, but they might...
             | 
             | Hold up.
             | 
             | > but some assume that they do it this way.
             | 
             | Come on now.
        
               | MallocVoidstar wrote:
               | Good luck finding the tweet (I can't) but at least one
               | OpenAI engineer has said that o1-pro was _not_ just  'o1
               | thinking longer'.
        
               | boole1854 wrote:
               | I also don't have that tweet saved, but I do remember it.
        
               | PhilippGille wrote:
               | This one? Found with Kagi Assistant.
               | 
               | https://x.com/michpokrass/status/1869102222598152627
               | 
               | It says:
               | 
               | > hey aidan, not a miscommunication, they are different
               | products! o1 pro is a different implementation and not
               | just o1 with high reasoning.
        
             | firejake308 wrote:
             | > "We also introduced OpenAI o3-pro in the API--a version
             | of o3 that uses more compute to think harder and provide
             | reliable answers to challenging problems"
             | 
             | Sounds like it is just o3 with higher thinking budget to me
        
         | dyauspitr wrote:
         | Don't they have a full fledged version of o4 somewhere
         | internally at this point?
        
           | ankit219 wrote:
            | It seems they do. o1 and o3 were based on the same base
           | model. o4 is going to be based on a newer (and perhaps
           | smarter) base model.
        
       | tiahura wrote:
       | So, upgrade to Teams and pay the $50? Plus more usage of o3.
       | Seems like it might be a shot at the $100 claude max?
        
         | dog436zkj3p7 wrote:
         | What do you mean with "pay the $50"?
         | 
         | Also, does anybody know what limits o3-pro has under the team
         | plan? I don't see it available in the model picker at all (on
         | team).
        
       | carmelion wrote:
       | Jl App
        
       | cluckindan wrote:
       | "designed to think longer"
       | 
       | Translation: designed to fool people into believing it's thinking
       | longer, or at all.
        
         | Workaccount2 wrote:
         | With Gemini 2.5 in AI studio you can now increase the amount of
          | thinking tokens, and it definitely makes a difference. o3-pro
          | is most likely o3 with an expanded thinking token budget.
        
           | energy123 wrote:
           | Isn't that just increasing the upper bound on thinking
           | tokens, which is rarely hit even on much lower levels?
        
           | cluckindan wrote:
           | It is not thinking. It is trying to deceive you. The
           | "reasoning" it outputs does not have a causal relationship
           | with the end result.
        
             | benxh wrote:
             | The longer "it" reasons, the more attention sinks are used
             | to come to a "better" final output.
        
               | manmal wrote:
               | I've looked up attention sinks and can't figure out how
               | you're using the term here. It sounds interesting, would
               | you care to elaborate?
        
             | Sohcahtoa82 wrote:
             | > The "reasoning" it outputs does not have a causal
             | relationship with the end result.
             | 
             | It absolutely does.
             | 
             | Now, we can argue all about whether it's truly "reasoning",
             | but I've certainly seen cases where if you ask it a
             | question but say "Give just the answer", it'll consistently
             | give a wrong answer, whereas if you let it explain its
             | thought process before giving a final answer, it'll
             | consistently get it right.
             | 
             | LLMs are at their core just next-token guessing machines.
             | By allowing them to output extra "reasoning" tokens, it can
             | prime the context to give better answers.
             | 
             | Think of it like solving an algebraic equation. Humans
             | can't typically solve any but the most trivial equations in
             | a single step, and neither can an LLM. But like a human, an
             | LLM can solve one if it takes it one step at a time.
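              | 
A toy version of the "one step at a time" point above, made concrete: solving 3x + 5 = 20 via explicit intermediate steps, the way a reasoning trace externalizes intermediate tokens instead of jumping straight to the answer. (A hedged sketch; the equation and variable names are invented for illustration, nothing here is LLM-specific.)

```python
# Solve 3x + 5 = 20 one transformation at a time, recording each
# intermediate result the way reasoning tokens record partial work.

steps = []

# Step 1: subtract 5 from both sides -> 3x = 15
rhs = 20 - 5
steps.append(f"3x = {rhs}")

# Step 2: divide both sides by 3 -> x = 5
x = rhs // 3
steps.append(f"x = {x}")

print(steps)  # ['3x = 15', 'x = 5']
```

Forcing the intermediate lines into the output is the textual analogue of letting the model "show its work" before committing to an answer.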
        
           | dbbk wrote:
           | Or my favourite, tell Claude to "ultrathink"
        
       | ChrisArchitect wrote:
       | Related:
       | 
       |  _OpenAI dropped the price of o3 by 80%_
       | 
       | https://news.ycombinator.com/item?id=44239359
        
       | swyx wrote:
       | here's a nice user review we published:
       | https://www.latent.space/p/o3-pro
       | 
       | sama's highlight[0]:
       | 
       | > "The plan o3 gave us was plausible, reasonable; but the plan o3
       | Pro gave us was specific and rooted enough that it actually
       | changed how we are thinking about our future."
       | 
       | I kept nudging the team to go the whole way to just let o3 be
       | their CEO but they didn't bite yet haha
       | 
       | 0: https://x.com/sama/status/1932533208366608568
        
         | tomComb wrote:
         | Big fan swyx, but both here and in the article there is some
         | bragging about being quoted by sama, and while I acknowledge
         | that that's not out of the ordinary, I'm concerned about where
         | it leads: what it takes to get quoted by sama (or similar
         | interested party) is saying something good about his product,
         | and having a decent follower count.
         | 
         | Dangerous incentives IMO.
        
       | WhitneyLand wrote:
       | So, we currently have o4-mini and o4-mini-high, which represent
       | medium and high usage of "thinking" or use of reasoning tokens.
       | 
       | This announcement adds o3-pro, which pairs with o3 in the same
       | way the o4 models go together.
       | 
       | It should be called o3-high, but to align with the $200 pro
       | membership it's called pro instead.
       | 
       | That said o3 is already an incredibly powerful model. I prefer it
        | over the new Anthropic 4 models and Gemini 2.5. Its raw power
       | seems similar to those others, but it's so good at inline tool
       | use it usually comes out ahead overall.
       | 
       | Any non-trivial code generation/editing should be using an
       | advanced reasoning model, or else you're losing time fixing more
       | glitches or missing out on better quality solutions.
       | 
       | Of course the caveat is cost, but there's value on the frontier.
        
         | boole1854 wrote:
         | No, this doesn't seem to be correct, although confusion
         | regarding model names is understandable.
         | 
         | o4-mini-high is the label on chatgpt.com for what in the API is
         | called o4-mini with reasoning={"effort": "high"}. Whereas
         | o4-mini on chatgpt.com is the same thing as
         | reasoning={"effort": "medium"} in the API.
         | 
         | o3 can also be run via the API with reasoning={"effort":
         | "high"}.
         | 
         | o3-pro is _different_ than o3 with high reasoning. It has a
         | separate endpoint, and it runs for much longer.
         | 
         | See https://platform.openai.com/docs/guides/reasoning?api-
         | mode=r...
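          | 
The naming-to-API mapping boole1854 describes can be sketched as request payloads (a minimal, hedged sketch: `reasoning={"effort": ...}` is the API parameter named in the comment; the helper function and prompt string are invented for illustration, and no request is actually sent):

```python
# Payloads only -- no network call is made here.

def reasoning_request(model: str, effort: str) -> dict:
    """Build a request payload selecting a model and reasoning effort."""
    return {
        "model": model,
        "reasoning": {"effort": effort},
        "input": "Summarize this thread.",  # placeholder prompt
    }

# chatgpt.com's "o4-mini" label corresponds to medium effort in the API,
o4_mini = reasoning_request("o4-mini", "medium")
# while "o4-mini-high" is the same model with effort set to high.
o4_mini_high = reasoning_request("o4-mini", "high")
# o3 can also be run with high effort...
o3_high = reasoning_request("o3", "high")
# ...but o3-pro is a separate model ID, not just o3 with effort turned up.
o3_pro = reasoning_request("o3-pro", "high")

# Same underlying model, different effort:
assert o4_mini["model"] == o4_mini_high["model"]
# Different models entirely:
assert o3_high["model"] != o3_pro["model"]
```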
        
       | chad1n wrote:
       | The guys in the other thread who said that OpenAI might have
       | quantized o3 and that's how they reduced the price might be
       | right. This o3-pro might be the actual o3-preview from the
       | beginning and the o3 might be just a quantized version. I wish
        | someone would benchmark all of these models to check for drops in
       | quality.
        
         | simonw wrote:
         | That's definitely not the case here. The new o3-pro is _slow_ -
         | it took two minutes just to draw me an SVG of a pelican riding
         | a bicycle. o3-preview was much faster than that.
         | 
         | https://simonwillison.net/2025/Jun/10/o3-pro/
        
           | k2xl wrote:
            | Not distilled, same model.
            | https://x.com/therealadamg/status/1932534244774957121?s=46&t...
        
           | CamperBob2 wrote:
           | Would you say this is the best cycling pelican to date? I
           | don't remember any of the others looking better than this.
           | 
           | Of course by now it'll be in-distribution. Time for a new
           | benchmark...
        
             | jstummbillig wrote:
             | I love that we are in the timeline where we are somewhat
                | seriously evaluating probably superhuman intelligence
                | by their ability to draw an SVG of a cycling pelican.
        
               | CamperBob2 wrote:
               | I still remember my jaw hitting the floor when the first
               | DALL-E paper came out, with the baby daikon radish
                | walking a dog. How the actual _fuck_...? Now we're
               | probably all too jaded to fully appreciate the next
               | advance of that magnitude, whatever that turns out to be.
               | 
               | E.g., the pelicans all look pretty cruddy including this
               | one, but the fact that they are being delivered in .SVG
               | is a bigger deal than the quality of the artwork itself,
               | IMHO. This isn't a diffusion model, it's an
               | autoregressive transformer imitating one. The wonder
               | isn't that it's done badly, it's that it's happening at
               | all.
        
           | AstroBen wrote:
           | That's one good looking pelican
        
         | gkamradt wrote:
         | o3-pro is not the same as the o3-preview that was shown in Dec
         | '24. OpenAI confirmed this for us. More on that here:
         | https://x.com/arcprize/status/1932535380865347585
        
         | weinzierl wrote:
          | Is there a way to infer likely quantization from the
          | output? I mean, does quantization degrade output quality in
          | ways that are distinguishable from changes to other model
          | properties (e.g. size or distillation)?
        
       | DanMcInerney wrote:
        | I'm really hoping GPT-5 is a larger jump in metrics than the
        | last several releases we've seen, like Claude 3.5 to Claude 4
        | or o3-mini-high to o3-pro. Although I will preface that with
        | the fact that I've
       | been building agents for about a year now and despite the
       | benchmarks only showing slight improvement, I have seen that each
       | new generation feels actively better at exactly the same tasks I
       | gave the previous generation.
       | 
       | It would be interesting if there was a model that was
       | specifically trained on task-oriented data. It's my understanding
       | they're trained on all data available, but I wonder if it can be
       | fine-tuned or given some kind of reinforcement learning on
       | breaking down general tasks to specific implementations.
       | Essentially an agent-specific model.
        
         | codingwagie wrote:
         | I'm seeing big advances that arent shown in the benchmarks, I
          | can simply build software now that I couldn't build before. The
         | level of complexity that I can manage and deliver is higher.
        
           | shmoogy wrote:
           | Yeah I kind of feel like I'm not moving as fast as I did,
           | because the complexity and features grow - constant scope
           | creep due to moving faster.
        
         | energy123 wrote:
         | That would require AIME 2024 going above 100%.
         | 
         | There was always going to be diminishing returns in these
         | benchmarks. It's by construction. It's mathematically
         | impossible for that not to happen. But it doesn't mean the
         | models are getting better at a slower pace.
         | 
         | Benchmark space is just a proxy for what we care about, but
         | don't confuse it for the actual destination.
         | 
         | If you want, you can choose to look at a different set of
         | benchmarks like ARC-AGI-2 or Epoch and observe greater than
         | linear improvements, and forget that these easier benchmarks
         | exist.
        
           | croddin wrote:
           | There is still plenty of room for growth on the ARC-AGI
           | benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1
           | is only at 59% for o3-pro-high:
           | 
           | "ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task
           | * High: 59%, $4.16/task
           | 
           | ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task
           | 
           | Takeaways: * o3-pro in line with o3 performance * o3's new
           | price sets the ARC-AGI-1 Frontier"
           | 
           | - https://x.com/arcprize/status/1932535378080395332
        
             | saberience wrote:
              | I'm not sure the ARC-AGI benchmarks are interesting: for
              | one, they are image-based, and for two, most people I
              | show them to have trouble understanding them; in fact, I
              | had trouble understanding them myself.
              | 
              | Given that the models don't even see the versions we get
              | to see, it doesn't surprise me that they struggle with
              | these. It's not hard to make benchmarks so hard that
              | neither humans nor LLMs can do them.
        
         | jstummbillig wrote:
          | It's hard to be 100% certain, but I am 90% certain that the
          | benchmarks leveling off, at this point, should tell us that we
          | are really quite dumb and simply not very good at either
          | using or evaluating the technology (yet?).
        
         | XCSme wrote:
         | I remember the saying that from 90% to 99% is a 10x increase in
         | accuracy, but 99% to 99.999% is a 1000x increase in accuracy.
         | 
         | Even though it's a large10% increase first then only a 0.999%
         | increase.
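          | 
The arithmetic above checks out if you measure error rate rather than accuracy (a minimal sketch):

```python
# Error rate is 100 - accuracy; gains compound multiplicatively on the
# error side even when the accuracy gain looks small.

def error_reduction(acc_before: float, acc_after: float) -> float:
    """Factor by which the error rate shrinks between two accuracies."""
    return (100.0 - acc_before) / (100.0 - acc_after)

print(error_reduction(90.0, 99.0))    # 10x: 10% error -> 1% error
print(error_reduction(99.0, 99.999))  # ~1000x: 1% error -> 0.001% error
```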
        
       ___________________________________________________________________
       (page generated 2025-06-10 23:00 UTC)