[HN Gopher] OpenAI o3-pro
___________________________________________________________________
OpenAI o3-pro
Author : mfiguiere
Score : 122 points
Date : 2025-06-10 20:15 UTC (2 hours ago)
(HTM) web link (help.openai.com)
(TXT) w3m dump (help.openai.com)
| mmsc wrote:
| I understand that things are moving fast and all, but surely
| the eight (?) models currently available are a bit
| overwhelming for users that just want to get answers to their
| questions of life? What's the end goal of having so many models
| available?
| Osyris wrote:
| This is a much more expensive model to run and is only
| available to users who pay the most. I don't see an issue.
|
| However, the "plus" plan absolutely could use some trimming.
| bachittle wrote:
| free users don't have this model selector, and probably don't
| care which model they get so 4o is good enough. paid users at
| 20$/month get more models which are better, like o3. paid users
| at 200$/month get the best models that are also costing OpenAI
| the most money, like o3-pro. I think they plan to unify them
| with GPT-5.
| stavros wrote:
| That doesn't help much when we're asymptotically approaching
| GPT-5. We're probably going to be at GPT-4.9999 soon.
| nikcub wrote:
| I'd be curious what proportion of paid users ever switch
| models. I'd guess < 10%
| CamperBob2 wrote:
| I switch to o1-pro on occasion, but it is slow enough that
| I don't use it as much as some of the others. It is a
| reasonably-effective last resort when I'm not getting the
| answer quality that I think should be achievable. It's the
| best available reasoning model from any provider by a
| noticeable margin.
|
| Sounds like o3-pro is even slower, which is fine as long as
| it's better.
|
| o4-mini-high is my usual go-to model if I need something
| better than the default GPT-4 du jour. I don't see much
| point in the others and don't understand why they remain
| available. If o3-pro really is consistently better, it will
| move o1-pro into that category for me.
| CuriouslyC wrote:
| If you're not at least switching from 4o to 4.1 you're
| doing it wrong.
| macawfish wrote:
| Overwhelming yet pretty underwhelming
| nickysielicki wrote:
| I just can't believe nobody at the company has enough courage
| to tell their leadership that their naming scheme is completely
| stupid and insane. Four is greater than three, and so four
| should be better than three. The point of a name is to describe
| something so that you don't confuse your users, not to be cute.
| browningstreet wrote:
| At TechCrunch AI last week, the OpenAI guy started his
| presentation by acknowledging that OpenAI knows their naming
| is a problem and they're working on it, but it won't be fixed
| immediately.
| moomin wrote:
| I know they have a deep relationship with Microsoft, but
| perhaps they shouldn't have used Microsoft's product naming
| department.
| orra wrote:
| Zune .NET O3... _shudders_
| simonw wrote:
| Sam Altman has said the same thing on Twitter a few times.
| https://x.com/sama/status/1911906570835022319
|
| > how about we fix our model naming by this summer and
| everyone gets a few more months to make fun of us (which we
| very much deserve) until then?
| nickysielicki wrote:
| I'd prefer for them to just fix it asap instead and then
| keep the existing endpoints around for a year as aliases.
| MallocVoidstar wrote:
| The reason their naming scheme is so bad is because their
| initial attempts at GPT-5 failed in training. It was supposed
| to be done ~1 year ago. Because they'd promised that GPT-5
| would be vastly more intelligent than GPT-4, they couldn't
| just name any random model "GPT-5", so they suddenly had to
| start naming things differently. So now there's GPT-4.5,
| GPT-4.1, the o-series, ...
| transcriptase wrote:
| What's worse is that the app doesn't even have descriptions.
| As if I'm supposed to memorize the use case for each based
| on:
|
| GPT-4o
|
| o3
|
| o4-mini
|
| o4-mini-high
|
| GPT-4.5
|
| GPT-4.1
|
| GPT-4.1-mini
| koakuma-chan wrote:
| Just use o4-mini for everything
| aetherspawn wrote:
| Came here to say this: the naming scheme is ridiculous and
| gets harder to follow by the day.
|
| For example, the other day they released a supposedly better
| model with a lower number...
| aetherspawn wrote:
| I'd honestly prefer they just have 3 personas of varying
| cost/intelligence (Sam, Elmo and Einstein or something), tack
| on the date (elmo-2025-1), and silently delete the old ones.
| levocardia wrote:
| There's a humorous version of Poe's law that says "any
| sufficiently genuine attempt to explain the differences between
| OpenAI's models is indistinguishable from parody"
| paxys wrote:
| > users that just want to get answers to their questions of
| life
|
| Those users go to chat.openai.com (or download the app), type
| text in the box and click send.
| AtlasBarfed wrote:
| I'd like one to do my test use case:
|
| Port unix-sed from c to java with a full test suite and all
| options supported.
|
| Somewhere between "it answers questions of life" and "it beats
| PhDs at math questions", I'd like to see one LLM take this,
| IMO, rather "pure" language task and succeed.
|
| It is complicated, but it isn't complex. It's string operations
| with a deep but not that deep expression system and flag set.
|
| It is well-described and documented on the internet, and
| presumably training sets. It is succinctly described as a
| problem that virtually all computer coders would understand
| what it entailed if it were assigned to them. It is drudge
| work, an opportunity for LLMs to demonstrate how they would
| improve real productivity.
|
| GPT fails to do anything other than the most basic substitute
| operations. Claude was only slightly better, but to its
| detriment hallucinated massive amounts and made fake passing
| test cases that didn't even test the code.
|
| The reaction I get to this test is ambivalence, but IMO if LLMs
| could help port entire software packages between languages with
| similar feature sets (aside from Turing Completeness), then
| software cross-use would explode, and maybe we could port
| "vulnerable" code to "safe" Rust en masse.
|
| I get it, it's not what they are chasing customer-wise. They
| want to write (in n-gate terms) webcrap.
| CamperBob2 wrote:
| How does the latest Gemini 2.5 Pro Ultra Flash Max Hemi XLT
| release do on that task? It obviously demands a massive
| context window.
| resters wrote:
| Models are used for actual tasks where predictable behavior is
| a benefit. Models are also used on cutting-edge tasks where
| smarter/better outputs are highly valued. Some applications
| value speed and so a new, smaller/cheaper model can be just
| right.
|
| I think the naming scheme is just fine and is very
| straightforward to anyone who pays the slightest bit of
| attention.
| manmal wrote:
| The benchmarks don't look _that_ much better than o3. Does that
| mean Pro models are just incrementally better than base models,
| or are we approaching the higher end of a sigmoid function, with
| performance gains leveling off?
| bachittle wrote:
| it's the same model as o3, just with thinking tokens turned up
| to the max.
| Tiberium wrote:
| That's simply not true, it's not just "max thinking budget
| o3" just like o1-pro wasn't "max thinking budget o1". The
| specifics are unknown, but they might be doing multiple model
| generations and then somehow picking the best answer each
| time? Of course that's a gross simplification, but some
| assume that they do it this way.
| cdblades wrote:
| > That's simply not true, it's not just "max thinking
| budget o3"
|
| > The specifics are unknown, but they might...
|
| Hold up.
|
| > but some assume that they do it this way.
|
| Come on now.
| MallocVoidstar wrote:
| Good luck finding the tweet (I can't) but at least one
| OpenAI engineer has said that o1-pro was _not_ just 'o1
| thinking longer'.
| boole1854 wrote:
| I also don't have that tweet saved, but I do remember it.
| PhilippGille wrote:
| This one? Found with Kagi Assistant.
|
| https://x.com/michpokrass/status/1869102222598152627
|
| It says:
|
| > hey aidan, not a miscommunication, they are different
| products! o1 pro is a different implementation and not
| just o1 with high reasoning.
| firejake308 wrote:
| > "We also introduced OpenAI o3-pro in the API--a version
| of o3 that uses more compute to think harder and provide
| reliable answers to challenging problems"
|
| Sounds like it is just o3 with higher thinking budget to me
| dyauspitr wrote:
| Don't they have a full fledged version of o4 somewhere
| internally at this point?
| ankit219 wrote:
| It seems they do. o1 and o3 were based on the same base
| model. o4 is going to be based on a newer (and perhaps
| smarter) base model.
| tiahura wrote:
| So, upgrade to Teams and pay the $50? Plus more usage of o3.
| Seems like it might be a shot at the $100 Claude Max?
| dog436zkj3p7 wrote:
| What do you mean with "pay the $50"?
|
| Also, does anybody know what limits o3-pro has under the team
| plan? I don't see it available in the model picker at all (on
| team).
| cluckindan wrote:
| "designed to think longer"
|
| Translation: designed to fool people into believing it's thinking
| longer, or at all.
| Workaccount2 wrote:
| With Gemini 2.5 in AI Studio you can now increase the amount of
| thinking tokens, and it definitely makes a difference. o3-pro
| is most likely o3 with an expanded thinking token budget.
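|
| For reference, raising the budget looks roughly like this with
| the google-genai Python SDK (a sketch; the thinking_budget knob
| and which models honor it may differ from what's shown):
|
|     from google import genai
|     from google.genai import types
|
|     client = genai.Client()  # API key read from environment
|
|     # Cap the hidden reasoning tokens; a larger budget lets
|     # the model "think" longer before committing to an answer.
|     response = client.models.generate_content(
|         model="gemini-2.5-flash",  # illustrative model id
|         contents="Prove there are infinitely many primes.",
|         config=types.GenerateContentConfig(
|             thinking_config=types.ThinkingConfig(
|                 thinking_budget=8192,
|             ),
|         ),
|     )
|     print(response.text)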
| energy123 wrote:
| Isn't that just increasing the upper bound on thinking
| tokens, which is rarely hit even on much lower levels?
| cluckindan wrote:
| It is not thinking. It is trying to deceive you. The
| "reasoning" it outputs does not have a causal relationship
| with the end result.
| benxh wrote:
| The longer "it" reasons, the more attention sinks are used
| to come to a "better" final output.
| manmal wrote:
| I've looked up attention sinks and can't figure out how
| you're using the term here. It sounds interesting, would
| you care to elaborate?
| Sohcahtoa82 wrote:
| > The "reasoning" it outputs does not have a causal
| relationship with the end result.
|
| It absolutely does.
|
| Now, we can argue all about whether it's truly "reasoning",
| but I've certainly seen cases where if you ask it a
| question but say "Give just the answer", it'll consistently
| give a wrong answer, whereas if you let it explain its
| thought process before giving a final answer, it'll
| consistently get it right.
|
| LLMs are at their core just next-token guessing machines.
| By allowing them to output extra "reasoning" tokens, it can
| prime the context to give better answers.
|
| Think of it like solving an algebraic equation. Humans
| can't typically solve any but the most trivial equations in
| a single step, and neither can an LLM. But like a human, an
| LLM can solve one if it takes it one step at a time.
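|
| A minimal way to see this for yourself (a sketch using the
| openai Python SDK; the model id and the bat-and-ball question
| are just illustrative):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     q = ("A bat and a ball cost $1.10 together. The bat "
|          "costs $1.00 more than the ball. "
|          "How much does the ball cost?")
|
|     # Forced-terse: the first emitted tokens must BE the
|     # answer, with no intermediate context to prime them.
|     terse = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[{"role": "user",
|                    "content": q + " Give just the answer."}],
|     )
|
|     # Reason-first: earlier tokens condition later ones, so
|     # the final answer benefits from the worked steps.
|     stepwise = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[{"role": "user",
|                    "content": q + " Think step by step, "
|                               "then state the answer."}],
|     )
|
| Run both a handful of times and compare how often each lands
| on $0.05.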
| dbbk wrote:
| Or my favourite, tell Claude to "ultrathink"
| ChrisArchitect wrote:
| Related:
|
| _OpenAI dropped the price of o3 by 80%_
|
| https://news.ycombinator.com/item?id=44239359
| swyx wrote:
| here's a nice user review we published:
| https://www.latent.space/p/o3-pro
|
| sama's highlight[0]:
|
| > "The plan o3 gave us was plausible, reasonable; but the plan o3
| Pro gave us was specific and rooted enough that it actually
| changed how we are thinking about our future."
|
| I kept nudging the team to go the whole way to just let o3 be
| their CEO but they didn't bite yet haha
|
| 0: https://x.com/sama/status/1932533208366608568
| tomComb wrote:
| Big fan, swyx, but both here and in the article there is some
| bragging about being quoted by sama, and while I acknowledge
| that's not out of the ordinary, I'm concerned about where it
| leads: what it takes to get quoted by sama (or a similarly
| interested party) is saying something good about his product
| and having a decent follower count.
|
| Dangerous incentives, IMO.
| WhitneyLand wrote:
| So, we currently have o4-mini and o4-mini-high, which represent
| medium and high amounts of "thinking", i.e. reasoning-token use.
|
| This announcement adds o3-pro, which pairs with o3 in the same
| way the o4 models go together.
|
| It should be called o3-high, but to align with the $200 pro
| membership it's called pro instead.
|
| That said, o3 is already an incredibly powerful model. I prefer
| it over the new Anthropic 4 models and Gemini 2.5. Its raw power
| seems similar to those others, but it's so good at inline tool
| use that it usually comes out ahead overall.
|
| Any non-trivial code generation/editing should be using an
| advanced reasoning model, or else you're losing time fixing more
| glitches or missing out on better quality solutions.
|
| Of course the caveat is cost, but there's value on the frontier.
| boole1854 wrote:
| No, this doesn't seem to be correct, although confusion
| regarding model names is understandable.
|
| o4-mini-high is the label on chatgpt.com for what in the API is
| called o4-mini with reasoning={"effort": "high"}. Whereas
| o4-mini on chatgpt.com is the same thing as
| reasoning={"effort": "medium"} in the API.
|
| o3 can also be run via the API with reasoning={"effort":
| "high"}.
|
| o3-pro is _different_ from o3 with high reasoning. It has a
| separate endpoint, and it runs for much longer.
|
| See https://platform.openai.com/docs/guides/reasoning?api-
| mode=r...
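|
| In API terms the mapping looks something like this (a sketch
| with the openai Python SDK; the prompts are placeholders):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     # chatgpt.com's "o4-mini-high" is plain o4-mini with the
|     # reasoning effort raised from the default "medium".
|     high = client.responses.create(
|         model="o4-mini",
|         reasoning={"effort": "high"},
|         input="your prompt here",
|     )
|
|     # o3-pro is a separate model id, not an effort setting
|     # on o3, and it runs much longer.
|     pro = client.responses.create(
|         model="o3-pro",
|         input="your prompt here",
|     )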
| chad1n wrote:
| The guys in the other thread who said that OpenAI might have
| quantized o3, and that that's how they reduced the price, might
| be right. This o3-pro might be the actual o3-preview from the
| beginning, and the o3 might be just a quantized version. I wish
| someone would benchmark all of these models to check for drops
| in quality.
| simonw wrote:
| That's definitely not the case here. The new o3-pro is _slow_ -
| it took two minutes just to draw me an SVG of a pelican riding
| a bicycle. o3-preview was much faster than that.
|
| https://simonwillison.net/2025/Jun/10/o3-pro/
| k2xl wrote:
| Not distilled, same model. https://x.com/therealadamg/status/
| 1932534244774957121?s=46&t...
| CamperBob2 wrote:
| Would you say this is the best cycling pelican to date? I
| don't remember any of the others looking better than this.
|
| Of course by now it'll be in-distribution. Time for a new
| benchmark...
| jstummbillig wrote:
| I love that we are in the timeline where we somewhat
| seriously evaluate probably-superhuman intelligence by
| its ability to draw an SVG of a cycling pelican.
| CamperBob2 wrote:
| I still remember my jaw hitting the floor when the first
| DALL-E paper came out, with the baby daikon radish
| walking a dog. How the actual _fuck_...? Now we're
| probably all too jaded to fully appreciate the next
| advance of that magnitude, whatever that turns out to be.
|
| E.g., the pelicans all look pretty cruddy including this
| one, but the fact that they are being delivered in .SVG
| is a bigger deal than the quality of the artwork itself,
| IMHO. This isn't a diffusion model, it's an
| autoregressive transformer imitating one. The wonder
| isn't that it's done badly, it's that it's happening at
| all.
| AstroBen wrote:
| That's one good looking pelican
| gkamradt wrote:
| o3-pro is not the same as the o3-preview that was shown in Dec
| '24. OpenAI confirmed this for us. More on that here:
| https://x.com/arcprize/status/1932535380865347585
| weinzierl wrote:
| Is there a way to infer likely quantization from the output?
| I mean, does quantization degrade output quality in ways that
| are distinguishable from changes to other model properties
| (e.g. size or distillation)?
| DanMcInerney wrote:
| I'm really hoping GPT-5 is a larger jump in metrics than the
| last several releases we've seen, like Claude 3.5 to Claude 4
| or o3-mini-high to o3-pro. Although I will preface that with
| the fact that I've
| been building agents for about a year now and despite the
| benchmarks only showing slight improvement, I have seen that each
| new generation feels actively better at exactly the same tasks I
| gave the previous generation.
|
| It would be interesting if there was a model that was
| specifically trained on task-oriented data. It's my understanding
| they're trained on all data available, but I wonder if it can be
| fine-tuned or given some kind of reinforcement learning on
| breaking down general tasks to specific implementations.
| Essentially an agent-specific model.
| codingwagie wrote:
| I'm seeing big advances that aren't shown in the benchmarks: I
| can simply build software now that I couldn't build before. The
| level of complexity that I can manage and deliver is higher.
| shmoogy wrote:
| Yeah, I kind of feel like I'm not moving as fast as I used to,
| because the complexity and features grow: constant scope
| creep due to moving faster.
| energy123 wrote:
| That would require AIME 2024 going above 100%.
|
| There was always going to be diminishing returns in these
| benchmarks. It's by construction. It's mathematically
| impossible for that not to happen. But it doesn't mean the
| models are getting better at a slower pace.
|
| Benchmark space is just a proxy for what we care about, but
| don't confuse it for the actual destination.
|
| If you want, you can choose to look at a different set of
| benchmarks like ARC-AGI-2 or Epoch and observe greater than
| linear improvements, and forget that these easier benchmarks
| exist.
| croddin wrote:
| There is still plenty of room for growth on the ARC-AGI
| benchmarks. ARC-AGI 2 is still <5% for o3-pro and ARC-AGI 1
| is only at 59% for o3-pro-high:
|
| "ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task
| * High: 59%, $4.16/task
|
| ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task
|
| Takeaways: * o3-pro in line with o3 performance * o3's new
| price sets the ARC-AGI-1 Frontier"
|
| - https://x.com/arcprize/status/1932535378080395332
| saberience wrote:
| I'm not sure the ARC-AGI benchmarks are interesting: for one,
| they are image-based, and for two, most people I show them
| to have issues understanding them; in fact, I had issues
| understanding them myself.
|
| Given the models don't even see the versions we get to see,
| it doesn't surprise me they have issues with these. It's not
| hard to make benchmarks so hard that neither humans nor
| LLMs can do them.
| jstummbillig wrote:
| It's hard to be 100% certain, but I am 90% certain that the
| benchmarks leveling off, at this point, should tell us that we
| are really quite dumb and simply not very good at either
| using or evaluating the technology (yet?).
| XCSme wrote:
| I remember the saying that going from 90% to 99% is a 10x
| reduction in error rate, but 99% to 99.999% is a 1000x
| reduction.
|
| Even though the first is a large 9-point jump in accuracy and
| the second only 0.999 points.
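|
| Concretely, the factor lives in the error rate, not the
| accuracy (quick check in Python):
|
|     # error shrinks 10x going 90% -> 99%, and 1000x going
|     # 99% -> 99.999%
|     for a, b in [(0.90, 0.99), (0.99, 0.99999)]:
|         factor = (1 - a) / (1 - b)
|         print(f"{a} -> {b}: error cut {factor:.0f}x")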
___________________________________________________________________
(page generated 2025-06-10 23:00 UTC)