[HN Gopher] Notes on OpenAI o3-mini
___________________________________________________________________
Notes on OpenAI o3-mini
Author : dtquad
Score : 127 points
Date : 2025-02-01 00:24 UTC (7 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| maxdo wrote:
| How would you rate it against Claude? I haven't tested it
| yet, but o1 pro didn't perform as well.
| pants2 wrote:
| I've been trying out o3-mini in Cursor today. It seems
| "smarter" but tends to overthink things, and if it isn't
| provided with perfect context it's prone to hallucinate.
| Overall I still prefer Sonnet. It has a certain magic of always
| making reasonable assumptions and finding simple solutions.
| firecall wrote:
| As an occasional user and fan of Cursor, it would be good if
| they could explain what the models are and why the different
| models exist.
|
| There's no obvious answer as to why one should switch to any
| of them!
| conception wrote:
| I don't think there's an obvious answer. Try them out and
| see which works better for your use case.
| kamikazeturtles wrote:
| There's a huge price difference between o3-mini and o1 ($4.40
| vs $60 per million output tokens). What trade-offs in
| performance would justify such a large price gap?
|
| Are there specific use cases where o1's higher cost is justified
| anymore?
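A quick sketch of the gap in the prices quoted above (figures taken from this comment and purely illustrative; real billing also includes input tokens, which are priced separately):

```python
# Output-token prices quoted in the thread, in USD per million
# tokens. Illustrative only; check the providers' pricing pages.
PRICE_PER_M = {"o1": 60.00, "o3-mini": 4.40}

def output_cost(model: str, tokens: int) -> float:
    """Cost in USD for `tokens` output tokens from `model`."""
    return PRICE_PER_M[model] * tokens / 1_000_000

# A hypothetical workload emitting 5 million output tokens:
tokens = 5_000_000
for model in PRICE_PER_M:
    print(f"{model}: ${output_cost(model, tokens):.2f}")

ratio = PRICE_PER_M["o1"] / PRICE_PER_M["o3-mini"]
print(f"o1 costs {ratio:.1f}x more per output token")  # ~13.6x
```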
| zamadatix wrote:
| Not really, it'll also be replaced by a newer o3 series model
| in short order.
| benatkin wrote:
| > Are there specific use cases where o1's higher cost is
| justified anymore?
|
| Long tail stuff perhaps. Most stuff doesn't resemble a
| programming benchmark. A newer model thrives despite being
| small when there is a lot of training data, and with
| programming benchmarks, like with chess, there is a lot of
| training data, in part because high quality training data can
| be synthesized.
| arthurcolle wrote:
| it's the same thing as:
|
| gpt-3.5 -> gpt-4 (gpt-4-32k premium)
|
| "omni" announced (multimodal fusion, initial promise of gpt-4o,
| but cost effectively distilled down with additional multimodal
| aspects)
|
| gpt-4o-mini -> gpt-4o (multimodal, realtime)
|
| gpt-4o + "reasoning" exposed via tools in ChatGPT (you can see
| it in export formats) -> "o" series
|
| o1 -> o1 premium / o1-mini (equivalent of gpt-4 "god model"
| becoming basis for lots of other stuff)
|
| o1-pro-mode, o1-premium, o1-mini; somewhere in there is the
| "o1-2024-12-17" model with no streaming, but with function
| calling, structured outputs, and vision
|
| now, distilled o1-pro-mode is probably o3-mini and o3-mini-
| high-mode (the naming is becoming just as bad as Android's)
|
| it's the same loop on repeat: take a model, scale it up, run
| evals, detect inefficiencies, retrain, scale, distill, see
| what's not working. When you find a good little zone on the
| efficiency frontier, release it with a cool name.
| brianbest101 wrote:
| OpenAI really needs to work on their naming conventions for
| these things.
| benatkin wrote:
| It's all based on _omni_, which to me has weird religious
| connotations. It just occurred to me to put it together with
| sama's other project, scanning everyone's eyes. That's one
| aspect of omniscience: keeping track of every soul.
|
| Another thing it seems similar to is how Jeff Bezos registered
| relentless.com. There seems to be a gap between the ideal
| branding from the perspective of the creators and branding that
| makes sense to consumers.
| xnx wrote:
| Hasn't Gemini pricing been lower than this (or even free) for
| a while? https://ai.google.dev/pricing
| BinRoo wrote:
| Are you insinuating Gemini is similar in performance to
| o3-mini?
| gerdesj wrote:
| Are you implying it isn't?
|
| (evidence please, everyone)
| BinRoo wrote:
| Simple example: o3-mini-high gets this [1] right, whereas
| Gemini 2.0 Flash 01-21 gets it wrong.
|
| [1] https://chatgpt.com/share/679d9579-5bb8-8008-ac4a-38cef65b45...
| xnx wrote:
| Great example. Thank you. Can confirm that none of the
| Gemini models warned about the exception without
| prompting.
| xnx wrote:
| Definitely varies by application, but the blind "taste test"
| vibes are very good for Gemini:
| https://lmarena.ai/?leaderboard
| anabab wrote:
| That reminds me of a post on Reddit from a week ago (now
| deleted, but a copy of the content is available in the
| comments) where the author claimed to have manipulated voting
| on lmarena in favor of Gemini, to tip the scales on a
| Polymarket question like "which AI model will be the best one
| by $date" (with the outcome decided by the lmarena scores).
| They supposedly made on the order of USD 10k.
|
| Original deleted post:
| https://old.reddit.com/r/MachineLearning/comments/1i83mhj/lm...
|
| A copy of the content:
| https://old.reddit.com/r/MachineLearning/comments/1i83mhj/lm...
| panarky wrote:
| I've only had o3-mini for a day, but Gemini 2.0 Flash
| Thinking is still clearly better for my use cases.
|
| And it's currently free in aistudio.google.com and in the
| API.
|
| And it handles a million tokens.
| tkgally wrote:
| At the end of his post, Simon mentions translation between human
| languages. While maybe not directly related to token limits, I
| just did a test in which both R1 and o3-mini got worse at
| translation in the latter half of a long text.
|
| I ran the test on Perplexity Pro, which hosts DeepSeek R1 in the
| U.S. and which has just added o3-mini as well. The text was a
| speech I translated a month ago from Japanese to English,
| preceded by a long prompt specifying the speech's purpose and
| audience and the sort of style I wanted. (I am a professional
| Japanese-English translator with nearly four decades of
| experience. I have been testing and using LLMs for translation
| since early 2023.)
|
| An initial comparison of the output suggested that, while R1
| didn't seem bad, o3-mini produced a writing style closer to what
| I asked for in the prompt--smoother and more natural English.
|
| But then I noticed that the output length was 5,855 characters
| for R1, 9,052 characters for o3-mini, and 11,021 characters for
| my own polished version. Comparing the three translations side-
| by-side with the original Japanese, I discovered that R1 had
| omitted entire paragraphs toward the end of the speech, and that
| o3-mini had switched to a strange abbreviated style (using
| slashes instead of "and" between noun phrases, for example)
| toward the end as well. The vanilla versions of ChatGPT, Claude,
| and Gemini that I ran the same prompt and text through a month
| ago had had none of those problems.
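The omissions this commenter caught by eye could also be flagged with a simple length check against a reference (character counts from the comment above; the 0.8 threshold is an arbitrary assumption, and a real check would compare paragraph by paragraph):

```python
# Sketch: flag machine translations that are suspiciously short
# relative to a reference length, hinting at omitted content.
def flag_short(outputs: dict[str, int], reference: int,
               threshold: float = 0.8) -> list[str]:
    """Return names whose output length falls below
    threshold * reference (possible truncation/omission)."""
    return [name for name, length in outputs.items()
            if length < threshold * reference]

# Character counts reported in the comment above; the reference
# is the translator's own polished version (11,021 chars).
lengths = {"R1": 5855, "o3-mini": 9052}
print(flag_short(lengths, reference=11021))  # ['R1']
```

Note that o3-mini's abbreviated-style drift slips under a pure length check, so this only catches gross omissions like R1's dropped paragraphs.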
| nycdatasci wrote:
| This is a great anecdote and I hope others can learn from it.
| R1, o1, and o3-mini work best on problems that have a "correct"
| answer (as in code that passes unit tests, or math problems).
| If multiple professional translators are given the same
| document to translate, is there a single correct translation?
| jakevoytko wrote:
| My wife is a professional translator and both revises others'
| work and gets revised. Based on numerous anecdotes from her,
| I can promise you that "single correct translation" does not
| exist.
| tkgally wrote:
| No. People's tastes and judgments vary too much.
|
| One fundamental area of disagreement is how closely a
| translation should reflect the content and structure of the
| original text versus how smooth and natural it should sound
| in the target language. With languages like Japanese or
| Chinese translated into English, for example, the vocabulary,
| grammar, and rhetoric can be very different between the
| languages. A close literal translation will usually seem
| awkward or even strange in English. To make the English seem
| natural, often you have to depart from what the original text
| says.
|
| Most translators will agree that where to aim on that
| spectrum should be based on the type of text and the reason
| for translating it, but they will still disagree about
| specific word choices. And there are genres for which there
| is no consensus at all about which approach is best. I have
| heard heated exchanges between literary scholars about
| whether or not translations of novels should reflect the
| original as closely as possible out of respect for the author
| and the author's cultural context, even if that means the
| translation seems awkward and difficult to understand to a
| casual reader.
|
| The ideal, of course, would be translations that are both
| accurate and natural, but it can be very hard to strike that
| balance. One way LLMs have been helping me is to suggest
| multiple rewordings of sentences and paragraphs. Many of
| their suggestions are no good, but often enough they include
| wordings that I recognize are better in both fidelity and
| naturalness compared to what I can come up with on my own.
| ec109685 wrote:
| Well, the post said o3-mini did great in the beginning, so
| it's likely something other than reasoning causing the poor
| performance towards the end.
| WhitneyLand wrote:
| How far off was o3 from the level of a professional translator
| (before it started to go off track)?
| simonw wrote:
| Yikes! Sounds to me like reliable longer form translation is
| very much not something you can trust to these models. Thanks
| for sharing.
| johngalt2600 wrote:
| So far I've been impressed... it seems to be in the same
| ballpark as R1 and Claude for coding. I'll have to gather more
| samples. In the past week I've gone from using Claude
| exclusively (since 3.5) to hitting all the big boys: Claude,
| R1, 4o (o3 now), and Gemini Flash. Then I'll start a new chat
| that includes all of their generated solutions as additional
| context for a refactored final solution.
|
| R1 has upped the ante, so I'm hoping we continue to get rapid
| updates... they are getting quite good.
| submeta wrote:
| > The model accepts up to 200,000 tokens of input, an improvement
| on GPT-4o's 128,000.
|
| So ChatGPT finally catches up with Claude, which has had a
| 200,000 token input limit for a while now.
|
| Claude, with its Projects feature, is my go-to tool for
| projects that I work on for weeks and months. Now I see a
| possible alternative.
___________________________________________________________________
(page generated 2025-02-01 08:00 UTC)