[HN Gopher] Notes on OpenAI o3-mini
       ___________________________________________________________________
        
       Notes on OpenAI o3-mini
        
       Author : dtquad
       Score  : 127 points
       Date   : 2025-02-01 00:24 UTC (7 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | maxdo wrote:
        | How would you rate it against Claude? I haven't tested it
        | yet, but o1 pro didn't perform as well.
        
         | pants2 wrote:
          | I've been trying out o3-mini in Cursor today. It seems
          | "smarter", but it tends to overthink things, and if it's
          | not given perfect context it's prone to hallucinate.
          | Overall I still prefer Sonnet. It has a certain magic of
          | always making reasonable assumptions and finding simple
          | solutions.
        
           | firecall wrote:
            | As an occasional user and fan of Cursor, it would be
            | good if they could explain what the models are and why
            | the different models exist.
            | 
            | There's no obvious answer as to why one should switch
            | to any of them!
        
             | conception wrote:
             | I don't think there's an obvious answer. Try them out and
             | see which works better for your use case.
        
       | kamikazeturtles wrote:
        | There's a huge price difference between o3-mini and o1
        | ($4.40 vs. $60 per million output tokens). What trade-offs
        | in performance would justify such a large gap?
       | 
       | Are there specific use cases where o1's higher cost is justified
       | anymore?
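        | 
        | For scale, a back-of-envelope sketch of the gap (output
        | tokens only, at the prices quoted above; the monthly
        | volume is hypothetical):
        | 
        |     # Prices per million output tokens, as quoted above.
        |     O3_MINI_PRICE = 4.40   # USD
        |     O1_PRICE = 60.00       # USD
        |     
        |     tokens = 250_000  # hypothetical monthly output volume
        |     print(f"o3-mini: ${tokens / 1e6 * O3_MINI_PRICE:.2f}")
        |     print(f"o1:      ${tokens / 1e6 * O1_PRICE:.2f}")
        |     print(f"ratio:   {O1_PRICE / O3_MINI_PRICE:.1f}x")
        |     # -> o3-mini: $1.10 / o1: $15.00 / ratio: 13.6x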
        
         | zamadatix wrote:
         | Not really, it'll also be replaced by a newer o3 series model
         | in short order.
        
         | benatkin wrote:
         | > Are there specific use cases where o1's higher cost is
         | justified anymore?
         | 
          | Long-tail stuff, perhaps. Most tasks don't resemble a
          | programming benchmark. A small, newer model thrives when
          | there is a lot of training data, and for programming
          | benchmarks, as with chess, there is a lot of it, in part
          | because high-quality training data can be synthesized.
        
         | arthurcolle wrote:
          | It's the same progression as:
          | 
          | gpt-3.5 -> gpt-4 (gpt-4-32k premium)
          | 
          | "omni" announced (multimodal fusion; the initial promise
          | of gpt-4o, cost-effectively distilled down with
          | additional multimodal aspects)
          | 
          | gpt-4o-mini -> gpt-4o (multimodal, realtime)
          | 
          | gpt-4o + "reasoning" exposed via tools in ChatGPT (you
          | can see it in export formats) -> the "o" series
          | 
          | o1 -> o1 premium / o1-mini (the equivalent of the gpt-4
          | "god model" becoming the basis for lots of other stuff)
          | 
          | o1-pro-mode, o1-premium, o1-mini; somewhere in there is
          | the "o1-2024-12-17" model with no streaming but with
          | function calling, structured outputs, and vision
          | 
          | Now a distilled o1-pro-mode is probably o3-mini and
          | o3-mini-high (the naming is becoming just as bad as
          | Android's).
          | 
          | It's the same cycle: take a model, scale it up, run
          | evals, detect inefficiencies, retrain, scale, distill,
          | see what's not working. When you find a good spot on the
          | efficiency frontier, release it with a cool name.
        
       | brianbest101 wrote:
        | OpenAI really needs to work on its naming conventions for
        | these things.
        
         | benatkin wrote:
          | It's all based on _omni_, which to me has weird religious
          | connotations. It just occurred to me to put it together
          | with sama's other project, scanning everyone's eyes.
          | That's one aspect of omniscience: keeping track of every
          | soul.
          | 
          | It also seems similar to how Jeff Bezos registered
          | relentless.com. There seems to be a gap between the ideal
          | branding from the creators' perspective and branding that
          | makes sense to consumers.
        
       | xnx wrote:
        | Hasn't Gemini pricing been lower than this (or even free)
        | for a while? https://ai.google.dev/pricing
        
         | BinRoo wrote:
         | Are you insinuating Gemini is similar in performance to
         | o3-mini?
        
           | gerdesj wrote:
           | Are you implying it isn't?
           | 
           | (evidence please, everyone)
        
             | BinRoo wrote:
             | Simple example: o3-mini-high gets this [1] right, whereas
             | Gemini 2.0 Flash 01-21 gets it wrong.
             | 
             | [1] https://chatgpt.com/share/679d9579-5bb8-8008-ac4a-38cef
             | 65b45...
        
               | xnx wrote:
               | Great example. Thank you. Can confirm that none of the
               | Gemini models warned about the exception without
               | prompting.
        
           | xnx wrote:
           | Definitely varies by application, but the blind "taste test"
           | vibes are very good for Gemini:
           | https://lmarena.ai/?leaderboard
        
             | anabab wrote:
              | That reminds me: a week ago there was a post on
              | Reddit (now deleted, but a copy of the content is
              | available in the comments) whose author claimed to
              | have manipulated voting on lmarena in favor of Gemini
              | to tip the scales on Polymarket. On a question like
              | "which AI model will be the best one by $date" (with
              | the outcome decided by lmarena's scores), they
              | supposedly made on the order of USD 10k.
              | 
              | Original deleted post: https://old.reddit.com/r/MachineLear
              | ning/comments/1i83mhj/lm...
              | 
              | A copy of the content: https://old.reddit.com/r/MachineLear
              | ning/comments/1i83mhj/lm...
        
           | panarky wrote:
           | I've only had o3-mini for a day, but Gemini 2.0 Flash
           | Thinking is still clearly better for my use cases.
           | 
           | And it's currently free in aistudio.google.com and in the
           | API.
           | 
           | And it handles a million tokens.
        
       | tkgally wrote:
       | At the end of his post, Simon mentions translation between human
       | languages. While maybe not directly related to token limits, I
       | just did a test in which both R1 and o3-mini got worse at
       | translation in the latter half of a long text.
       | 
       | I ran the test on Perplexity Pro, which hosts DeepSeek R1 in the
       | U.S. and which has just added o3-mini as well. The text was a
       | speech I translated a month ago from Japanese to English,
       | preceded by a long prompt specifying the speech's purpose and
       | audience and the sort of style I wanted. (I am a professional
       | Japanese-English translator with nearly four decades of
       | experience. I have been testing and using LLMs for translation
       | since early 2023.)
       | 
       | An initial comparison of the output suggested that, while R1
       | didn't seem bad, o3-mini produced a writing style closer to what
       | I asked for in the prompt--smoother and more natural English.
       | 
       | But then I noticed that the output length was 5,855 characters
       | for R1, 9,052 characters for o3-mini, and 11,021 characters for
       | my own polished version. Comparing the three translations side-
       | by-side with the original Japanese, I discovered that R1 had
       | omitted entire paragraphs toward the end of the speech, and that
       | o3-mini had switched to a strange abbreviated style (using
       | slashes instead of "and" between noun phrases, for example)
       | toward the end as well. The vanilla versions of ChatGPT, Claude,
       | and Gemini that I ran the same prompt and text through a month
       | ago had had none of those problems.
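        | 
        | A crude way to catch that kind of silent degradation is to
        | compare output length against a trusted reference. A
        | minimal sketch using the character counts above (the 0.8
        | threshold is arbitrary, not a calibrated value):
        | 
        |     # Flag translations that are suspiciously short
        |     # relative to a trusted reference translation.
        |     def looks_truncated(candidate_len: int,
        |                         reference_len: int,
        |                         min_ratio: float = 0.8) -> bool:
        |         return candidate_len / reference_len < min_ratio
        |     
        |     # Counts from the test above (reference: 11,021 chars).
        |     for name, n in [("R1", 5855), ("o3-mini", 9052)]:
        |         print(f"{name}: {n / 11021:.0%}",
        |               looks_truncated(n, 11021))
        |     # -> R1: 53% True
        |     # -> o3-mini: 82% False
        | 
        | Note that a length check catches outright omissions (R1's
        | dropped paragraphs) but not style shifts like o3-mini's
        | abbreviated endings.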
        
         | nycdatasci wrote:
         | This is a great anecdote and I hope others can learn from it.
         | R1, o1, and o3-mini work best on problems that have a "correct"
         | answer (as in code that passes unit tests, or math problems).
         | If multiple professional translators are given the same
         | document to translate, is there a single correct translation?
        
           | jakevoytko wrote:
           | My wife is a professional translator and both revises others'
           | work and gets revised. Based on numerous anecdotes from her,
           | I can promise you that "single correct translation" does not
           | exist.
        
           | tkgally wrote:
           | No. People's tastes and judgments vary too much.
           | 
           | One fundamental area of disagreement is how closely a
           | translation should reflect the content and structure of the
           | original text versus how smooth and natural it should sound
           | in the target language. With languages like Japanese or
           | Chinese translated into English, for example, the vocabulary,
           | grammar, and rhetoric can be very different between the
           | languages. A close literal translation will usually seem
           | awkward or even strange in English. To make the English seem
           | natural, often you have to depart from what the original text
           | says.
           | 
           | Most translators will agree that where to aim on that
           | spectrum should be based on the type of text and the reason
           | for translating it, but they will still disagree about
           | specific word choices. And there are genres for which there
           | is no consensus at all about which approach is best. I have
           | heard heated exchanges between literary scholars about
           | whether or not translations of novels should reflect the
           | original as closely as possible out of respect for the author
           | and the author's cultural context, even if that means the
           | translation seems awkward and difficult to understand to a
           | casual reader.
           | 
           | The ideal, of course, would be translations that are both
           | accurate and natural, but it can be very hard to strike that
           | balance. One way LLMs have been helping me is to suggest
           | multiple rewordings of sentences and paragraphs. Many of
           | their suggestions are no good, but often enough they include
           | wordings that I recognize are better in both fidelity and
           | naturalness compared to what I can come up with on my own.
        
           | ec109685 wrote:
           | Well, the post said o3-mini did great in the beginning, so
           | it's likely something other than reasoning causing the poor
           | performance towards the end.
        
         | WhitneyLand wrote:
         | How far off was o3 from the level of a professional translator
         | (before it started to go off track)?
        
         | simonw wrote:
          | Yikes! Sounds to me like reliable longer-form translation
          | is very much not something you can trust to these models.
          | Thanks for sharing.
        
       | johngalt2600 wrote:
        | So far I've been impressed. It seems to be in the same
        | ballpark as R1 and Claude for coding, though I'll have to
        | gather more samples. In the past week I've gone from using
        | Claude exclusively (since 3.5) to hitting all the big boys:
        | Claude, R1, 4o (o3 now), and Gemini Flash. Then I'll start
        | a new chat that includes all of their generated solutions
        | as additional context for a refactored final solution.
        | 
        | R1 has upped the ante, so I'm hoping we continue to get
        | rapid updates... they are getting quite good.
        
       | submeta wrote:
        | > The model accepts up to 200,000 tokens of input, an
        | improvement on GPT-4o's 128,000.
        | 
        | So ChatGPT finally catches up with Claude, which has had a
        | 200,000-token input limit for a long time.
        | 
        | Claude, with its Projects feature, is my go-to tool for
        | projects I work on for weeks and months. Now I see a
        | possible alternative.
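        | 
        | A quick way to check whether a long project document fits
        | in that window, assuming tiktoken's o200k_base encoding
        | (GPT-4o's) approximates o3-mini's tokenizer -- an
        | assumption, since the exact tokenizer isn't documented:
        | 
        |     import tiktoken
        |     
        |     enc = tiktoken.get_encoding("o200k_base")
        |     
        |     def fits_context(text: str,
        |                      limit: int = 200_000) -> bool:
        |         # Count tokens and compare against the window.
        |         n = len(enc.encode(text))
        |         print(f"{n:,} tokens (limit {limit:,})")
        |         return n <= limit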
        
       ___________________________________________________________________
       (page generated 2025-02-01 08:00 UTC)