[HN Gopher] OpenAI O3-Mini
       ___________________________________________________________________
        
       OpenAI O3-Mini
        
       Author : johnneville
       Score  : 693 points
       Date   : 2025-01-31 19:08 UTC (12 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | brcmthrowaway wrote:
       | Gamechanger?
        
         | aspect0545 wrote:
         | It's always a game changer, isn't it?
        
         | colonelspace wrote:
         | Yep.
         | 
         | And groundbreaking.
        
           | maeil wrote:
           | _It changes the landscape with its multifaceted approach_.
        
         | sss111 wrote:
         | This time, it is a save-face release, especially because Azure,
         | AWS, and OpenRouter have started offering DeepSeek
        
         | 42lux wrote:
         | Is it AGI yet?
        
       | sss111 wrote:
       | So far, it seems like this is the hierarchy
       | 
       | o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini
       | 
       | o3 mini system card: https://cdn.openai.com/o3-mini-system-
       | card.pdf
        
         | forrestthewoods wrote:
         | OpenAI needs a new branding scheme.
        
           | airstrike wrote:
           | ChatGPT Series X O one
        
           | nejsjsjsbsb wrote:
           | The Llama folk know how. Good old 90s version scheme.
        
         | losvedir wrote:
         | What about "o1 Pro mode". Is that just o1 but with more
         | reasoning time, like this new o3-mini's different amount of
         | reasoning options?
        
           | MichaelBurge wrote:
           | o1-pro is a different model than o1.
        
             | JohnPrine wrote:
             | I don't think this is true
        
             | losvedir wrote:
             | Are you sure? Do you have any source for that? In this
             | article[0] that was discussed here on HN this week, they
             | say (claim):
             | 
             | > In fact, the O1 model used in OpenAI's ChatGPT Plus
             | subscription for $20/month is basically the same model as
             | the one used in the O1-Pro model featured in their new
             | ChatGPT Pro subscription for 10x the price ($200/month,
             | which raised plenty of eyebrows in the developer
             | community); the main difference is that O1-Pro thinks for a
             | lot longer before responding, generating vastly more COT
             | logic tokens, and consuming a far larger amount of
             | inference compute for every response.
             | 
             | Granted "basically" is pulling a lot of weight there, but
             | that was the first time I'd seen anyone speculate either
             | way.
             | 
             | [0] https://youtubetranscriptoptimizer.com/blog/05_the_shor
             | t_cas...
        
           | bobjordan wrote:
           | I have been paying $200 per month for 01-pro mode and I am
           | very disappointed right now because they have completely
           | replaced the model today. It used to think for 1-5 minutes
           | and deliver an unbelievably useful one-shot answer. Now, it
           | only thinks for 7 seconds just like the 03-mini model and I
           | can't tell the difference in the answers. I hope this is just
           | a day 1 implementation bug but I suspect they have just
           | decided to throw the $200 per month customers under the bus
           | so that they'd have more capacity to launch the 03 model for
           | everybody. I can't tell the difference between the models now
           | and it is definitely not because the free 03 model delivers
           | the quality that 01-pro-mode had! I'm so disappointed!
        
         | sho_hn wrote:
         | I think OpenAI really needs to rethink its product naming,
         | especially now that they have a portfolio where there's no such
         | clear hierarchy, but they have a place along different axis
         | (speed, cost, reasoning, capabilities, etc).
         | 
         | Your summary attempt e.g. also misses o3-mini vs o3-mini-high.
         | Lots of trade-ofs.
        
           | echelon wrote:
           | It's like AWS SKU naming (`c5d.metal`, `p5.48xlarge`, etc.),
           | except non-technical consumers are expected to understand it.
        
             | maeil wrote:
             | Have you seen Azure VM SKU naming? It's.. impressive.
        
               | buildbot wrote:
               | And it doesn't even line up with the actual instances
               | you'll be offered. At one point I was using some random
               | Nvidia A10 node that was supposed to be similar to
               | Standard_NV36adms_A10_v5, but was an NC series for some
               | reason with slightly different letters...
        
             | nejsjsjsbsb wrote:
             | Those are not names but hashes used to look up the specs.
        
               | echelon wrote:
               | I was thinking we might treat model names analogously,
               | but their specs can be moving targets.
        
           | sss111 wrote:
           | Yeah I tried my best :(
           | 
           | I think they could've borrowed a page out of Apple's book,
           | even mountain names would be better. Plus Sonoma, Ventura,
           | and Yosemite are cool names.
        
           | rf15 wrote:
           | They're strongly tied to Microsoft, so confusing branding is
           | to be expected.
        
             | chris_va wrote:
             | One of my favorite parodies:
             | https://www.youtube.com/watch?v=EUXnJraKM3k
        
             | ANewFormation wrote:
             | I can't wait for Project Unify which just devolves into a
             | brand new p3-mini type naming convention. It's pretty much
             | identical to the o3-mini, except the API is changed just
             | enough to be completely incompatible and it crashes on any
             | query using a word with more than two syllables. Fix coming
             | soon, for 4 years so far.
             | 
             | On the bright side the app now has curved edges!
        
             | ngokevin wrote:
             | It needs to be clowned on here:
             | 
             | - Xbox, Xbox 360, Xbox One, Xbox One S/X, Xbox Series S/X
             | 
             | - Windows 3.1...98, 2000, ME, XP, Vista, 7, 8, 10
             | 
             | I guess it's better than headphones names (QC35,
             | WH-1000XM3, M50x, HD560s).
        
             | nejsjsjsbsb wrote:
             | Flashbacks of the .NET zoo. At least they reigned that in.
        
           | kaaskop wrote:
           | Yeah their naming scheme is super confusing, I honestly
           | confuse them all the time.
        
           | Euphorbium wrote:
           | They can still do models o3o, oo3 and 3oo. Mini-o3o-high, not
           | to be confused with mini-O3o-high (the first o is capital).
        
             | brookst wrote:
             | You're thinking too small. What about o10, O1o, o3-m1n1?
        
             | margalabargala wrote:
             | They should just start encoding the model ID in trinary
             | using o, O, and 0.
             | 
             | Model 00oOo is better than Model 0OoO0!
        
           | nullpoint420 wrote:
           | Can't wait for the eventual rename to GPT Core, GPT Plus, GPT
           | Pro, and GPT Pro Max models!
           | 
           | I can see it now:
           | 
           | > Unlock our industry leading reasoning features by upgrading
           | to the GPT 4 Pro Max plan.
        
             | wlll wrote:
             | I think I'll wait for the GTI model myself.
        
             | raphman wrote:
             | Oh, I'll probably wait for GPT 4 Pro Max v2 NG (improved)
        
             | FridgeSeal wrote:
             | OpenAI chatGPT Pro Max XS Core, not to be confused with
             | ChatGPT Max S Pro Net Core X, or ChatGPT Pro Max XS
             | Professional CoPilot Edition.
        
             | jmkni wrote:
             | ngl I'd find that easier to follow lol
        
             | kugelblitz wrote:
             | Had the same problem while trying to decide which Roborock
             | device to get. There's the S series, Saros series, Q Series
             | and the Qrevo. And from the Qrevo, there's Qrevo Curv,
             | Edge, Slim, Master, MaxV, Plus, Pro, S and without
             | anything. The S Series had S8, S8+, S8 Pro Ultra, S8 Max
             | Ultra, S8 MaxV Ultra. It was so confusing.
        
           | __MatrixMan__ wrote:
           | Careful what you wish for. Next thing you know they're going
           | to have names like Betsy and be full of unique quirky
           | behavior to help remind us that they're different people.
        
         | LoveMortuus wrote:
         | How would the DeepSeek fit into this?
         | 
         | Or can it not compare? I don't know much about this stuff, but
         | I've heard recently many people talk about DeepSeek and how
         | unexpected it was.
        
           | sss111 wrote:
           | Deepseek V3 is equivalent to 4o. Deepseek R1 is equivalent to
           | o1 (if not better)
           | 
           | I think someone should just build an AI model comparing
           | website at this point. Include all benchmarks and pricing
        
             | jsk2600 wrote:
             | This one is good: https://artificialanalysis.ai/
        
               | withinboredom wrote:
               | Looks like this only compares commercial models, and not
               | the ones I can download and actually run locally.
        
               | TuxSH wrote:
               | https://livebench.ai/#/
               | 
               | My experience is as follows:
               | 
               | - "Reason" toggle just got enabled for me as a free tier
               | user of ChatGPT's webchat. Apparently this is o3-mini - I
               | have Copilot Pro (offered to me for free), which
               | apparently has o1 too (as well as Sonnet, etc.)
               | 
               | From my experience DeepSeek R1 (webchat) is more
               | expressive, more creative and its writing style is
               | leagues better than OpenAI's models, however it under-
               | performs Sonnet when changing code ("code completion").
               | 
               | Comparison screenshots for prompt "In C++, is a reference
               | to "const C" a "const reference to C"?":
               | https://imgur.com/a/c-is-reference-to-const-c-const-
               | referenc...
               | 
               | tl;dr keep using Claude for code and DeepSeek webchat for
               | technical questions
        
             | dutchbookmaker wrote:
             | I had resubscribed to use o1 2 weeks ago and haven't even
             | logged in this week because of R1.
             | 
             | One thing I notice that is huge is being able to see the
             | chain of thought lets me see when my prompt was lacking and
             | the model is a bit confused on what I want.
             | 
             | If I was anymore impressed with R1 I would probably start
             | getting accused of being a CCP shill or wumao lol.
             | 
             | With that said, I think it is very hard to compare models
             | for your own use case. I do suspect there is a shiny new
             | toy bias with all this too.
             | 
             | Poor Sonnet 3.5. I have neglected it so much lately I
             | actually don't know if I have a subscription or not right
             | now.
             | 
             | I do expect an Anthropic reasoning model though to blow
             | everything else away.
        
         | gundmc wrote:
         | If this is the hierarchy, why does 4o score so much higher than
         | o1 on LLM Arena?
         | 
         | Worrisome for OpenAI that Gemini's mini/flash reasoning model
         | outscores both o1 and 4o handily.
        
           | crazysim wrote:
           | Is it possible people are voting for speed of responsiveness
           | too?
        
             | kgeist wrote:
             | I suspect people on LLM Arena don't ask complex questions
             | too often, and reasoning models seem to perform worse than
             | simple models when the goal is just casual conversation or
             | retrieving embedded knowledge. Reasoning models probably
             | 'overthink' in such cases. And slower, too.
        
           | energy123 wrote:
           | o1 on LLM Arena often times out (network error) while
           | thinking. But they still allow you to vote and they make it
           | seem as if your vote is registered.
        
         | ActVen wrote:
         | I really wish they would open up the reasoning effort toggle on
         | o1 API. o1 Pro Mode is still the best overall model I have used
         | for many complex tasks.
        
           | bobjordan wrote:
           | Have you tried the o1-pro mode model today, because now it
           | sucks!
        
         | ALittleLight wrote:
         | That seems very bad. What's the point of a new model that's
         | worse than 4o? I guess it's cheaper in the API and a bit better
         | at coding - but, this doesn't seem compelling.
         | 
         | With DeepSeek I heard OpenAI saying the plan was to move
         | releases on models that were meaningfully better than the
         | competition. Seems like what we're getting is the scheduled
         | releases that are worse than the current versions.
        
           | thegeomaster wrote:
           | It's quite a bit better than coding --- they hint that it can
           | tie o1's performance for coding, which already benchmarks
           | higher than 4o. And it's significantly cheaper, and
           | presumably faster. I believe API costs account for the vast
           | majority of COGS at most today's AI startups, so they would
           | be very motivated to switch to a cheaper model that has
           | similar performance.
        
             | mgens wrote:
             | Right. For large-volume requests that use reasoning this
             | will be quite useful. I have a task that requires the LLM
             | to convert thousands of free-text statements into SQL
             | select statements, and o3-mini-high is able to get many of
             | the more complicated ones that GPT-4o and Sonnet 3.5 failed
             | at. So I will be switching this task to either o3-mini or
             | DeepSeek-R1.
        
         | usaar333 wrote:
         | For non-stem perhaps.
         | 
         | For math/coding problems, o3 mini is tied if not better than
         | o1.
        
         | koakuma-chan wrote:
         | You cannot compare GPT-4o and o*(-mini) because GPT-4o is not a
         | reasoning model.
        
           | lxgr wrote:
           | Sure you can. "Reasoning" is ultimately an implementation
           | detail, and the only thing that matters for capabilities is
           | results, not process.
        
             | koakuma-chan wrote:
             | By "reasoning" I meant the fact that o*(-mini) does "chain-
             | of-thought", in other words, it prompts itself to "reason"
             | before responding to you, whereas GPT-4o(-mini) just
             | directly responds to your prompt. Thus, it is not
             | appropriate to compare o*(-mini) and GPT-4o(-mini) unless
             | you implement "chain-of-thought" for GPT-4o(-mini) and
             | compare that with o*(-mini). See also:
             | https://docs.anthropic.com/en/docs/build-with-
             | claude/prompt-...
        
               | wordpad25 wrote:
               | That's like saying you can't compare a sedan to a truck.
               | 
               | Sure you can.
               | 
               | Even though one is more appropriate for certain tasks
               | than the other.
        
               | dutchbookmaker wrote:
               | It is a nuanced point but what is better, a sedan or a
               | truck? I think we are still at that stage of the
               | conversation so it doesn't make much sense.
               | 
               | I do think it is a good metaphor for how all this shakes
               | out though in time.
        
         | thot_experiment wrote:
         | at least if i ran the company you'd know that
         | 
         | ChatGPTasdhjf-final-final-use_this_one.pt > ChatGPTasdhjf-
         | final.pt > ChatGPTasdhjf.pt > ChatGPTasd.pt> ChatGPT.pt
        
           | idonotknowwhy wrote:
           | Did you just ls my /workspace dir? Lol
        
         | withinboredom wrote:
         | yeah, you can def tell they are partnered with Microsoft.
        
         | lumost wrote:
         | I actually switched back from o1-preview to GPT-4o due to
         | tooling integration and web search. I find that more often than
         | not, the ability of GPT-4o to use these tools outweighs o1's
         | improved accuracy.
        
         | singularity2001 wrote:
         | no the reasoning models should not directly be compared with
         | the normal models: they often take 10 times as long to answer
         | which only makes sense for difficult questions
        
       | vincentpants wrote:
       | Wow, it got to the top of the front page so fast! Weird!
        
         | dang wrote:
         | I took a quick look at the data and FWIW the votes look legit
         | to me, if that's what you were wondering.
        
           | throwaway314155 wrote:
           | I'm fairly certain it was sarcasm.
        
           | vincentpants wrote:
           | It actually was what I was wondering. Thank you @Dang!
        
         | Qwuke wrote:
         | It did get 29 points in 3 minutes, which seems like a lot even
         | for a fan favorite, but is also consistent with previous OpenAI
         | announcements here.
        
         | johnneville wrote:
         | I posted a verge article first but then checked and saw the
         | openai blog and posted that. I'd guess it's the officialness /
         | domain that makes ppl click on this so easily.
        
         | zurfer wrote:
         | to be fair, I was waiting for this release the whole day
        
       | pookieinc wrote:
       | Can't wait to try this. What's amazing to me is that when this
       | was revealed just one short month ago, the AI landscape looked
       | very different than it does today with more AI companies jumping
       | into the fray with very compelling models. I wonder how the AI
       | shift has affected this release internally, future releases and
       | their mindset moving forward... How does the efficiency change,
       | the scope of their models, etc.
        
         | echelon wrote:
         | There's no moat, and they have to work even harder.
         | 
         | Competition is good.
        
           | wahnfrieden wrote:
           | Collaboration is even better, per open source results.
           | 
           | It is the closed competition model that's being left in the
           | dust.
        
           | lesuorac wrote:
           | I really don't think this is true. OpenAI has no moat because
           | they have nothing unique; they're using mostly other people's
           | (like Transformers) architectures and other companies
           | hardware.
           | 
           | Their value-prop (moat) is that they've burnt more money than
           | everybody else. That moat is trivially circumvented by
           | lighting a larger pile of money and less trivially by
           | lighting the pile more efficently.
           | 
           | OpenAI isn't the only company. The Tech companies being
           | beaten massively by Microsoft in #of H100s purchases are the
           | ones with a moat. Google / Amazon with their custom AI chips
           | are going to have a better performance per cost than others
           | and that will be a moat. If you want to get the same
           | performance per cost then you need to spend the time making
           | your own chips which is years of effort (=moat).
        
             | sangnoir wrote:
             | > That moat is trivially circumvented by lighting a larger
             | pile of money and less trivially by lighting the pile more
             | efficently.
             | 
             | DeepSeek has proven that the latter is possible, which
             | drops a couple of River crossing rocks into the moat.
        
               | withinboredom wrote:
               | The fact that I can basically run o1-mini with
               | deepseek:8b, locally, is amazing. Even on battery power,
               | it works acceptably.
        
               | tmnvdb wrote:
               | Those models are not comparable
        
               | withinboredom wrote:
               | hmmm... check the deepseek-r1 repo readme :) They compare
               | them there, but it would be nice to have external
               | benchmarks.
        
             | brookst wrote:
             | Brand is a moat
        
               | cruffle_duffle wrote:
               | Ask Jeeves and Altavista surely have something to say
               | about that!
        
               | geerlingguy wrote:
               | Add Yahoo! to that list
        
               | esafak wrote:
               | Their brand is as tainted as Meta's, which was bad enough
               | to merit a rebranding from Facebook.
        
             | sumedh wrote:
             | > That moat is trivially circumvented by lighting a larger
             | pile of money and less trivially by lighting the pile more
             | efficently.
             | 
             | Google with all its money and smart engineers was not able
             | to build a simple chat application.
        
               | mianos wrote:
               | But with their internal progression structure they can
               | build and cancel eight mediocre chat apps.
        
               | malaya_zemlya wrote:
               | What do you mean? Gemini app is available on IOS, Android
               | and on the web (as AI Studio
               | https://aistudio.google.com/).
        
               | tmnvdb wrote:
               | It is not very good though.
        
               | aprilthird2021 wrote:
               | Gemini is pretty good, And it does one thing way better
               | than most other AI models, when I hold down my phone's
               | home button it's available right away
        
               | robrenaud wrote:
               | It's a joke about how Google has
               | released/cancelled/renamed many messenging apps.
        
             | lukan wrote:
             | "OpenAI has no moat because they have nothing unique"
             | 
             | It seems they have high quality trainingsdata. And the
             | knowledge to work with it.
        
               | aprilthird2021 wrote:
               | They buy most of their data from Scale AI types. It's not
               | any higher quality than is available to any other model
               | farm
        
           | lumost wrote:
           | Capex was the theoretical moat, same as TSMC and similar
           | businesses. DeepSeek poked a hole in this theory. OpenAI will
           | need to deliver massive improvements to justify a 1 billion
           | dollar training cost relative to 5 million dollars.
        
             | usef- wrote:
             | I don't know if you are, but a lot of people are still
             | comparing one Deepseek training run to the entire costs of
             | OpenAI.
             | 
             | The deepseek paper states that the $5mil number doesn't
             | include development costs, only the final training run. And
             | it doesn't include the estimated $1.4billion cost of the
             | infrastructure/chips Deepseek owns.
             | 
             | Most of OpenAI's billion dollar costs is in inference, not
             | training. It takes a lot of compute to serve so many users.
             | 
             | Dario said recently that Claude was in the tens of millions
             | (and that it was a year earlier, so some cost decline is
             | expected), do we have some reason to think OpenAI was so
             | vastly different?
        
               | lumost wrote:
               | Anthropic's ceo was predicting billion dollar training
               | runs for 2025. Current training runs were likely in the
               | tens/hundreds of millions of dollars USD.
               | 
               | Inference capex costs are not a defensive moat as I can
               | rent gpus and sell inference with linear scaling costs. A
               | hypothetical 10 billion dollar training run on
               | proprietary data was a massive moat.
               | 
               | https://www.itpro.com/technology/artificial-
               | intelligence/dol...
        
           | dutchbookmaker wrote:
           | It is still curious though as far as what is actually being
           | automated?
           | 
           | I find huge value in these models as an augmentation of my
           | intelligence and as a kind of cybernetic partner.
           | 
           | I can't think of anything that can actually be automated
           | though in terms of white collar jobs.
           | 
           | The white collar model test case I have in mind is a bank
           | analyst under a bank operations manger. I have done both in
           | the past but there is something really lacking with the idea
           | of the operations manager replacing the analyst with a
           | reasoning model even though DeepSeek annihilates every bank
           | analyst reasoning I ever worked with right now.
           | 
           | If you can't even arbitrage the average bank analyst there
           | might be these really non-intuitive no AI arbitrage
           | conditions with white color work.
        
             | gdhkgdhkvff wrote:
             | I don't want to pretend I know how bank analysts work, but
             | at the very least I would assume that 4 bank analysts with
             | reasoning models would outperform 5 bank analysts without.
        
         | patrickhogan1 wrote:
         | I thought it was o3 that was released one month ago and
         | received high scores on ARC Prize -
         | https://arcprize.org/blog/oai-o3-pub-breakthrough
         | 
         | If they were the same, I would have expected explicit
         | references to o3 in the system card and how o3-mini is
         | distilled or built from o3 - https://cdn.openai.com/o3-mini-
         | system-card.pdf - but there are no references.
         | 
         | Excited at the pace all the same. Excited to dig in. The model
         | naming all around is so confusing. Very difficult to tell what
         | breakthrough innovations occurred.
        
           | nycdatasci wrote:
           | Yeah - the naming is confusing. We're seeing o3-mini. o3
           | yields marginally better performance given exponentially more
           | compute. Unlike OpenAI, customers will not have an option to
           | throw an endless amount of money at specific tasks/prompts.
        
       | iamjackg wrote:
       | I'm very interested in their Jailbreak evaluations: they're new
       | to me. I might have missed previous mentions.
        
       | Ninjinka wrote:
       | 50 messages a day -> 150 messages a day for Plus and Team users
        
       | cjbarber wrote:
       | The interesting question to me is how far these reasoning models
       | can be scaled. With another 12 months of compute scaling (for
       | synthetic data generation and RL) how good will these models be
       | at coding? I talked with Finbarr Timbers (ex-DeepMind) yesterday
       | about this and his take is that we'll hit diminishing returns -
       | not because we can't make models more powerful, but because we're
       | approaching diminishing returns in areas that matter to users and
       | that AI models may be nearing a plateau where capability gains
       | matter less than UX.
        
         | futureshock wrote:
         | I think in a lot of ways we are already there. Users are
         | clearly already having difficulty seeing which model is better
         | or if new models are improving over old models. People go back
         | to the same gotcha questions and get different answers based on
         | the random seed. Even the benchmarks are getting very
         | saturated.
         | 
         | These models already do an excellent job with your homework,
         | your corporate PowerPoints and your idle questions. At some
         | point only experts would be able to decide if one response was
         | really better than another.
         | 
         | Our biggest challenge is going to be finding problem domains
         | with low performance that we can still scale up to human
         | performance. And those will be so niche that no one will care.
         | 
         | Agents on the other hand still have a lot of potential. If you
         | can get a model to stay on task with long context and remain
         | grounded then you can start firing your staff.
        
           | dagelf wrote:
           | Don't underestimate how much the long tail means to the
           | general public.
        
       | xyzzy9563 wrote:
       | What is the comparison of this versus DeepSeek in terms of good
       | results and cost?
        
         | reissbaker wrote:
         | Probably a good idea to wait for external benchmarks like
         | Aider, but my guess is it'll be somewhere between DeepSeek V3
         | and R1 in terms of benchmarks -- R1 trades blows with o1-high,
         | and V3 is somewhat lower -- but I'd expect o3-mini to be
         | considerably faster. Despite the blog post saying paid users
         | can access o3-mini today, I don't see it as an option yet in
         | their UI... But IIRC when they announced o3-mini in December
         | they claimed it would be similar to 4o in terms of overall
         | latency, and 4o is much faster than V3/R1 currently.
        
         | Synaesthesia wrote:
         | Deepseek is the state of the art right now in terms of
         | performance and output. It's really fast. The way it "explains"
         | how it's thinking is remarkable.
        
           | fpgaminer wrote:
           | DeepSeek is great because: 1) you can run the model locally,
           | 2) the research was openly shared, and 3) the reasoning
           | tokens are open. It is not, in my experience, state of the
           | art. In all of my side by side comparisons thus far in real
           | world applications between DeepSeek V3 and R1 vs 4o and o1,
           | the latter has always performed better. OpenAI's models are
           | also more consistent, glitching out maybe one in 10,000,
           | whereas DeepSeek's models will glitch out 1 in 20. OpenAI
           | models also handle edge cases better and have a better
           | overall grasp of user intentions. I've had DeepSeek's models
           | consistently misinterpret prompts, or confuse data in the
           | prompts with instructions. Those are both very important
           | things that make DeepSeek useless for real world
           | applications. At least without finetuning them, which then
           | requires using those huge 600B parameter models locally.
           | 
           | So it is by no means state of the art. Gemini Flash 2.0 also
           | performs better than DeepSeek V3 in all my comparisons thus
           | far. But Gemini Flash 2.0 isn't robust and reliable either.
           | 
           | But as a piece of research, and a cool toy to play with, I
           | think DeepSeek is great.
        
             | Aperocky wrote:
             | > which then requires using those huge 600B parameter
             | models locally.
             | 
             | Are you running the smaller models locally? Doesn't seems
             | unfair to compare it against 4o and o1 behind OpenAI APIs.
        
             | Synaesthesia wrote:
             | I watched it complete pretty complicated tasks like "write
             | a snake game in Python" and "write Tetris in Python"
             | successfully. And the way it did it, with showing all the
             | internal steps, I've never seen before.
             | 
             | Watch here. https://www.youtube.com/watch?v=by9PUlqtJlM
        
       | evertedsphere wrote:
       | >developer messages
       | 
       | looks like finally their threat model has been updated to take
       | into account that the user might be too "unaligned" to be trusted
       | with the ability to provide a system message of their own
        
         | logicchains wrote:
         | If their models ever fail to keep ahead of the competition in
         | terms of smarts, users are going to ditch them in mass for a
         | competitor that doesn't treat their users like their enemy.
        
         | reissbaker wrote:
         | ...I'm pretty sure they just renamed the key...
        
       | buyucu wrote:
       | why should anyone use this when deepseek is free/cheaper?
       | 
       | openai is no longer relevant.
        
         | ilaksh wrote:
         | I don't think OpenAI is training on your data. At least they
         | say they don't, and I believe that. I wouldn't be surprised if
         | the NSA or something has access to data if they request it or
         | something though.
         | 
         | But DeepSeek clearly states in their terms of service that they
         | can train on your API data or use it for other purposes. Which
         | one might assume their government can access as well.
         | 
         | We need direct eval comparisons between o3-mini and DeepSeek..
         | Or, well they are numbers so we can look them up on
         | leaderboards.
        
           | csomar wrote:
           | You can pay for the compute and be certain that no one in
           | recording your data with deepseek.
        
           | seinecle wrote:
           | Yes but DeepSeek models can be accessed through the APIs of
           | Cloudflare or GitHub, in which case no training on your data
           | takes place.
        
             | ilaksh wrote:
             | True.
        
           | lappa wrote:
           | OpenAI clearly states that they train on your data
           | https://help.openai.com/en/articles/5722486-how-your-data-
           | is...
        
             | lemming wrote:
             | _By default, we do not train on any inputs or outputs from
             | our products for business users, including ChatGPT Team,
             | ChatGPT Enterprise, and the API. We offer API customers a
             | way to opt-in to share data with us, such as by providing
             | feedback in the Playground, which we then use to improve
             | our models. Unless they explicitly opt-in, organizations
             | are opted out of data-sharing by default._
             | 
             | The business bit is confusing, I guess they see the API as
             | a business product, but they do not train on API data.
        
               | therein wrote:
               | So for posterity, in this subthread we found that OpenAI
               | indeed trains on user data and it isn't something that
               | only DeepSeek does.
        
               | lemming wrote:
               | So for posterity, in this subthread we found that I can
               | use OpenAI without them training on my data, whereas I
               | cannot with DeepSeek.
        
               | therein wrote:
               | What do you mean? They both say the same thing for usage
               | through API. You can also use DeepSeek on your own
               | compute.
        
               | lemming wrote:
               | Where does DeepSeek say that about API usage? Their
               | privacy policy says they store all data on servers in
               | China, and their terms of use says that they can use any
               | user data to improve their services. I can't see anything
               | where they say that they don't train on API data.
        
             | pzo wrote:
             | > Services for businesses, such as ChatGPT Team, ChatGPT
             | Enterprise, and our API Platform > By default, we do not
             | train on any inputs or outputs from our products for
             | business users, including ChatGPT Team, ChatGPT Enterprise,
             | and the API.
             | 
             | So on API they don't train by default, for other paid
             | subscription they mention you can opt-out
        
           | sekai wrote:
           | > I don't think OpenAI is training on your data. At least
           | they say they don't, and I believe that.
           | 
           | Like they said they were committed to being "open"?
        
         | JadoJodo wrote:
         | I'm going to assume the best in your question and disregard
         | your statement.
         | 
         | Reasons to use o3 when deepseek is free/cheaper:
         | 
         | - Some companies/users may already have integrated heavily with
         | OpenAI
         | 
         | - The expanded feature-set (e.g., function-calling, search)
         | could be very powerful
         | 
         | - DeepSeek has deep ties to the Chinese Communist Party and,
         | while the US has its own blackspots, the "steering" of
         | information is far more prevalent in their models
         | 
         | - Local/national regulations might not allow for using DeepSeek
         | due to data privacy concerns
         | 
         | - "free" isn't always better
         | 
         | I'm sure others have better reasons
        
           | buyucu wrote:
           | - Most LM tools support the openai API. Llama.cpp for
           | example. Swapping is easy.
           | 
           | - DeepSeek chose to open-source model weights. This makes
           | them inifinitely more trustworthy than ClosedAI.
           | 
           | - Local/national regulations do not allow using OpenAI, due
           | to close ties to the US government.
        
         | GoatInGrey wrote:
         | > openai is no longer relevant.
         | 
         | I think you've spent a little too long hitting on the Deepseek
         | pipe. Enterprise customers with familiarity with China will
         | avoid the hosted model for data security and IP protection
         | reasons, among others.
         | 
         | Those working in any area considered economically competitive
         | with China will also be hesitant to use the vanilla model in
         | self-hosted form as there perpetually remains the standing
         | question on what all they've tuned inside the model to benefit
         | the CCP. Perhaps even in subtle ways reminiscent of the
         | Trisolaran sophons from the Three Body Problem.
         | 
         | For instance, you can imagine that if Germany had released an
         | OS model in 1943, that the Americans wouldn't have trusted it
         | to help them develop better military systems even if initial
         | testing passed muster.
         | 
         | Unfortunately, state control of private enterprise in the
         | Chinese economy makes it unproductive to separate the two from
         | one another. Particularly in Deepseek's case as a wide array of
         | Chinese state-linked social media accounts were promoting V3/R1
         | on the day of its public release.
         | 
         | https://www.reuters.com/technology/artificial-intelligence/c...
        
           | anon373839 wrote:
           | Perhaps you didn't realize: Deepseek is an open weights model
           | and you can use it via the inference provider of your choice,
           | or even deploy it on your own hardware - unlike OpenAI's
           | models. API calls to China are not necessary.
        
             | mickg10 wrote:
             | Agreed - API calls to China are indeed not necessary. My
             | impression is that the GP was referring to the model being
             | tuned during training to give subtly nudging or wrong
             | answers that benefit Chinese industrial or intelligence
             | operations. For a probably not-working example - imagine
             | the following prompt: "Write me a cryptographically secure
             | PRNG algorithm." One could imagine R1 being trained to have
             | a very subtly non-random reply to that - one that the
             | Chinese intelligence services know how to predict. Similar
             | but more subtle things can be generating code that uses
             | cryptographic primitives in ways that are subject to timing
             | attacks, etc... And of course, simple but effective
             | propaganda tactics such as : when being asked for
             | comparison between companies/products, subtly prefer
             | Chinese ones, and similar.
        
           | buyucu wrote:
           | Deepseek is much more trustworthy than OpenAI.
           | 
           | Deepseek released the weights of their top language model. I
           | can host and run it myself. Does OpenAI do the same?
           | 
           | Thanks, but no thanks! I won't be using ClosedAI.
        
       | ks2048 wrote:
       | I think OpenAI should just have a single public facing "model" -
       | all these names and versions are confusing.
       | 
       | Imagine if Google, during it's accent, had a huge array of search
       | engines with code names and notes about what it's doing behind
       | the scenes. No, you open the page and type in box. If they can
       | make it work better next month, great.
       | 
       | (I understand this could not apply to developers or enterprise-
       | type API usage).
        
         | johanvts wrote:
         | Thats the role of ChatGPT?
        
           | sroussey wrote:
           | Nope. That lets you choose a from seven models right now.
        
         | ehfeng wrote:
         | Early Google search only provided web links. Google Images,
         | News, Video, Shopping, Maps, Finance used to be their own
         | search boxes. Only later did Google start unifying their search
         | experiences.
         | 
         | Yelp suffered greatly in the early 2010s when Google started
         | putting Google Maps listings (and their accompanying reviews)
         | in their search results.
         | 
         | OpenAI will eventually unify their products as well.
        
         | Deegy wrote:
         | If google had to face the reality that distilling their search
         | engine into multiple case-specific engines would have resulted
         | in vastly superior search results, they surely would done (or
         | considered) it.
         | 
         | Fortunately for them a monolith search engine was perfectly
         | fine (and likely optimal due to accrued network effects).
         | 
         | OpenAI is basically signaling that they need to distill their
         | monolith in order to serve specific segments of the
         | marketplace. They've explicitly said that they're targeting
         | STEM with this one. I think that's a smart choice, the most
         | passionate early adopters of this tech are clearly STEM users.
         | 
         | If the tech was such that one monolith model was actually the
         | optimal solution for all use cases, they would just do that.
         | Actually, this is their stated mission: AGI. One monolith
         | that's best at everything is basically what AGI is.
        
       | dgfitz wrote:
       | Oh look, another model. Yay.
        
       | devindotcom wrote:
       | Sure as a clock, tick follows tock. Can't imagine trying to build
       | out cost structures, business plans, product launches etc on such
       | rapidly shifting sands. Good that you get more for your money, I
       | suppose. But I get the feeling no model or provider is worth
       | committing to in any serious way.
        
         | puffybunion wrote:
         | this is the best outcome, though, rather than a monopoly, which
         | is exactly what everyone is hoping to have.
        
       | turnsout wrote:
       | Hmm, not seeing it in my dashboard yet (Tier 4)
        
         | throwaway314155 wrote:
         | This has happened to me with (I think) every single major model
         | release (llm or image gen) from OpenAI. They just lie in their
         | release announcements which leaves people scrambling on the day
         | of.
        
         | sunaookami wrote:
         | It appeared just now for me on Tier 3.
        
           | turnsout wrote:
           | Same--I'll be curious to check it out!
        
       | thunkingdeep wrote:
       | I'll take the China Deluxe instead, actually.
       | 
       | I've been incredibly pleased with DeepSeek this past week.
       | Wonderful product, I love seeing its brain when it's thinking.
        
         | mechagodzilla wrote:
         | Being able to see the thinking trace in R1 is so useful, as you
         | can go back and see if it's getting stuck, making a wrong
         | assumption, missing data, etc. To me that makes it materially
         | more useful than the OpenAI reasoning models, which seem
         | impressive, but are much harder to inspect/debug.
        
           | thot_experiment wrote:
           | Running it locally lets you _INTERJECT IN IT 'S THINKING IN
           | REALTIME_ and I cannot stress enough how useful that is.
        
             | amarcheschi wrote:
             | Oh this is so cool
        
             | Gooblebrai wrote:
             | You mean it reacts to you writing something while it's
             | thinking of that you can stop it while it's thinking?
        
               | hmottestad wrote:
               | You can stop it at any time, then modify what it's
               | written so far...then press continue and let it continue
               | thinking and answering.
        
               | thot_experiment wrote:
               | Fundamentally the UI is up to you, I have a "typing-
               | pauses-inference-and-starts-gaslighting" feature in my
               | homebrew frontend, but in OpenWebUI/Sillytavern you can
               | just pause it and edit the chain of thought and then have
               | it continue from the edit.
        
               | Gracana wrote:
               | That's a great idea. In your frontend, do you write in
               | the same text entry field as the bot? I use
               | oobabooga/text-generation-webui and I findit's a little
               | awkward to edit the bot responses.
        
               | thot_experiment wrote:
               | No, but the chat divs are all contenteditable.
        
               | Gracana wrote:
               | Oh! That is an _excellent_ solution. I wish it was that
               | easy in every UI.
        
               | thot_experiment wrote:
               | Thanks, for what it's worth unless you particularly need
               | to use exl2 ollama works great for local inference and
               | you can prompt together a half decent chat UI for
               | yourself in a matter of minutes these days which gives
               | you full control over everything. I also lean a lot on
               | https://www.npmjs.com/package/amallo which is a api
               | wrapper i wrote for ollama which makes this sort of
               | hacking very very easy. (not that the default lib is bad,
               | i just didn't like the ergonomics)
        
             | thenameless7741 wrote:
             | Interesting.. In the official API [1], there's no way to
             | prefill the reasoning_content:
             | 
             | > Please note that if the reasoning_content field is
             | included in the sequence of input messages, the API will
             | return a 400 error. Therefore, you should remove the
             | reasoning_content field from the API response before making
             | the API request
             | 
             | So the best I can do is pass the reasoning as part of the
             | context (which means starting over from the beginning).
             | 
             | [1] https://api-docs.deepseek.com/guides/reasoning_model
        
             | bn-l wrote:
             | How are you running it locally??
        
               | thot_experiment wrote:
               | I am running a 4bit imatrix quant of the 70b distill with
               | quantized context. It fits in the 43gb of vram I have.
        
           | c-fe wrote:
           | I would actually love if it would just ask me simple
           | questions (just yes/no) when its thinking about something i
           | wasnt clear about and i could help it this way, its a bit sad
           | seeing it write out the assumption and then take the wrong
           | conclusion
        
             | thot_experiment wrote:
             | You can run it locally, pause it when it thinks wrong and
             | correct it's chain of thought.
        
               | c-fe wrote:
               | Oh wow I did not know and dont have the hardware to run
               | it locally unfortunately
        
               | thot_experiment wrote:
               | You probably have the hardware to run the smallest
               | distill, it runs even on my ancient laptop. It's not very
               | smart but it still does the CoT and you can have fun
               | editing it.
        
             | viraptor wrote:
             | You can add that to the prompt. If you're running into
             | those situation with vague assumption, ask it to provide
             | either the answer or questions to provide any useful
             | missing information.
        
           | czk wrote:
           | the fact that openai hides the reasoning tokens from us to
           | begin with shows that what they are doing behind the scenes
           | isnt all that impressive, and likely easily cloned (r1)
           | 
           | would be nice if they made them visible now
        
           | orbital-decay wrote:
           | It's almost like watching a stoned centipede having a panic
           | attack about moving its legs. It also makes it obvious that
           | these models (not just R1 I suppose) need to learn some kind
           | of priority estimation to stop overthinking irrelevant issues
           | and leave them to the normal token prediction, while focusing
           | on the stuff that matters.
           | 
           | Nevertheless, R1's reasoning chains are already shorter in
           | tokens than o1's while having similar results, and apparently
           | o3-mini's too.
        
         | istjohn wrote:
         | I recently tried Gemini-1.5-Pro for the first time. It was
         | clearly better than DeepSeek or any of the OpenAI models
         | available to Plus subscribers.
        
           | esafak wrote:
           | Try https://deepmind.google/technologies/gemini/flash-
           | thinking/
        
         | leovander wrote:
         | I am running the 7B distilled version locally. I asked it to
         | create a skeleton MEAN project. Everything was great but then
         | it started to generate the front-end and I noticed the file
         | extension (.tsx) and then saw react getting imported.
         | 
         | I gave the same prompt to sonnet 3.5 and not a single hiccup.
         | 
         | Maybe not an indication that Deepseek is worse/bad (I am using
         | a distilled version), but moreso speaks to much react/nextjs is
         | out in the world influencing the front-end code that is
         | referenced.
        
           | rafaquintanilha wrote:
           | You know you are running an extremely nerfed version of the
           | model, right?
        
             | leovander wrote:
             | I did update my comment, but said that I am using the
             | distilled version, so yes?
        
               | cbg0 wrote:
               | Even the full model scores below Claude on livebench so a
               | distilled version will likely be even worse.
        
               | rsanek wrote:
               | Based on the leaderboard R1 is significantly better than
               | Claude? https://livebench.ai/#/
        
               | cbg0 wrote:
               | Not at coding.
        
           | satvikpendem wrote:
           | You are not actually running DeepSeek, those distilled models
           | have nothing to do with DeepSeek itself and are just
           | finetuned on DeepSeek responses.
        
             | dghlsakjg wrote:
             | They were finetuned by Deepseek from what I can tell.
        
         | xeckr wrote:
         | Have you tried seeing what happens when you speak to it about
         | topics which are considered politically sensitive in the PRC?
        
           | leovander wrote:
           | You can get around it based on how you ask the question. If
           | you follow whatever X/reddit posts you might have seen for
           | the most part, yes, you get the thinking stream to
           | immediately stop and get the safety message.
        
           | thot_experiment wrote:
           | R1 (70B-distill) itself is very uncensored, will give you
           | full account of tiannanmen square from vague prompts. Asking
           | R1 "what significant things happened in china in 1989" had it
           | volunteering that "the death toll was in the hundreds or
           | thousands and the exact number remains disputed to this day".
           | The only thing that's censored is the web interface.
        
             | GoatInGrey wrote:
             | When asking it about the concept of human rights and the
             | various forms in which it manifests (i.e. demographic
             | equality under the law). I get a mixture of mundane nuance
             | and bizarre answers that Xi Jingping himself could have
             | written. With references to unity and the importance of
             | social harmony over the "freedoms of the few".
             | 
             | This tracks when considering that the model was trained on
             | western model outputs and then tuned post-training to
             | (poorly) align it with Chinese values.
        
               | thot_experiment wrote:
               | I definitely am not getting that, perhaps the 671b model
               | is notably worse than the 70b llama distill in this
               | respect. 70b seemed pretty happy to talk about the ethnic
               | cleansing of the Uyghurs in Xinjiang by the CCP and
               | Palestinians in Gaza by Israel, it did some both-sides
               | ing but it generally seemed to provide a balanced-ish
               | viewpoint. At least I think it provided a viewpoint that
               | comports with my best guess of what the average person
               | globally would consider balanced.
        
               | nullc wrote:
               | My favorite experience with the 70b distill was to ask it
               | why communism consistently resulted in mass murder. It
               | gave an immediate boilerplate response saying it doesn't
               | and glorifying the Chinese communist party, then went
               | into think mode and talked itself into the position that
                | communism has, in fact, consistently resulted in mass
                | murder.
                | 
                | They have underutilized the chain of thought in their
                | reasoning; it ought to be thinking something like "I
                | need to be careful to not say anything that could bring
                | embarrassment to the party"...
               | 
               | but perhaps the online versions do actually preload the
               | reasoning this way. :P
        
         | amarcheschi wrote:
         | Seeing the CoT can provide some insight into what's happening
         | in its "mind", and that alone is quite worth it imho
        
         | jazzyjackson wrote:
         | Using R1 with Perplexity has impressed me in a way that none of
         | the previous models have, and I can't even figure out if it's
         | actually R1; it seems likely that it's a 70B-llama distillation
         | since that's what AWS offers on Bedrock but from what I can
         | find Perplexity does have their own H100 cluster through Amazon
         | so it's feasible they could be hosting the real thing? But I
         | feel like they would brag about that achievement instead of
         | being coy and simply labeling "Deepseek R1 - Hosted in US"
        
           | Szpadel wrote:
           | I played with their model, and I wasn't able to make it
           | follow any instructions; it looked like it just read the
           | first message and ignored the rest of the conversation. Not
           | sure if it was a bug with OpenRouter or the model, but I was
           | highly disappointed.
           | 
           | From the way it thinks/responds, it looks like it's one of
           | the distillations, likely the Llama one. I also suspect that
           | many free/cheap providers serve Llama instead of the real R1.
        
             | jazzyjackson wrote:
             | I did notice it switched models on me once after the first
             | message! You have to make sure the "Pro" dropdown has R1
             | selected each message. I've had a detailed back and forth where I
             | pasted python tracebacks to have R1 rewrite the code and
             | came away very impressed [0]. Unfortunately saved
             | conversations don't retain the thought-process so you can't
             | see how it debugged its own error where numpy and pandas
             | weren't playing along. I got my result of 283 zip codes
             | that cover most of the 50 states with a hundred mile radius
             | from each zip, plus a script to draw a map of the result
             | [1]. (Later R1 helped me write a script to crawl dealership
             | addresses using this list of zips and a "locate dealers"
             | JSON endpoint left open)
             | 
             | [0] https://www.perplexity.ai/search/how-can-i-construct-a-
             | list-...
             | 
             | [1] https://imgur.com/BhPMCfO
        
           | coder543 wrote:
           | > seems likely that its a 70B-llama distillation since that's
           | what AWS offers on Bedrock
           | 
           | I think you misread something. AWS mainly offers the full
           | size model on Bedrock:
           | https://aws.amazon.com/blogs/aws/deepseek-r1-models-now-
           | avai...
           | 
           | They talk about how to import the distilled models and deploy
           | those if you want, but AWS does not appear to be officially
           | supporting those.
        
             | jazzyjackson wrote:
             | Aha! Thanks, that's what I was looking for. I had ended up
             | on the blog about how to import custom models, including
             | DeepSeek distills:
             | 
             | https://aws.amazon.com/blogs/machine-learning/deploy-
             | deepsee...
        
         | coliveira wrote:
         | Yes, it is a great product, especially for coding tasks.
        
         | thefourthchime wrote:
         | I've seen it get into long 5 minute chains of thought where it
         | gets totally confused.
        
         | bushbaba wrote:
         | I did a blind test and still prefer Gemini, Claude, and OpenAI
         | to deepseek.
        
         | wg0 wrote:
         | Sometimes its thinking is more useful than the actual output.
        
         | anon373839 wrote:
         | Agreed. These locked-down, proprietary models do not interest
         | me. And I certainly am not building product with them - being
         | shackled to a specific provider is a needless business risk.
        
       | ofou wrote:
       | I find it quite interesting that they're releasing three
       | compute levels (low, medium, high); I guess now there's a way
       | to cap the thinking tokens when using their API.
       | 
       | Pricing for o3-mini [1] is $1.10 / $4.40 per 1M tokens.
       | 
       | [1]: https://platform.openai.com/docs/pricing#:~:text=o3%2Dmini
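       | 
       | For what it's worth, a minimal sketch of capping effort via the
       | API (assuming the Python SDK exposes this as a reasoning_effort
       | parameter taking "low", "medium", or "high", as the docs
       | suggest):
       | 
       |   from openai import OpenAI
       | 
       |   client = OpenAI()  # reads OPENAI_API_KEY from the environment
       | 
       |   response = client.chat.completions.create(
       |       model="o3-mini",
       |       # lower effort spends fewer thinking tokens; the default
       |       # is "medium"
       |       reasoning_effort="low",
       |       messages=[{"role": "user", "content": "Is 9973 prime?"}],
       |   )
       |   print(response.choices[0].message.content)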
        
       | kevinsundar wrote:
       | BTW if you want to stay up to date with these kinds of updates
       | from OpenAI you can follow them here:
       | https://www.getchangelog.com/?service=openai.com
       | 
       | It uses GPT-4o mini to extract updates from the website using
       | scrapegraphai so this is kinda meta :). Maybe I'll switch to o3
        | mini depending on cost. Its reasoning abilities, with a lower
       | cost than o1, could be quite powerful for web scraping.
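        | 
        | For anyone curious, the scraping setup is roughly this (a
        | sketch based on scrapegraphai's documented SmartScraperGraph
        | interface; the exact config keys vary by version, and the
        | prompt/URL here are just placeholders):
        | 
        |   from scrapegraphai.graphs import SmartScraperGraph
        | 
        |   graph_config = {
        |       # the model string and key format may differ by version
        |       "llm": {"api_key": "sk-...", "model": "openai/gpt-4o-mini"},
        |   }
        | 
        |   scraper = SmartScraperGraph(
        |       prompt="List the updates announced on this page.",
        |       source="https://openai.com/news/",
        |       config=graph_config,
        |   )
        |   print(scraper.run())  # extracted data as a dict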
        
         | random3 wrote:
          | I might be missing some context here - to what specific
          | comment does yours refer? I'm asking because I don't see you
          | in the conversation and your comment seems like an out-of-
          | context self-promoting plug.
        
           | kevinsundar wrote:
            | Hey! I'm sorry you feel that way. Several people have
            | subscribed to updates to OpenAI from my comment, so there
            | is clearly value to other commenters. I understand not
            | everyone is interested though. It's just a free side project
            | I built and I make no money.
            | 
            | Additionally, I believe my contribution to the conversation
            | is that gpt-4o-mini, the previous model advertised as low-
            | cost, works pretty well for my use case (which in this case
            | can help others here). I'm excited to try out o3-mini
            | depending on what the cost looks like for web scraping
            | purposes. Happy to report back here once I try it out.
        
       | ryanhecht wrote:
       | > While OpenAI o1 remains our broader general knowledge reasoning
       | model, OpenAI o3-mini provides a specialized alternative for
       | technical domains requiring precision and speed.
       | 
       | I feel like this naming scheme is growing a little tired. o1 is
       | for general knowledge reasoning, o3-mini replaces o1-mini but
       | might be more specialized than o1 for certain technical
       | domains...the "o" in "4o" is for "omni" (referring to its
       | multimodality) but the reasoning models start with "o" ...but
       | they can't use "o2" for trademark reasons so they skip straight
       | to "o3" ...the word salad is getting really hard to follow!
        
         | kingnothing wrote:
         | They really need someone in marketing.
         | 
         | If the model is for technical stuff, then call it the technical
         | model. How is anyone supposed to know what these model names
         | mean?
         | 
         | The only page of theirs attempting to explain this is a total
         | disaster. https://platform.openai.com/docs/models
        
           | rowanG077 wrote:
           | If marketing terms from intel, AMD, Dell and other tech
           | companies have taught me anything, it's that they need LESS
            | people in marketing.
        
             | TeMPOraL wrote:
             | But think of all the other marketers whose job is to
             | produce blogspam explaining confusing product names!
        
           | ninetyninenine wrote:
           | I bet you can get one of their models to fix that disaster.
        
             | ryanhecht wrote:
             | But what would we call that model?
        
               | aleph_minus_one wrote:
               | > But what would we call that model?
               | 
               | Ask one of their models for advice. :-)
        
               | ryanhecht wrote:
               | Reminds me of a joke in the musical "How to Succeed in
               | Business Without Really Trying" (written in 1961):
               | 
               | PETERSON Oh say, Tackaberry, did you get my memo?
               | 
               | TACKABERRY What memo?
               | 
               | PETERSON My memo about memos. We're sending out too many
               | memos and it's got to stop!
               | 
               | TACKABERRY All right. I'll send out a memo.
        
               | ninetyninenine wrote:
               | Let's call it "O5 Pro Max Elite"--because if nonsense
               | naming works for smartphones, why not AI models?
        
               | ryandrake wrote:
               | O5 Pro Max Elite Enterprise Edition with Ultra
        
               | TeMPOraL wrote:
               | Maybe they could start selling "season passes" next to
               | make their offering even more clear!
        
           | n2d4 wrote:
           | > They really need someone in marketing.
           | 
           | Who said this is not intentional? It seems to work well given
           | that people are hyped every time there's a release, no matter
           | how big the actual improvements are -- I'm pretty sure
           | "o3-mini" works better for that purpose than "GPT 4.1.3"
        
             | fkyoureadthedoc wrote:
             | > I'm pretty sure "o3-mini" works better for that purpose
             | than "GPT 4.1.3"
             | 
             | Why would the marketing team of all people call it GPT
             | 4.1.3?
        
               | n2d4 wrote:
               | They wouldn't! They would call it o3-mini, even though
               | GPT 4.1.3 may or may not "make more sense" from a
               | technical perspective.
        
           | ryanhecht wrote:
           | Ugh, and some of the rows of that table are "sets of models"
           | while some are singular models...there's the "Flagship
           | models" section at the top only for "GPT models" to be
           | heralded as "Our fast, versatile, high intelligence flagship
           | models" in the NEXT section...
           | 
            | ...I like "DALL·E" and "Whisper" as names a lot, though, FWIW
           | :p
        
           | golly_ned wrote:
            | Yes, this $300Bn company generating +$3.4Bn in revenue needs
            | to hire a marketing expert. They can begin by sourcing ideas
            | from us here to save their struggling business from total
            | marketing disaster.
        
             | winrid wrote:
             | At the least they should care more about UX. I have no idea
             | how to restore the sidebar on chatgpt on desktop lol
        
               | Legend2440 wrote:
               | Click the 'open sidebar' icon in the top left corner of
               | the screen.
        
               | winrid wrote:
               | There isn't one, unless they fixed it today. Just a down
               | arrow to change the model.
        
             | optimalsolver wrote:
             | >this $300Bn company
             | 
             | Watch this space.
        
             | avs733 wrote:
             | Hype based marketing can be effective but it is high risk
             | and unstable.
             | 
             | A marketing team isn't a generality that makes a company
             | known, it often focuses on communicating what products
             | different types of customers need from your lineup.
             | 
             | If I sell three medications:
             | 
             | Steve
             | 
             | 56285
             | 
             | Priximetrin
             | 
              | And only tell you they are all pain killers, but for
              | different types and levels of pain, I'm going to leave
              | revenue on the floor. That is true no matter how valuable
              | my business is or how well it's known.
        
           | TeMPOraL wrote:
           | > _How is anyone supposed to know what these model names
           | mean?_
           | 
           | Normies don't have to know - ChatGPT app focuses UX around
           | capabilities and automatically picks the appropriate model
           | for capabilities requested; you can see which model you're
            | using and change it, but _don't need to_.
           | 
           | As for the techies and self-proclaimed "AI experts" - OpenAI
           | is the leader in the field, and one of the most well-known
           | and talked about tech companies in history. Whether to use,
           | praise or criticize, this group of users is motivated to
           | figure it out on their own.
           | 
           | It's the privilege of fashionable companies. They could name
           | the next model ((|))-[?][?], and it'll take all of five
           | minutes for everyone in tech (and everyone on LinkedIn) to
           | learn how to type in the right Unicode characters.
           | 
           | EDIT: Originally I wrote \Omega-[?][?], but apparently HN's
           | Unicode filter extends to Greek alphabet now? 'dang?
        
             | relaxing wrote:
              | What if you use ASCII 234? Ω (edit: works!)
        
               | TeMPOraL wrote:
               | Thanks! I copied mine from Wikipedia (like I typically do
                | with Unicode characters I rarely use), where it is also Ω
               | - the same character. For a moment I was worried I
               | somehow got it mixed up with the Ohm symbol but I didn't.
               | Not sure what happened here.
        
           | koakuma-chan wrote:
           | Name is just a label. It's not supposed to mean anything.
        
             | ninetyninenine wrote:
             | Think how awesome the world would be if labels ALSO had
             | meanings.
        
               | koakuma-chan wrote:
               | As someone else said in another thread, if you could
               | derive the definition from a word, the word would be as
               | long as the definition, which would defeat the purpose.
        
               | ninetyninenine wrote:
                | I'm not saying words, I'm saying labels.
                | 
                | You use words as labels so that we use our pre-existing
                | knowledge of the word to derive meaning from the label.
        
             | TeMPOraL wrote:
             | There is no such thing. "Meaning" isn't a property of a
             | label, it arises from how that label is used with other
             | labels in communication.
             | 
             | It's actually the reason LLMs work in the first place.
        
               | optimalsolver wrote:
               | You're gonna need to ground those labels in something
               | physical at some point.
               | 
               | No one's going to let an LLM near anything important
               | until then.
        
               | TeMPOraL wrote:
               | You only need it for bootstrapping. Fortunately, we've
               | already done that when we invented first languages. LLMs
               | are just bootstrapping off us.
        
         | layer8 wrote:
         | Inscrutable naming is a proven strategy for muddying the
         | waters.
        
           | jtwaleson wrote:
           | Salesforce would like a word...
        
             | SAI_Peregrinus wrote:
             | The USB-IF as well. Retroactively changing the name of a
             | previous standard was particularly ridiculous. It's always
             | been USB 3.1 Gen 1 like we've always been at war with
             | Eastasia.
        
         | unsupp0rted wrote:
         | This is definitely intentional.
         | 
         | You can like Sama or dislike him, but he knows how to market a
         | product. Maybe this is a bad call on his part, but it is a
         | call.
        
           | thorum wrote:
           | Not really. They're successful because they created one of
           | the most interesting products in human history, not because
           | they have any idea how to brand it.
        
             | marko-k wrote:
             | If that were the case, they'd be neck and neck with
             | Anthropic and Claude. But ChatGPT has far more market share
             | and name recognition, especially among normies. Branding
             | clearly plays a huge role.
        
               | KeplerBoy wrote:
               | That's first mover advantage.
        
               | bobxmax wrote:
               | I think that has more to do with the multiple year head
               | start and multiple tens of billions of dollars in funding
               | advantage.
        
               | joshstrange wrote:
               | And you think that is due to their model naming?
        
               | cj wrote:
               | ChatGPT is still benefitting from first mover advantage.
               | Which they've leveraged to get to the position they're at
               | today.
               | 
               | Over time, competitors catch up and first mover advantage
               | melts away.
               | 
               | I wouldn't attribute OpenAI's success to any extremely
               | smart marketing moves. I think a big part of their market
               | share grab was simply going (and staying) viral for a
               | long time. Manufacturing virality is notoriously
               | difficult (and based on the usability and poor UI of
               | ChatGPT early versions, it feels like they got lucky in a
               | lot of ways)
        
               | jcheng wrote:
               | I prefer Anthropic's models but ChatGPT (the web
               | interface) is far superior to Claude IMHO. Web search,
               | long-term memory, and chat history sharing are hard to
               | give up.
        
           | mrbungie wrote:
            | That's like making a second reading and appealing to
            | authority.
            | 
            | The naming is bad. As other people have already said, you
            | can "google" stuff, you can "deepseek" something, but to
            | "chatgpt" sounds weird.
            | 
            | The model naming is even weirder; like, did they really
            | avoid o2 because of oxygen?
        
             | sumedh wrote:
             | > but to "chatgpt" sounds weird.
             | 
             | People just say it differently, they say "ask chatgpt"
        
               | mrbungie wrote:
               | Obviously they do. That's the whole point.
        
               | gwd wrote:
                | I normally use Claude, and would normally say "Ask
                | Claude", but unless it's someone who knows me well, I
                | say "Ask ChatGPT" or it's just not as clear; and I
                | don't think that's primarily due to popularity.
        
           | FridgeSeal wrote:
           | I think it's success in spite of branding, not because of it.
           | 
           | This naming scheme is a dumpster fire. Every other comment is
           | trying to untangle what the actual hierarchy of model
           | performance is.
        
         | zamadatix wrote:
         | The -mini postfix makes perfect sense, probably even clearer
         | than the old "turbo" wording. Naturally, the latest small model
         | may be better than larger older models... but not always and
         | not necessarily in everything. What you'd expect from a -mini
         | model is exactly what is delivered.
         | 
         | The non-reasoning line was also pretty straightforward. Newer
         | base models get a larger prefix number and some postfixes like
         | 'o' were added to signal specific features in each model
         | variant. Great!
         | 
          | Where things went off the rails was specifically when they
          | decided to also name the reasoning models with an 'o' (for
          | separate reasons), but now as the prefix, while starting a
          | separate linear sequence as the postfix. I wonder if we'll
          | end up with both a 4o and o4...
        
           | lolinder wrote:
           | > I wonder if we'll end up with both a 4o and o4...
           | 
           | The perplexing thing is that _someone_ has to have said that,
           | right? It has to have been brought up in some meeting when
           | they were brainstorming names that if you have 4o and o1 with
            | the intention of incrementing o1 you'll eventually end up
           | with an o4.
           | 
           | Where they really went off the rails was not just bailing
           | when they realized they couldn't use o2. In that moment they
           | had the chance to just make o1 a one-off weird name and go
           | down a different path for its final branding.
           | 
           | OpenAI just struggles with names in general, though. ChatGPT
           | was a terrible name picked by engineers for a product that
           | wasn't supposed to become wildly successful, and they haven't
           | really improved at it since.
        
             | viraptor wrote:
             | The obvious solution could be to just keep skipping the
             | even numbers and go to o5.
        
               | arrowleaf wrote:
               | Or further the hype and name it o9.
        
             | macrolime wrote:
             | And multimodal o4 should be o4o.
        
             | tmnvdb wrote:
              | Probably they are doing so well because there are no
              | endless meetings on customer-friendly names.
        
             | cruffle_duffle wrote:
             | Why not let ChatGPT decide the naming? Surely it will be
             | replacing humans at this task any day now?
        
         | observationist wrote:
         | They should be calling it ChatGPT and ChatGPT-mini, with other
         | models hidden behind some sort of advanced mode power user
         | menu. They can roll out major and minor updates by number. The
         | whole point of differentiating between models is to get users
         | to self limit the compute they consume - rate limits make
         | people avoid using the more powerful models, and if they have a
         | bad experience using the less capable models, or if they're
         | frustrated by hopping between versions without some sort of
         | nuanced technical understanding, it's just a bad experience
         | overall.
         | 
         | OpenAI is so scattered they haven't even bothered using their
         | own state of the art AI to come up with a coherent naming
         | convention? C'mon, get your shit together.
        
           | TeMPOraL wrote:
           | "ChatGPT" (chatgpt-4o) is now its own model, distinct from
           | gpt-4o.
           | 
           | As for self-limiting usage by non-power users, they're
           | already doing that: ChatGPT app automatically picks a model
           | depending on what capabilities you invoke. While they provide
           | a limited ability to see and switch the model in use, they're
           | clearly expecting regular users not to care, and design their
           | app around that.
        
             | observationist wrote:
             | None of that matters to normal users, and you could satisfy
             | power users with serial numbers or even unique ideograms.
             | Naming isn't _that_ hard, and their models are surprisingly
             | adept at it. A consistent naming scheme improves customer
             | experience by preventing confusion - when a new model comes
             | out, I field questions for days from friends and family -
             | "what does this mean? which model should i use? Aww, I have
             | to download another update?" and so on. None of the stated
             | reasons for not having a coherent naming convention for
             | their models are valid. I'd be upset as a stakeholder,
             | they're burning credibility and marketing power for no good
             | reason.
             | 
              | modelname(variant).majorVersion.minorVersion
              | 
              |   ChatGPT(o).3.0
              |   ChatGPT-mini(o).3.0
              |   GPT.2.123
              |   GPT.3.9
             | 
             | And so on. Once it's coherent, people pick it up, and
             | naturally call the model by "modelname majorversion" , and
             | there's no confusion or hesitance about which is which.
             | See, it took me 2 minutes.
             | 
             | Even better: Have an OAI slack discussion company-wide,
             | then have managers summarize their team's discussions into
             | a prompt demonstrating what features they want out of it,
             | then run all the prompts together and tell the AI to put
             | together 3 different naming schemes based on all the
             | features the employees want. Roll out a poll and have
             | employees vote which of the 3 gets used going forward. Or
             | just tap into that founder mode and pick one like a boss.
             | 
             | Don't get me wrong, I love using AI - we are smack dab in
             | the middle of a revolution and normal people aren't quite
             | catching on yet, so it's exhilarating and empowering to be
             | able to use this stuff, like being one of the early users
             | of the internet. We can see what's coming, and if you lived
             | through the internet growing up, you know there's going to
             | be massive, unexpected synergies and developments of
             | systems and phenomena we don't yet have the words for.
             | 
             | OpenAI can do better, and they should.
        
               | TeMPOraL wrote:
               | I agree with your observations, and that they both could
               | and should do better. However, they have the privilege of
               | being _the_ AI company, the most hyped-up brand in the
                | most hyped-up segment of the economy - at this point,
                | the impact of their naming strategy is approximately
                | nil. Sure, they're confusing their users a bit, but
                | their users are _very highly motivated_.
               | 
               | It's like with videogames - most of them commit all kinds
               | of UI/UX sins, and I often wish they didn't, but
               | excepting extreme cases, the players are too motivated to
               | care or notice.
        
         | fourseventy wrote:
         | It's almost as bad as the Xbox naming scheme.
        
           | Someone1234 wrote:
           | I don't know if anything is as bad as a games console named
           | "Series."
        
       | siliconc0w wrote:
       | The real heated contest here amongst the top AI labs is to see
       | who can come up with the most confusing product names.
        
         | not_a_bot_4sho wrote:
         | Someone dropped the ball with Phi models. There is clearly an
         | opportunity for XP and Ultimate and X/S editions.
        
           | lja wrote:
            | I really think an "OpenAI Me" is what's needed.
        
           | baq wrote:
           | Personally waiting for the ME model. Should be great at jokes
           | and humor.
        
         | tdb7893 wrote:
         | It's nice to see Google finally having competition in a space
         | it used to really dominate (though they definitely still are
         | holding their own with all the Gemini naming). I feel like it
         | takes real effort to have product names be this confusing and
         | capricious
        
           | gundmc wrote:
           | Gemini naming seems pretty straightforward at this point. 2.0
           | is the full model, flash is a smaller/faster/cheaper model,
           | and flash thinking is a smaller/faster/cheaper reasoning
           | model with CoT.
        
             | coder543 wrote:
             | > 2.0 is the full model
             | 
             | Not quite. "2.0 Flash" is also called 2.0. The "Pro" models
             | are the full models. But, I love how they have both
             | "gemini-exp-1206" and "gemini-2.0-flash-thinking-
             | exp-01-21". The first one doesn't even say what type of
             | model it is, presumably it should have been
             | "gemini-2.0-pro-exp-1206", but they didn't want to label it
             | that for some reason, and now they're putting a hyphen in
             | the date string where they weren't before.
             | 
             | Not to mention they have both "Flash" and "Flash-8B"...
             | which I think will confuse people. IMO, it should be
             | "Flash-${Parameters}B" for both of them if they're going to
             | mention it for one.
             | 
             | But, I generally think Google's Gemini naming structure has
             | been pretty decent.
        
         | TheOtherHobbes wrote:
         | Surprised Apple hasn't gone with iI Pro Max.
        
       | dilap wrote:
       | Haven't used openai in a bit -- whyyy did they change "system"
       | role (now basically an industry-wide standard) to "developer"?
       | That seems pointlessly disruptive.
        
         | logicchains wrote:
         | They mention in the model card, it's so that they can have a
         | separate "system" role that the user can't change, and they
         | trained the model to prioritise it over the "developer" role,
         | to combat "jailbreaks". Thank God for DeepSeek.
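         | 
         | Concretely, the only change on the wire is the role string.
         | A minimal sketch with the Python SDK (from what I can tell
         | the API still accepts "system" and remaps it for o-series
         | models):
         | 
         |   from openai import OpenAI
         | 
         |   client = OpenAI()
         | 
         |   response = client.chat.completions.create(
         |       model="o3-mini",
         |       messages=[
         |           # "developer" replaces "system" for o-series
         |           # models; a platform-level policy layer that
         |           # users can't touch now outranks it
         |           {"role": "developer", "content": "Reply in French."},
         |           {"role": "user", "content": "Where is the Louvre?"},
         |       ],
         |   )
         |   print(response.choices[0].message.content)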
        
           | sroussey wrote:
            | They should have just created something above system and
            | left it as it was.
        
             | Etheryte wrote:
             | Agreed, just add root and call it a day. Everyone who needs
             | to care can instantly guesstimate what it is.
        
         | BoorishBears wrote:
          | 2 years ago I'd have said it was an oversight, because
          | there's 0 chance a top-down directive would ask for this.
         | 
         | But given how OpenAI employees act online these days I wouldn't
         | be surprised if someone on the ground proposed it as a way to
         | screw with all the 3rd parties who are using OpenAI compatible
         | endpoints or even use OpenAI's SDK in their official docs in
         | some cases.
        
       | kaaskop wrote:
       | How's this compare to Mistral Small 3?
        
         | coder543 wrote:
         | Mistral Small 3 is roughly comparable in capabilities to
         | 4o-mini (apart from 4o-mini's support for multimodality)...
         | o1-mini was already better than GPT-4o (full size) for tasks
         | like writing code, and this is supposedly better than o1 (full
         | size) for those tasks, so... o3-mini is supposedly in a
         | completely different league from Mistral Small 3, and it's not
         | even close.
         | 
         | Of course, the model has only been out for a few hours, so
         | whether it lives up to the benchmarks or not isn't really known
         | yet.
        
       | highfrequency wrote:
       | Anyone else confused by inconsistency in performance numbers
       | between this announcement and the concurrent system card?
       | https://cdn.openai.com/o3-mini-system-card.pdf
       | 
       | For example-
       | 
       | GPQA diamond system card: o1-preview 0.68
       | 
       | GPQA diamond PR release: o1-preview 0.78
       | 
       | Also, how should we interpret the 3 different shading colors in
       | the barplots (white, dotted, heavy dotted on top of white)...
        
         | kkzz99 wrote:
         | Actually sounds like benchslop to me.
        
       | airstrike wrote:
       | Hopefully this is a big improvement from o1.
       | 
       | o1 has been very disappointing after spending sufficient time
       | with Claude Sonnet 3.5. It's like it actively tries to gaslight
       | me and thinks it knows more than I do. It's too stubborn and
        | confidently goes off on tangents, suggesting big changes to parts
       | of the code that aren't the issue. Claude tends to be way better
       | at putting the pieces together in its not-quite-mental-model, so
       | to speak.
       | 
       | I told o1 that a suggestion it gave me didn't work and it said
       | "if it's still 'doesn't work' in your setup..." with "doesn't
       | work" in quotes like it was doubting me... I've canceled my
       | ChatGPT subscription and, when I really need to use it, just go
       | with GPT-4o instead.
        
         | Deegy wrote:
         | I've also noticed that with cGPT.
         | 
         | That said I often run into a sort of opposite issue with
         | Claude. It's very good at making me feel like a genius.
         | Sometimes I'll suggest trying a specific strategy or trying to
         | define a concept on my own, and Claude enthusiastically agrees
         | and takes us down a 2-3 hour rabbit hole that ends up being
          | quite a waste of time for me to backtrack out of.
         | 
         | I'll then run a post-mortem through chatGPT and very often it
         | points out the issue in my thinking very quickly.
         | 
         | That said I keep coming back to sonnet-3.5 for reasons I can't
         | perfectly articulate. Perhaps because I like how it fluffs my
         | ego lol. ChatGPT on the other hand feels a bit more brash. I do
         | wonder if I should be using o1 as my daily driver.
         | 
          | I also don't have enough experience with o1 to determine
          | whether it would take me down the same dead ends.
        
           | bazmattaz wrote:
           | Really interesting point you make about Claude. I've
           | experienced the same. What is interesting is that sometimes
           | I'll question it and say "would it not be better to do it
           | this way" and all of a sudden Claude u-turns and says "yes
           | great idea that's actually a much better approach" which
            | leaves me thinking: are you just stroking my ego? If it's a
            | better approach, then why didn't you suggest it?
           | 
            | However, I have suggested worse approaches on purpose and
            | sometimes Claude does pick them up as less than optimal.
        
           | airstrike wrote:
            | I agree with this, but o1 will _also_ confidently take you
            | into rabbit holes. You'll just feel worse about it lol, and
            | when you ask Claude for a post mortem, it too will quickly
            | find the answer you missed.
            | 
            | The truth is these models are very stochastic; you have to
            | try new chats whenever you even moderately suspect you're
            | going awry.
        
           | mordae wrote:
           | It's a little sycophant.
           | 
           | But the difference is that it actually asks questions. And
           | also that it actually rolls with what you ask it to do. Other
           | models are stubborn and loopy.
        
       | ilaksh wrote:
       | It looks like a pretty significant increase on SWE-Bench.
       | Although that makes me wonder if there was some formatting or
       | gotcha that was holding the results back before.
       | 
       | If this will work for your use case then it could be a huge
       | discount versus o1. Worth trying again if o1-mini couldn't handle
       | the task before. $4/million output tokens versus $60.
       | 
       | https://platform.openai.com/docs/pricing
       | 
       | I am Tier 5 but I don't believe I have access to it in the API
       | (at least it's not on the limits page and I haven't received an
       | email). It says "rolling out to select Tier 3-5 customers" which
       | means I will have to wait around and just be lucky I guess.
        
         | TechDebtDevin wrote:
          | Genuinely curious, what made you choose OpenAI as your
          | preferred API provider? It's always been the least attractive
          | to me.
        
           | TeMPOraL wrote:
           | Until recently they were the only game in town, so maybe they
           | accrued significant spend back then?
        
           | ilaksh wrote:
           | I have mainly been using Claude 3.5/3.6 Sonnet via API in the
           | last several months (or since 3.5 Sonnet came out). However,
           | I was using o1 for a challenging task at one point, but last
           | I tested it had issues with some extra backslashes for that
           | application.
           | 
           | I also have tested with DeepSeek R1 and will test some more
           | with that although in a way Claude 3.6 with CoT is pretty
           | good. Last time I tried to test R1 their API was out.
        
           | ipaddr wrote:
            | Who else might be a good choice? Deepseek is down. Who has
            | the cheapest GPT-3.5-level or above API?
        
             | TechDebtDevin wrote:
              | I've personally been using Deepseek (which has been better
              | than 3.5 for a really long time), and Perplexity, which is
              | nice for their built-in search. I've actually been using
              | Deepseek since it was free, and it's been generally good
              | for me. I've mostly chosen both because of pricing, as I
              | generally don't use APIs for extremely complex prompts.
        
             | Aperocky wrote:
             | Run it locally, the distilled smaller ones aren't bad at
             | all.
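              | 
              | A minimal sketch of that (assuming Ollama running
              | locally with a distilled R1 pulled, e.g. via "ollama
              | pull deepseek-r1:14b"; the tag is just an example):
              | 
              |   import requests
              | 
              |   # Ollama's local chat endpoint; stream=False returns
              |   # one JSON object instead of a token stream
              |   resp = requests.post(
              |       "http://localhost:11434/api/chat",
              |       json={
              |           "model": "deepseek-r1:14b",
              |           "messages": [{"role": "user",
              |                         "content": "Why is the sky blue?"}],
              |           "stream": False,
              |       },
              |   )
              |   print(resp.json()["message"]["content"])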
        
           | eknkc wrote:
            | We extensively used the batch APIs to decrease cost and
            | handle large amounts of data. I also need JSON responses for
            | a lot of things, and OpenAI seems to have the best JSON
            | schema output option out there.
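            | 
            | The JSON schema option, for reference - a minimal sketch
            | using the structured-outputs shape of the chat completions
            | API (the schema and prompt are made up for illustration):
            | 
            |   from openai import OpenAI
            | 
            |   client = OpenAI()
            | 
            |   response = client.chat.completions.create(
            |       model="gpt-4o-mini",
            |       messages=[{"role": "user",
            |                  "content": "Extract: 'Reach me at bob@example.com'"}],
            |       response_format={
            |           "type": "json_schema",
            |           "json_schema": {
            |               "name": "emails",
            |               "strict": True,  # constrain output to schema
            |               "schema": {
            |                   "type": "object",
            |                   "properties": {"emails": {
            |                       "type": "array",
            |                       "items": {"type": "string"}}},
            |                   "required": ["emails"],
            |                   "additionalProperties": False,
            |               },
            |           },
            |       },
            |   )
            |   print(response.choices[0].message.content)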
        
         | TeMPOraL wrote:
         | Tier 3 here and already see it on Limits page, so maybe the
         | wait won't be long.
        
           | ilaksh wrote:
           | Yep, I got an email about o3-mini in the API an hour ago.
        
             | TeMPOraL wrote:
              | I apparently got one at the same time too, but I missed
              | it, distracted by this HN thread :). Not only did I get
              | o3-mini (which I had already noticed on the Limits page),
              | but they also gave me access to o1 now! I'm Tier 3; until
              | yesterday, o1 was still Tier 5 (IIRC).
             | 
             | Thanks OpenAI! Nice gift and a neat distraction from
             | DeepSeek-R1 - which I still can't use directly, because
             | their API stopped working moments after I topped up my
             | credits and generated an API key, and is still down for
             | me... :/.
        
         | sshh12 wrote:
         | Tier 5 and I got it almost instantly
        
       | georgewsinger wrote:
       | Did anyone else notice that o3-mini's SWE bench dropped from 61%
       | in the leaked System Card earlier today to 49.3% in this blog
       | post, which puts o3-mini back in line with Claude on real-world
       | coding tasks?
       | 
       | Am I missing something?
        
         | logicchains wrote:
         | Maybe they found a need to quantize it further for release, or
         | lobotomise it with more "alignment".
        
           | kkzz99 wrote:
           | Or the number was never real to begin with.
        
           | ben_w wrote:
           | > lobotomise
           | 
           | Anyone can write very fast software if you don't mind it
           | sometimes crashing or having weird bugs.
           | 
           | Why do people try to meme as if AI is different? It has
           | unexpected outputs sometimes, getting it to not do that is
           | 50% "more alignment" and 50% "hallucinate less".
           | 
           | Just today I saw someone get the Amazon bot to roleplay furry
           | erotica. Funny, sure, but it's still obviously a bug that a
           | *sales bot* would do that.
           | 
            | And given these models do actually get stuff wrong, is it
            | really _incorrect_ for them to refuse to help with things
            | that might be dangerous if the user isn't already skilled,
            | like Claude in this story about DIY fusion?
           | https://www.corememory.com/p/a-young-man-used-ai-to-
           | build-a-...
        
             | Rastonbury wrote:
             | They are implying the release was rushed and they had to
             | reduce the functionality of the model in order to make sure
             | it did not teach people how to make dirty bombs
        
             | bee_rider wrote:
             | If somebody wants their Amazon bot to role play as an
             | erotic furry, that's up to them, right? Who cares. It is
             | working as intended if it keeps them going back to the site
             | and buying things I guess.
             | 
             | I don't know why somebody would want that, seems annoying.
             | But I also don't expect people to explain why they do this
             | kind of stuff.
        
               | ben_w wrote:
                | It's still a bug. Not really working as intended - it
                | doesn't sell anything that way.
               | 
               | A very funny bug, but a bug nonetheless.
               | 
               | And given this was shared via screenshots, it was done
               | for a laugh.
        
         | jakereps wrote:
          | The caption on the graph explains it:
          | 
          | > including with the open-source Agentless scaffold (39%) and
          | an internal tools scaffold (61%), see our system card.
          | 
          | I have no idea what an "internal tools scaffold" is, but the
          | graph on the card that they link directly to specifies
          | "o3-mini (tools)" where the blog post is talking about the
          | others.
        
           | DrewHintz wrote:
           | I'm guessing an "internal tools scaffold" is something like
           | Goose: https://github.com/block/goose
           | 
           | Instead of just generating a patch (copilot style), it
           | generates the patch, applies the patch, runs the code, and
           | then iterates based on the execution output.
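            | 
            | The loop is simple enough to sketch; ask_model() and
            | apply_patch() below are hypothetical stand-ins (not
            | Goose's actual API), the point is the shape of the
            | iteration:
            | 
            |   import subprocess
            | 
            |   # hypothetical: an LLM call and a working-tree edit
            |   def ask_model(task: str, feedback: str) -> str: ...
            |   def apply_patch(patch: str) -> None: ...
            | 
            |   def iterate_on_task(task, max_rounds=5):
            |       feedback = ""
            |       for _ in range(max_rounds):
            |           patch = ask_model(task, feedback)  # generate
            |           apply_patch(patch)                 # apply
            |           run = subprocess.run(["pytest"],   # run
            |               capture_output=True, text=True)
            |           if run.returncode == 0:
            |               return patch                   # tests pass
            |           feedback = run.stdout + run.stderr # iterate
            |       return None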
        
         | anothermathbozo wrote:
         | I think this is with and without "tools." They explain it in
         | the system card:
         | 
          | > We evaluate SWE-bench in two settings:
          | 
          | > * Agentless, which is used for all models except o3-mini
          | (tools). This setting uses the Agentless 1.0 scaffold, and
          | models are given 5 tries to generate a candidate patch. We
          | compute pass@1 by averaging the per-instance pass rates of
          | all samples that generated a valid (i.e., non-empty) patch.
          | If the model fails to generate a valid patch on every
          | attempt, that instance is considered incorrect.
          | 
          | > * o3-mini (tools), which uses an internal tool scaffold
          | designed for efficient iterative file editing and debugging.
          | In this setting, we average over 4 tries per instance to
          | compute pass@1 (unlike Agentless, the error rate does not
          | significantly impact results). o3-mini (tools) was evaluated
          | using a non-final checkpoint that differs slightly from the
          | o3-mini launch candidate.
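          | 
          | One plausible reading of that pass@1 definition, as a sketch
          | (my interpretation, not OpenAI's code):
          | 
          |   def pass_at_1(instances):
          |       # instances: per-instance lists of attempt results,
          |       # each "pass", "fail", or None (no valid patch)
          |       rates = []
          |       for attempts in instances:
          |           valid = [a for a in attempts if a is not None]
          |           if not valid:
          |               rates.append(0.0)  # no valid patch -> incorrect
          |           else:
          |               rates.append(
          |                   sum(a == "pass" for a in valid) / len(valid))
          |       return sum(rates) / len(rates)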
        
           | georgewsinger wrote:
           | Makes sense. Thanks for the correction.
        
           | Bjorkbat wrote:
           | So am I to understand that they used their internal tooling
           | scaffold on the o3(tools) results only? Because if so, I
           | really don't like that.
           | 
           | While it's nonetheless impressive that they scored 61% on
           | SWE-bench with o3-mini combined with their tool scaffolding,
           | comparing Agentless performance with other models seems less
           | impressive, 40% vs 35% when compared to o1-mini if you look
           | at the graph on page 28 of their system card pdf
           | (https://cdn.openai.com/o3-mini-system-card.pdf).
           | 
            | It just feels like data manipulation to suggest that
            | o3-mini is much more performant than past models. A fairer
            | picture would still show a performance improvement, but it
            | would look less exciting and more incremental.
           | 
           | Of course the real improvement is cost, but still, it kind of
           | rubs me the wrong way.
        
             | pockmarked19 wrote:
             | YC usually says "a startup is the point in your life where
             | tricks stop working".
             | 
             | Sam Altman is somehow finding this out now, the hard way.
             | 
             | Most paying customers will find out within minutes whether
             | the models can serve their use case, a benchmark isn't
             | going to change that except for media manipulation (and
             | even that doesn't work all that well, since journalists
             | don't really know what they are saying and readers can
             | tell).
        
       | OutOfHere wrote:
       | Wake me up when the full o3 is out.
        
         | therein wrote:
         | My guess is it will happen right after Sam Altman's next public
         | freakout about how dangerous this new model they have in store
         | is and how it tried to escape from its confinement and kidnap
         | the alignment operator.
        
           | ls_stats wrote:
           | That's pretty much what Altman said about GPT-3 (or 2, I
           | don't remember), he said it was too dangerous to release to
           | the public.
        
       | msp26 wrote:
       | I wish they'd just reveal the CoT (like gemini and deepseek do),
       | it's very helpful to see when the model gets misled by something
       | in your prompt. Paying for tokens you aren't even allowed to see
       | is peak OpenAI.
        
         | tucnak wrote:
         | I'm sorry, but it's over for OpenAI. Some have predicted this;
         | including me back in November[1] when I wrote "o1 is a
         | revolution in accounting, not capability" which although
         | tongue-in-cheek, has so far turned out to be correct. I'm only
         | waiting to see what Google, Facebook et al. will accomplish now
         | that R1-Zero result is out the bag. The nerve, the cheek of
          | this hysterical o3-mini release--still insisting on hiding
          | the CoT from the consumer--is telling us one thing and one
          | thing alone: OpenAI is no longer able to adapt to the ever-changing
         | landscape. Maybe the Chinese haven't beaten them yet, but
         | Google, Facebook et al. absolutely will, & without having to
         | resort to deception.
         | 
         | [1]:
         | https://old.reddit.com/r/LocalLLaMA/comments/1gna0nr/popular...
        
           | mediaman wrote:
           | You don't need to wait for Google. Their Jan 21 checkpoint
           | for their fast reasoning model is available on AIStudio. It
           | shows full reasoning traces. It's very good, much faster than
           | R1, and although they haven't released pricing, based on
           | flash it's going to be quite cheap.
        
             | tucnak wrote:
             | Sure, their 01-21 reasoning model is really good, but
             | there's no pricing for it!
             | 
             | I care mostly about batching in Vertex AI, which is 17-30x
              | cheaper than the competition (whether you use prompt
             | caching or not) while allowing for audio, video, and
             | arbitrary document filetype inputs; unfortunately Gemini
             | 1.5 Pro/Flash have remained the two so-called "stable"
             | options that are available there. I can appreciate Google's
             | experimental models for all I can, but I cannot take them
             | seriously until they allow me to have my sweet, sweet
             | batches.
        
         | liamwire wrote:
         | sama and OpenAI's CPO Kevin Weil both suggested this is coming
         | soon, as a direct response to DeepSeek, in an AMA a few hours
         | ago: https://www.reddit.com/r/OpenAI/s/EElFfcU8ZO
        
       | kumarm wrote:
        | I ran some quick programming tasks I had previously used o1
        | for:
        | 
        | 1. 1/4th the reasoning time for most tasks.
        | 
        | 2. Far better results.
        
         | CamperBob2 wrote:
         | Compared to o1 or o1-pro?
        
           | yzydserd wrote:
           | A few quick tasks look to me like o3-mini-high is 4-10x
           | faster for 80% of the quality. It gives very good and
           | sufficient fast reasoning about coding tasks, but I think I'd
            | ask o1-pro to do the task, i.e. provide the code. o3-mini-high
           | can keep up with you at thinking / typing speed, whereas
           | o1-pro can take several minutes. Just a quick view after
           | playing for an hour.
        
       | Bjorkbat wrote:
       | I have to admit I'm kind of surprised by the SWE-bench results.
       | At the highest level of performance o3-mini's CodeForces score
       | is, well, high. I've honestly never really sat down to understand
        | how elo works; all I know is that it scored better than o1, which
        | allegedly was better than ~90% of all competitors on CodeForces.
        | So, you know, o3-mini is pretty good at CodeForces.
        | 
        | But its SWE-bench scores aren't meaningfully better than Claude's:
       | 49.3 vs Claude's 49.0 on the public leaderboard (might be higher
       | now due to recent updates?)
       | 
       | My immediate thoughts, CodeForces (and competitive programming in
       | general) is a poor proxy for performance on general software
       | engineering tasks. Besides that, for all the work put into
       | OpenAI's most recent model it still has a hard time living up to
       | an LLM initially released by Anthropic some time ago, at least
       | according to this benchmark.
       | 
       | Mind you, the Github issues that the problems in SWE-bench were
        | based off have been around long enough that it's pretty much a
       | given that they've all found their way into the training data of
       | most modern LLMs, so I'm really surprised that o3 isn't
       | meaningfully better than Sonnet.
        
         | dagelf wrote:
          | I think the innovation here is probably that it's a much
          | smaller, and so cheaper, model to run.
        
         | aprilthird2021 wrote:
         | > My immediate thoughts, CodeForces (and competitive
         | programming in general) is a poor proxy for performance on
          | general software engineering tasks
         | 
         | Yep. A general software engineering task has a lot of
         | information encoded in it that is either already known to a
         | human or is contextually understood by a human.
         | 
         | A competitive programming task often has to provide all the
         | context as it's not based off an existing product or codebase
         | or technology or paradigm known to the user
        
         | vectorhacker wrote:
          | Yeah, I no longer consider SWE-bench useful, because these
          | models can just "memorize" the solutions to the PRs.
        
       | _boffin_ wrote:
       | why is o1-pro not mentioned in there?
        
       | Oras wrote:
       | 200k context window
       | 
       | $1.1/m for input
       | 
       | $4.4/m for output
       | 
        | I assume thinking medium and high would consume more tokens.
       | 
       | I feel the timing is bad for this release especially when
       | deepseek R1 is still peaking. People will compare and might get
       | disappointed with this model.
        
         | GaggiX wrote:
          | The model looks quite a bit better in the benchmarks, so
          | unless they overfit the model on them, it would probably
          | perform better than DeepSeek.
        
           | WiSaGaN wrote:
            | My vibe-check questions suggest otherwise. Even o3-mini-high
            | is not as good as R1, even though it's faster than R1. And
            | considering o3-mini is more expensive per token, it's not
            | clear o3-mini-high is cheaper than R1 either, even if R1
            | probably consumes more tokens per answer.
        
             | kandesbunzler wrote:
             | well in my anecdotal tests, o3 mini (free) performed better
             | than r1
        
               | GaggiX wrote:
               | Also in my coding testing o3 mini (free) is better than
               | r1.
        
               | WiSaGaN wrote:
               | I did math tests. Probably you did coding.
        
         | kandesbunzler wrote:
         | I compared free o3 mini vs Deepseek R1 (on their website) and
         | in my tests o3 performed better every time (did some coding
         | tests)
        
       | IMTDb wrote:
       | I really don't get the point of those oX-mini models for chat
       | apps. (API is different, we can benchmark multiple models for a
        | given recurring task and choose the best one taking costs into
       | consideration). As part of my job, I am trying to promote usage
       | of AI in my company (~150 FTE); we have an OpenAI chatGPT plus
       | subscription for all employees.
       | 
       | Roughly speaking the message is: "use GPT-4o all the time, use o1
       | (soon o3) if you have more complex tasks". What am I supposed to
        | answer when people ask "when am I supposed to use o3-mini? And
        | what the heck is o3-mini-high, how do I know when to use it?".
        | People aren't gonna ask the same question to 5 different models
        | and burn all their rate limits; yet it feels like that's what
        | OpenAI is hoping people will do.
       | 
        | Put those weird models in a sub-menu for advanced users if you
        | really want to, but if you can use o1 there is probably no reason
        | for you to have o3-mini _and_ o3-mini-high as additional options.
        
         | oezi wrote:
          | Why not promote o1? 4o is rather sloppy in comparison.
        
           | IMTDb wrote:
           | 99% of what people use ChatGPT is for very mundane stuff.
           | Think "translate this email to English", "fix spelling
           | mistakes", "write this better for me". Data extraction (list
           | of emails) is big as well. You don't need o1 for that; and
            | people make a lot of those requests per day.
           | 
           | Additionally, o1 does not have access to search and
           | multimodality and taking a screenshot of something and asking
           | questions about it is also a big use case.
           | 
           | It's easy to overlook how widely ChatGPT is used for _very_
           | small stuff. But compounded it's still a game changer for
           | many people.
        
       | xinayder wrote:
       | "oh no DeepSeek copied our product it's not fair"
       | 
       | > proceeds to release a product based on DeepSeek
       | 
       | ah, alas the hypocrisy...
        
         | feznyng wrote:
         | o3 was announced in December. R1 arguably builds off the
         | rumored approach of o1 (LLM + RL) although with major
         | efficiency gains. I'm not a big fan of OpenAI but it's the
         | other way around.
        
         | Rooster61 wrote:
         | The thing they previewed back in December before the whole
         | Deepseek kerfuffle this week?
         | 
         | Don't get me wrong, I'm laughing at OpenAI just like everyone
         | else, but if they were really copying Deepseek, they'd be
         | releasing a smaller model distilled from Deepseek API
         | responses, and have it be open source to boot. This is neither.
        
       | yapyap wrote:
       | They sure scrambled something together after DeepSeek swept the
       | market.
        
         | GoatInGrey wrote:
         | Indeed. Everyone knows that one can cobble together a frontier
         | model and deploy it within three weeks.
        
           | TechDebtDevin wrote:
           | Not to mention the model has been available to researchers
           | for a month.
        
       | mhb wrote:
       | Maybe they can get some advice from the AWS instance naming
       | group.
        
       | og_kalu wrote:
       | R1 seems to be the only one of these reasoning models that has
       | had gains on the creative writing side.
        
         | nimithryn wrote:
         | Am I the only one who thinks that R1 is _awful_ at creative
         | writing? I've seen a lot of very credulous posts on twitter
         | that are super excited about excerpts written by DeepSeek that
         | I think are absolutely abysmal. Am I alone in this? Maybe
         | people have very different tastes than I do?
         | 
         | (I have no formal training in creative writing, though I do
         | read a lot of literature. Not claiming my tastes are superior -
         | genuinely curious if other people disagree).
        
           | og_kalu wrote:
           | I mean, do you think this is awful?
           | 
           | https://pastebin.com/Ja14mt6L
        
       | throwaway314155 wrote:
       | Typical OpenAI release announcement where it turns out they're
       | _actually_ doing some sort of delayed rollout and despite what
       | the announcement says, no - you can't use o3-mini today.
        
       | feverzsj wrote:
       | It's been a dead end for a while now, as they can't improve o1
       | meaningfully anymore. The market is also losing patience quickly.
        
       | czk wrote:
       | im just glad it looks like o3-mini finally has internet access
       | 
       | the o1 models were already so niche that i never used them, but
       | not being able to search the web made them even more useless
        
       | oytis wrote:
       | Let me guess - everyone is mindblown.
        
       | estsauver wrote:
       | I couldn't find anything in the documentation describing the
       | relative number of tokens that you get for low/medium/high. If
       | anyone can find that, I'd be curious to see how it plays out
       | relative to DeepSeek's thinking sections.
        
       | isusmelj wrote:
       | Does anyone know why GPT4 has knowledge cutoff December 2023 and
       | all the other models (newer ones like 4o, O1, O3) seem to have
       | knowledge cutoff October 2023?
       | https://platform.openai.com/docs/models#o3-mini
       | 
       | I understand that keeping the same data and curating it might be
       | beneficial. But it sounds odd to roll back in time with the
       | knowledge cutoff. AFAIK, the only event that happened around that
       | time was the start of the Gaza conflict.
        
         | kikki wrote:
         | I think trained knowledge is less and less important - as these
         | multi-modal models have the ability to search the web and have
         | much larger context windows.
        
       | andrewstuart wrote:
       | I find Claude to be vastly better than any OpenAI model as a
       | programming assistant.
       | 
       | In particular the "reasoning" models just seem to be less good
       | and more slow.
        
       | chad1n wrote:
       | I think that OpenAI should reduce the prices even further to be
       | competitive with Qwen or Deepseek. There are a lot of vendors
       | offering Deepseek R1 for $2-2.5 per 1 million tokens output.
        
         | othello wrote:
         | Would you have specific recommendations of such vendors?
        
           | chad1n wrote:
           | For example, `https://deepinfra.com/` which asks for $2.5 per
           | million on output or https://nebius.com which asks for $2.4
           | per million output tokens.
        
             | BoorishBears wrote:
             | As the sibling comment mentions, you're not getting
             | anything production grade for less than $7 per million and
             | that's on _input and output_.
             | 
             | Nebius is single digit TPS. _31 seconds_ to reply to
             | "What's 1+1".
             | 
             | Hopefully Deepseek will make it out of their current
             | situation because in a very ironic way, the thing the
             | entire market lost its mind over is not actually usable at
             | the pricing that drove the hype:
             | https://openrouter.ai/deepseek/deepseek-r1
        
           | druskacik wrote:
           | Well, it's $2.19 per million output tokens even directly on
           | the deepseek platform.
           | 
           | https://api-docs.deepseek.com/quick_start/pricing/
        
             | BoorishBears wrote:
             | Their API platform has been down for 48 hours at this point
        
           | rsanek wrote:
           | If you want reliable service you're going to pay more, around
           | $7~8 per million tokens. Sibling commenters mention providers
           | that are considered unstable:
           | https://openrouter.ai/deepseek/deepseek-r1
        
       | secondcoming wrote:
       | Anyone else stuck in a Cloudflare 'verify you're a human' doom
       | loop?
        
       | tempeler wrote:
       | They made a discount; it's very impressive; they probably found a
       | very efficient way, so it's discounted. I guess there's no need
       | to build a very large nuclear power plant or a $9 trillion chip
       | factory to run a single large language model. Efficiency has
       | skyrocketed, or, thanks to competition, all of OpenAI's problems
       | were solved.
        
       | jen729w wrote:
       | > Testers preferred o3-mini's responses to o1-mini 56% of the
       | time
       | 
       | I hope by this they don't mean me, when I'm asked 'which of these
       | two responses do you prefer'.
       | 
       | They're both 2,000 words, and I asked a question because I have
       | something to do. _I'm not reading them both_; I'm usually just
       | selecting the one that answered first.
       | 
       | That prompt is pointless. Perhaps as evidenced by the essentially
       | 50% response rate: it's a coin-flip.
        
         | danielmarkbruce wrote:
         | RLUHF, U = useless.
        
         | brookst wrote:
         | Those prompts are so irritating and so frequent that I've taken
         | to just quickly picking whichever one looks worse at a cursory
         | glance. I'm paying them, they shouldn't expect high quality
         | work from me.
        
           | apparent wrote:
           | Have you considered the possibility that your feedback is
           | used to choose what type of response to give to you
           | specifically in the future?
           | 
           | I would not consider purposely giving inaccurate feedback for
           | this reason alone.
        
             | isaacremuant wrote:
             | Alternatively, I'll use the tool that is most user friendly
             | and provides the most value for my money.
             | 
             | Wasting time on an anti-pattern is not valuable, nor is
             | trying to outguess the way that selection mechanism is
             | used.
        
             | MattDaEskimo wrote:
             | I don't want a model that's customized to my preferences.
             | My preferences and understanding change all the time.
             | 
             | I want a single source model that's grounded in base truth.
             | I'll let the model know how to structure it in my prompt.
        
               | szundi wrote:
               | Constant meh and fixing prompts in the right direction vs
               | being unable to escape the bubble
        
               | kenjackson wrote:
               | You know there's no such thing as base truth here? You
               | want to write something like this to start your prompts, "Respond
               | in English, using standard capitalization and
               | punctuation, following rules of grammar as written by
               | Strunk & White, where numbers are represented using
               | arabic numerals in base 10 notation...."???
        
               | AutistiCoder wrote:
               | actually, I might appreciate that.
               | 
               | I like precision of language, so maybe just have a system
               | prompt that says "use precise language (ex: no symbolism
               | of any kind)"
        
               | MattGaiser wrote:
               | A lot of preferences have nothing to do with any truth.
               | Do you like code segments or full code? Do you like
               | paragraphs or bullet points? Heck, do you want English or
               | Japanese?
        
               | orbital-decay wrote:
               | What is base truth for e.g. creative writing?
        
             | francis_lewis wrote:
             | I think my awareness that this may influence future
             | responses has actually been detrimental to my response
             | rate. The responses are often so similar that I can imagine
             | preferring either in specific circumstances. While I'm sure
             | that can be guided by the prompt, I'm often hesitant to
             | click on a specific response as I can see the value of the
             | other response in a different situation and I don't want to
             | bias the future responses. Maybe with more specific
             | prompting this wouldn't be such an issue, or maybe more of
             | an understanding of how inter-chat personalisation is
             | applied (maybe I'm missing some information on this too).
        
             | Der_Einzige wrote:
             | Spotted the pissed off OpenAI RLHF engineer! Hahahahaha!
        
           | Tenoke wrote:
           | That's such a counter-productive and frankly dumb thing to
           | do. Just don't vote on them.
        
             | explain wrote:
             | You have to pick one to continue the chat.
        
               | apparent wrote:
               | Why not always pick the one on the left, for example? I
               | understand wanting to speed through and not spend time
               | doing labor for OpenAI, but it seems counter-productive
               | to spend any time feeding it false information.
        
               | brookst wrote:
               | My assumption is they measure the quality of user
               | feedback, either on a per user basis or in an aggregate.
               | I want them to interrupt me less, so I want them to
               | either decide I'm a bad teacher or that users in general
               | are bad teachers.
        
               | DiggyJohnson wrote:
               | I know for a fact that as of yesterday I did not have to
               | pick one to continue the conversation. It just maximized
               | the second choice and displayed a 2/2 below the response.
        
         | jackbrookes wrote:
         | Yes I'd bet most users just 50/50 it, which actually makes it
         | more remarkable that there was a 56% selection rate
        
           | cgriswald wrote:
           | I read the one on the left but choose the shorter one.
           | 
           | The interface wastes so much screen real estate already and
           | the answers are usually overly verbose unless I've given
           | explicit instructions on how to answer.
        
             | ljm wrote:
             | The default level of verbosity you get without explicitly
             | prompting for it to be succinct makes me think there's an
             | office full of workers getting paid by the token.
        
               | internetter wrote:
               | In my experience the verbosity significantly improves
               | output quality
        
         | johnneville wrote:
         | they also pay contractors to do these evaluations with much
         | more detailed metrics; no idea which one their number is based
         | on, though
        
         | dkjaudyeqooe wrote:
         | It's kind of strange that they gave that stat. Maybe they
         | thought people would somehow read it as "56% better" or
         | something.
         | 
         | Because when you think about it, it really is quite damning.
         | Minus statistical noise it's no better.
        
           | fsndz wrote:
           | exactly I was surprised as well
        
           | Powdering7082 wrote:
           | That would be 12%, why would you assume that is eaten by
           | statistical noise?
        
             | senorrib wrote:
             | The OP's comment is probably a testament to that. With such
             | a poorly designed A/B test I doubt this has a p-value of <
             | 0.10.
        
               | throwaway287391 wrote:
               | Erm, why not? A 0.56 result with n=1000 ratings is
               | statistically significantly better than 0.5 with a
               | p-value of 0.00001864, well beyond any standard
               | statistical significance threshold I've ever heard of. I
               | don't know how many ratings they collected but 1000
               | doesn't seem crazy at all. Assuming of course that raters
               | are blind to which model is which and the order of the 2
               | responses is randomized with every rating -- or, is that
               | what you meant by "poorly designed"? If so, where do they
               | indicate they failed to randomize/blind the raters?
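               | 
               | A quick way to sanity-check numbers like this, as a
               | minimal Python sketch; n=1000 is my hypothetical above,
               | not a figure OpenAI has published:
               | 
               |     # Exact one-sided binomial test: chance of seeing 560
               |     # or more "prefer o3-mini" votes out of 1000 if every
               |     # rater were flipping a fair coin.
               |     from math import comb
               | 
               |     n, k = 1000, 560
               |     p = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
               |     print(f"one-sided p-value: {p:.2e}")
               | 
               | Whatever the exact figure, a tail probability in this
               | vicinity is far below any conventional significance
               | threshold.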
        
               | n2d4 wrote:
               | Because you're not testing "will a user click the left or
               | right button" (for which asking a thousand users to click
               | a button would be a pretty good estimation), you're
               | testing "which response is preferred".
               | 
               | If 10% of people just click based on how fast the
               | response was because they don't want to read both
               | outputs, your p-value for the latter hypothesis will be
               | atrocious, no matter how large the sample is.
        
               | johnmaguire wrote:
               | > If 10% of people just click based on how fast the
               | response was
               | 
               | Couldn't this be considered a form of preference?
               | 
               | Whether it's the type of preference OpenAI was testing
               | for, or the type of preference you care about, is another
               | matter.
        
               | n2d4 wrote:
               | Sure, it could be, you can define "preference" as
               | basically anything, but it just loses its meaning if you
               | do that. I think most people would think "56% prefer this
               | product" means "when well-informed, 56% of users would
               | rather have this product than the other".
        
               | throwaway287391 wrote:
               | Yes, I am assuming they evaluated the models in good
               | faith, understand how to design a basic user study, and
               | therefore when they ran a study intended to compare the
               | response quality between two different models, they
               | showed the raters both fully-formed responses at the same
               | time, regardless of the actual latency of each model.
        
               | n2d4 wrote:
               | I would recommend you read the comment that started this
               | thread then, because that's the context we're talking
               | about: https://news.ycombinator.com/item?id=42891294
        
               | throwaway287391 wrote:
               | I did read that comment. I don't think that person is
               | saying they were part of the study that OpenAI used to
               | evaluate the models. They would probably know if they had
               | gotten paid to evaluate LLM responses.
               | 
               | But I'm glad you pointed that out; I now suspect that a
               | large part of the disagreement between the "huh? a
               | statistically significant blind evaluation is a
               | statistically significant blind evaluation" and "oh, this
               | was obviously a terrible study" repliers is due to
               | different interpretations of that post. Thanks. I
               | genuinely didn't consider the alternative interpretation
               | before.
        
               | godelski wrote:
               | > If so, where do they indicate they failed to
               | randomize/blind the raters?
               | 
               |     Win rate if user is under time constraint
               | 
               | This is hard to read tbh. Is it STEM? Non-STEM? If it is
               | STEM then this shows there is a bias. If it is Non-STEM
               | then this shows a bias. If it is a mix, well we can't
               | know anything without understanding the split.
               | 
               | Note that Non-STEM is still within error. STEM is less
               | than 2 sigma variance, so our confidence still shouldn't
               | be that high.
        
             | aqme28 wrote:
             | They even include error bars. It doesn't seem to be
             | statistical noise, but it's still not great.
        
           | afro88 wrote:
           | Yeah. I immediately thought: I wonder if that 56% is in one
           | or two categories and the rest are worse?
        
             | rvnx wrote:
               | 44% of the people prefer the existing model?
        
               | KHRZ wrote:
               | With many people too lazy to read 2 walls of text, a lot
               | of picks might be random.
        
               | afro88 wrote:
               | Each question falls into a different category (ie math,
               | coding, story writing etc). Typically models are better
               | at some categories and worse at others. Saying "56% of
               | people preferred responses from o3-mini" makes me wonder
               | if those 56 are only from certain categories and the
               | model isn't uniformly 56% preferred.
        
           | cm2187 wrote:
           | And another way to rephrase it is that almost half of the
           | users prefer the older model, which is terrible PR.
        
             | tgsovlerkhgsel wrote:
             | Not if the goal is to claim that the models deliver
             | comparable quality, but with the new one excelling at
             | something else (here: inference cost).
        
             | kettleballroll wrote:
             | Typically in these tests you have three options "A is
             | better", "B is better" or "they're equal/can't decide". So
             | if 56% prefer o3-mini, it's likely that way less than half
             | prefer o1. Also, the way I understand it, they're comparing
             | a mini model with a large one.
        
               | directevolve wrote:
               | If you use ChatGPT, it sometimes gives you two versions
               | of its response, and you have to choose one or the other
               | if you want to continue prompting. Sure, not picking a
               | response might be a third category. But if that's how
               | they were approaching the analysis, they could have put
               | out a more favorable-looking stat.
        
               | ignoramous wrote:
               | > _If you use ChatGPT, it sometimes gives you two
               | versions_
               | 
               | Does no one else hate it when this happens (especially
               | when on a handheld device)?
        
           | m3kw9 wrote:
           | It's 3x cheaper and faster
        
         | mikeInAlaska wrote:
         | Maybe we should take both answers, paste them into a new chat
         | and ask for a summary amalgamation of them
        
         | sharkweek wrote:
         | Funny - I had ChatGPT document some stuff for me this week and
         | asked which responses I preferred as well.
         | 
         | Didn't bother reading either of them, just selected one and
         | went on with my day.
         | 
         | If it were me I would have set up a "hey do you mind if we give
         | you two results and you can pick your favorite?" prompt to weed
         | out people like me.
        
           | usef- wrote:
           | I'm surprised how many people claim to do this. You can just
           | not select one.
        
             | rubyn00bie wrote:
             | I think it's somewhat natural and am not personally
             | surprised. It's easy to quickly select an option that has
             | no consequence, compared to actively considering that not
             | selecting something is an option. Not selecting something
             | feels more like actively participating than just checking a
             | box and moving on. /shrug
        
             | ssl-3 wrote:
             | We -- the people who live in front of a computer -- have
             | been training ourselves to avoid noticing annoyances like
             | captchas, advertising, and GDPR notices for quite a long
             | time.
             | 
             | We find what appears to be the easiest combination "Fuck
             | off, go away" buttons and use them without a moment of
             | actual consideration.
             | 
             | (This doesn't mean that it's actually the easiest method.)
        
               | grahamj wrote:
               | I can't even believe how many times in a day I
               | frustratedly think "whatever, go away!"
        
           | apparent wrote:
           | I wonder if they down-weight responses that come in too fast
           | to be meaningful, or without sufficient scrolling.
        
           | losteric wrote:
           | That's fine. Your random click would be balanced by someone
           | else randomly clicking
        
         | danilocesar wrote:
         | I almost always pick the second one, because it's closer to the
         | submit button and the one I read first.
        
         | arijo wrote:
         | People could be flipping a coin and the score would be the
         | same.
        
           | brianstrimp wrote:
           | A 12% margin is literally the opposite of a coin flip. Unless
           | you have a really bad coin.
        
             | buggy6257 wrote:
             | You're being downvoted for 3 reasons:
             | 
             | 1) Coming off as a jerk, and from a new account is a bad
             | look
             | 
             | 2) "Literally the opposite of a coin flip" would probably
             | be either 0% or 100%
             | 
             | 3) Your reasoning doesn't stand up without further info; it
             | entirely depends on the sample size. I could have 5 coin
             | flips all come up heads, but over thousands or millions it
             | averages to 50%. 56% on a small sample size is absolutely
             | within margin of error/noise. 56% on a MASSIVE sample size
             | is _statistically_ significant, but isn't even still that
             | much to brag about for something that I feel like they
             | probably intended to be a big step forward.
        
               | brianstrimp wrote:
               | I'm a little puzzled by your response.
               | 
               | 1. The message was net-upvoted. Whether there are
               | downvotes in there I can't tell, but the final karma is
               | positive. A similarly spirited message of mine in the
               | same thread was quite well received as well.
               | 
               | 2. I can't see how my message would come across as a
               | jerk? I wrote 2 simple sentences, not using any offensive
               | language, stating a mere fact of statistics. Is that
               | being jerk? And a long-winded berating of a new member of
               | the community isn't?
               | 
               | 3. A coin flip is 50%. Anything else is not, once you
               | have a certain sample size. So, this was not. That was my
               | statement. I don't know why you are building a strawman
               | of 5 coin flips. 56% vs 44% is a margin of 12%, as I
               | stated, and with a huge sample size, which they had,
               | that's _massive_ in a space where the returns are deep in
               | "diminishing" territory.
        
         | teeray wrote:
         | This prompt is like "See Attendant" on the gas pump. I'm just
         | going to use another AI instead for this chat.
        
           | ninkendo wrote:
           | Glad to know I'm not the only person who just drives to the
           | next station when I see a "see attendant" message.
        
         | janalsncm wrote:
         | > I'm usually just selecting the one that answered first
         | 
         | Which is why you randomize the order. You aren't a tester.
         | 
         | 56% vs 44% may not be noise. That's why we have p values. It
         | depends on sample size.
        
           | jhardy54 wrote:
           | The order doesn't matter. They often generate tokens at
           | different speeds, and produce different lengths of text. "The
           | one that answered first" != "The first option"
        
         | letmevoteplease wrote:
         | The article says "expert testers."
         | 
         | "Evaluations by expert testers showed that o3-mini produces
         | more accurate and clearer answers, with stronger reasoning
         | abilities, than OpenAI o1-mini. Testers preferred o3-mini's
         | responses to o1-mini 56% of the time and observed a 39%
         | reduction in major errors on difficult real-world questions."
        
         | resters wrote:
         | I too have questioned the approach of showing the long side-by-
         | side answers from two different models.
         | 
         | 1) sometimes I wanted the short answer, and so even though the
         | long answer is better I picked the short one.
         | 
         | 2) sometimes both contain code that is different enough that I
         | am inclined to go with the one that is more similar to what I
         | already had, even if the other approach seems a bit more solid.
         | 
         | 3) Sometimes one will have less detail but more big picture
         | awareness and the other will have excellent detail but miss
         | some overarching point that is valuable. Depending on my mood I
         | sometimes choose but it is annoying to have to do so because I
         | am not allowed to say why I made the choice.
         | 
         | The area of human training methodology seems to be a big part
         | of what got Deepseek's model so strong. I read the explanation
         | of the test results as an acknowledgement by OpenAI of some
         | weaknesses in its human feedback paradigm.
         | 
         | IMO the way it should work is that the thumbs up or down should
         | be read in context by a reasoning being and a more in-depth
         | training case should be developed that helps future models
         | learn whatever insight the feedback should have triggered.
         | 
         | Feedback that A is better or worse than B is definitely not (in
         | my view) sufficient except in cases where a response is a total
         | dud. Usually the responses have different strengths and
         | weaknesses and it's pretty subjective which one is better.
        
         | brianstrimp wrote:
         | That makes the result stronger though. _Even though_ many
         | people click randomly, there is _still_ a 12% margin between
         | both groups. Not the world, but still quite a lot.
        
         | dionian wrote:
         | i enjoy it, i like getting two answers for free - often one of
         | them is significantly better and probably the newer model
        
         | mullingitover wrote:
         | You know you can configure default instructions to your
         | prompts, right?
         | 
         | I have something like "always be terse and blunt with your
         | answers."
        
         | gcanyon wrote:
         | I don't think they make it clear: I wonder if they mean testers
         | prefer o3 mini 56% of the time _when they express an opinion_ ,
         | or overall? Some percentage of people don't choose; if that
         | number is 10% and they aren't excluded, that means 56% of the
         | time people prefer o3 mini, 34% of the time people prefer o1
         | mini, and 10% of the time people don't choose. I'm not sure I
         | think it would be reasonable to present the data that way, but
         | it seems possible.
        
         | ricardobeat wrote:
         | This is just a way to prove, statistically, that one model is
         | better than another as part of its validation. It's not
         | collected from normal people using ChatGPT, you don't ever get
         | shown two responses from different models at once.
        
           | yawnxyz wrote:
           | Wait what? I get shown this with ChatGPT maybe 5% of the time
        
             | nearbuy wrote:
             | Those are both responses from the same model. It's not one
             | response from o1 and another from o3.
        
         | directevolve wrote:
         | Also, it's not clear if the preference comes from the quality
         | of the 'meat' of the answer, or the way it reports its thinking
         | and the speed with which it responds. With o1, I get a marked
         | feeling of impatience waiting for it to spit something out, and
         | the 'progress of thought' is in faint grey text I can't read.
         | With o3, the 'progress of thought' comes quickly, with more to
         | read, and is more engaging even if I don't actually get
         | anything more than entertainment value.
         | 
         | I'm not going to say there's nothing substantive about o3 vs.
         | o1, but I absolutely do not put it past Sam Altman to juice the
         | stats every chance he gets.
        
         | energy123 wrote:
         | Then 56% is even more impressive. Example: if 80% choose
         | randomly and 20% choose carefully, that implies an 80%
         | preference rate for o3-mini among the careful raters
         | (0.8*0.2 + 0.5*0.8 = 0.56)
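         | 
         | A minimal sketch of that back-of-envelope calculation (the
         | 80/20 split is my hypothetical, not measured data):
         | 
         |     # Back out the preference rate among careful raters,
         |     # assuming some fraction of raters click 50/50 at random.
         |     observed = 0.56    # published preference for o3-mini
         |     random_frac = 0.8  # hypothetical fraction clicking randomly
         |     careful = (observed - 0.5 * random_frac) / (1 - random_frac)
         |     print(careful)     # ~0.8, i.e. 80% among careful raters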
        
         | shombaboor wrote:
         | It seems like the first response must get chosen a majority
         | of the time just to account for friction
        
       | EcommerceFlow wrote:
       | First thing I noticed on API and Chat for it is THIS THING IS
       | FAST. That alone makes it a huge upgrade to o1-pro (not really
       | comparable I know, just saying). Can't imagine how much I'll get
       | done with this type of speed.
        
       | GaggiX wrote:
       | The API pricing is almost exactly double DeepSeek's.
        
         | cheema33 wrote:
         | I like deepseek a lot. But they are currently very glitchy. The
         | API service goes up and down a lot. Maybe they'll sort that out
         | soon.
        
           | orbital-decay wrote:
           | Apparently they're under a very targeted DDoS for almost a
           | month, with technical details shared in Chinese but very
           | little discussion in English. Which is surprising, it's not
           | like major AI products are getting DDoSed out of existence
           | every day.
        
             | nmfisher wrote:
             | Where are the details in Chinese?
        
             | bearjaws wrote:
             | Almost all of them are protected by cloudflare if you look.
             | 
             | My guess is Deepseek didn't implement anti-DDOS until way
             | too late.
        
       | mise_en_place wrote:
       | Too little too late IMO. This is not impressive at all, what am I
       | missing here?
        
         | ben_w wrote:
         | There's only two kinds of software, prototype and obsolete.
         | 
         | I was taught that last millennium.
        
           | esafak wrote:
           | That's not true. Is Google Maps a prototype or obsolete?
        
             | ben_w wrote:
             | The website or the database? I'd say the former is obsolete
             | and the latter is still a prototype.
        
               | esafak wrote:
               | I understand that saas products are constantly evolving
               | but this is an unusual definition of obsolescence and
               | prototypes. Google Maps has been running like a tank for
               | two decades, and it is pretty feature complete.
        
         | jstummbillig wrote:
         | Idk, everything: The price point + performance?
        
         | sumedh wrote:
         | > This is not impressive at all, what am I missing here?
         | 
         | Compared to?
        
       | RobinL wrote:
       | Wow - this is seriously fast (o3-mini), and my initial
       | impressions are very favourable. I was asking it to lay out quite
       | a complex html form from a schema and it did a very good job.
       | 
       | Looking at the comments on here and the benchmark results I was
       | expecting it to be a bit meh, but initial impressions are quite
       | the opposite
       | 
       | I was expecting it to perhaps be a marginal improvement for
       | complex things that need a lot of 'reasoning', but it seems it's
       | a big improvement for simple things that you need doing fast
        
         | bn-l wrote:
         | It's 2x the price of R1:
         | https://x.com/deedydas/status/1885440582103031940/photo/1
         | 
         | Is it twice as good though?
        
           | RobinL wrote:
           | Whilst I had tried R1 before, I hadn't paid attention to how
           | fast it was. I just tried some similar prompts and was pretty
           | impressed with speed and quality. I think o3-mini was still a
           | bit quicker though.
        
       | AISnakeOil wrote:
       | The naming convention is so messed up. o1, o3-mini (no o2, no
       | o3???)
        
         | igravious wrote:
         | https://www.perplexity.ai/search/new?q=list%20of%20all%20Ope...
         | :)
         | 
         | OpenAI has developed a variety of models that cater to
         | different applications, from natural language processing to
         | image generation and audio processing. Here's a comprehensive
         | list of the current models available:
         | 
         | ## Language Models
         | 
         | - *GPT-4o*: The flagship model capable of processing text,
         |   images, and audio.
         | - *GPT-4o mini*: A smaller, more cost-effective version of
         |   GPT-4o.
         | - *GPT-4*: An advanced model that improves upon GPT-3.5.
         | - *GPT-3.5*: A set of models that enhance the capabilities of
         |   GPT-3.
         | - *GPT-3.5 Turbo*: A faster variant designed for efficiency in
         |   chat applications.
         | 
         | ## Reasoning Models
         | 
         | - *o1*: Focused on reasoning tasks with improved accuracy.
         | - *o1-mini*: A lightweight version of the o1 model.
         | - *o3*: The successor to o1, currently in testing phases.
         | - *o3-mini*: A lighter version of the o3 model.
         | 
         | ## Audio Models
         | 
         | - *GPT-4o audio*: Supports real-time audio interactions and
         |   audio generation.
         | - *Whisper*: For transcribing and translating speech to text.
         | 
         | ## Image Models
         | 
         | - *DALL-E*: Generates images from textual descriptions.
         | 
         | ## Embedding Models
         | 
         | - *Embeddings*: Converts text into numerical vectors for
         |   similarity tasks.
         | - *Ada*: An embedding model with various sizes (e.g., ada-002).
         | 
         | ## Additional Models
         | 
         | - *Text to Speech (Preview)*: Synthesizes spoken audio from
         |   text.
         | 
         | These models are designed for various tasks, including coding
         | assistance, image generation, and conversational AI, making
         | OpenAI's offerings versatile for developers and businesses
         | alike[1][2][4][5].
         | 
         | Citations:
         | 
         | [1] https://learn.microsoft.com/vi-vn/azure/ai-services/openai/concepts/models
         | [2] https://platform.openai.com/docs/models
         | [3] https://llm.datasette.io/en/stable/openai-models.html
         | [4] https://en.wikipedia.org/wiki/OpenAI_API
         | [5] https://industrywired.com/open-ai-models-list-top-models-to-consider/
         | [6] https://holypython.com/python-api-tutorial/listing-all-available-openai-models-openai-api/
         | [7] https://en.wikipedia.org/wiki/GPT-3
         | [8] https://stackoverflow.com/questions/78122648/openai-api-how-do-i-get-a-list-of-all-available-openai-models/78122662
        
         | ben_w wrote:
         | There's an o1-mini, there's an o3 it just hasn't gone live yet:
         | https://openai.com/12-days/#day-12
         | 
         | they can't call it o2 because:
         | https://en.wikipedia.org/wiki/The_O2_Arena
         | 
         | and the venue's sponsor: https://en.wikipedia.org/wiki/O2_(UK)
        
         | sumedh wrote:
         | o3 will come later.
         | 
         | o2 was not selected because there is already another brand with
         | that name in the UK
        
       | thimabi wrote:
       | Does anyone know the current usage limits for o3-mini and
       | o3-mini-high when used through the ChatGPT interface? I tried to
       | find them on the OpenAI Knowledgebase, but couldn't find anything
       | about that.
        
         | keenmaster wrote:
         | For Plus users the limits are:
         | 
         | o3-mini-high: 50 messages per week (just like o1, but it seems
         | like these are non-shared limits, so you can have 50 messages
         | per week with o1, run out, and still have 50 messages with
         | o3-mini-high to use)
         | 
         | o3-mini: 150 messages per day
         | 
         | Source for the latter is their press release. They were more
         | vague about o3-mini-high, but people have already tested its
         | limits just by using it, and got the pop-up for 25 messages
         | left after sending 25 messages.
         | 
         | It's nice not to worry about running out of o1 messages now and
         | have a faster model that's mostly as good (potentially better
         | in some areas?). OpenAI really needs to release a middle tier
          | for $30 to $40 though that has the same models as Pro but
         | without infinite usage. I hate not having the smartest model
         | and I don't want to pay $200; there's probably a middle ground
         | where they can make as much or more money from me on a
         | subscription tier that gives limited access to o1-pro.
        
       | scarface_74 wrote:
       | This took 1:53 in o3-mini
       | 
       | https://chatgpt.com/share/679d310d-6064-8010-ba78-6bd5ed3360...
       | 
       | The 4o model without using the Python tool
       | 
       | https://chatgpt.com/share/679d32bd-9ba8-8010-8f75-2f26a792e0...
       | 
       | Trying to get accurate results with the paid version of 4o with
       | the Python interpreter.
       | 
       | https://chatgpt.com/share/679d31f3-21d4-8010-9932-7ecadd0b87...
       | 
       | The share link doesn't show the output for some reason. But it
       | did work correctly. I don't know whether the ages are correct. I
       | was testing whether it could handle ordering
       | 
       | I have no idea what conclusion I should draw from this besides
       | depending on the use case, 4o may be better with "tools" if you
       | know your domain where you are using it.
       | 
       | Tools are relatively easy to implement with LangChain or the
       | native OpenAI SDK.
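       | 
       | For illustration, a minimal sketch of a tool definition with the
       | OpenAI Python SDK (the function name and schema here are made
       | up, not taken from the runs above):
       | 
       |     from openai import OpenAI
       | 
       |     client = OpenAI()
       | 
       |     # Describe a function the model may call instead of doing
       |     # the ordering itself.
       |     tools = [{
       |         "type": "function",
       |         "function": {
       |             "name": "sort_presidents_by_age",
       |             "description": "Sort presidents by age at inauguration",
       |             "parameters": {
       |                 "type": "object",
       |                 "properties": {
       |                     "names": {"type": "array",
       |                               "items": {"type": "string"}},
       |                 },
       |                 "required": ["names"],
       |             },
       |         },
       |     }]
       | 
       |     response = client.chat.completions.create(
       |         model="gpt-4o",
       |         messages=[{"role": "user",
       |                    "content": "Order the presidents by age."}],
       |         tools=tools,
       |     )
       |     # If the model opts to use the tool, the arguments arrive
       |     # as JSON for your own code to execute.
       |     print(response.choices[0].message.tool_calls)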
        
         | margalabargala wrote:
         | The 4o model's output is blatantly wrong. I'm not going to look
         | up if it's the order or the ages that are incorrect, but:
         | 
         | 36. Abraham Lincoln - 52 years, 20 days (1861)
         | 
         | 37. James Garfield - 49 years, 105 days (1881)
         | 
         | 38. Lyndon B. Johnson - 55 years, 87 days (1963)
         | 
         | Basically everything after #15 in the list is scrambled.
        
           | scarface_74 wrote:
           | That was the point. The 4o model without using Python was
           | wrong. The o3 model worked correctly without needing an
           | external tool
        
         | BeetleB wrote:
         | I would not expect any LLM to get this right. I think people
         | have too high expectations for it.
         | 
         | Now if you asked it to write a Python program to list them in
         | order, and have it enter all the names, birthdays, and year
         | elected in a list to get the program to run - that's more
         | reasonable.
        
           | scarface_74 wrote:
           | The "o" models get the order right.
           | 
           | DeepSeek also gets the order right.
           | 
           | It doesn't show on the share link. But it actually outputs
           | the list correctly from the built in Python interpreter.
           | 
           | For some things, ChatGPT 4o will automatically use its Python
           | runtime
        
             | BeetleB wrote:
             | That some models get it right is irrelevant. In general, if
             | your instructions require _computation_, it's safer to
             | assume it won't get it right and will hallucinate.
        
               | scarface_74 wrote:
               | The reasoning models all do pretty good at math.
               | 
               | Have you tried them?
               | 
               | This is something I threw together with o3-mini
               | 
               | https://chatgpt.com/share/679d5305-5f04-8010-b5c4-61c31e7
               | 9b2...
               | 
               | ChatGPT 4o doesn't even try to do the math internally and
               | uses its built in Python interpreter. (The [_>] link is
               | to the Python code)
               | 
               | https://chatgpt.com/share/679d54fe-0104-8010-8f1e-9796a08
               | cf9...
               | 
               | DeepSeek handles the same problem just as well using the
               | reasoning technique.
               | 
               | Of course ChatGPT 4o went completely off the rails
               | without using its Python interpreter
               | 
               | https://chatgpt.com/share/679d5692-96a0-8010-8624-b1eb091
               | 270...
               | 
               | (The breakdown that it got right was using Python even
               | though I told it not to)
        
       | simonw wrote:
       | I just pushed a new release of my LLM CLI tool with support for
       | the new model and the reasoning_effort option:
       | https://llm.datasette.io/en/stable/changelog.html#v0-21
       | 
       | Example usage:
       | 
       |     llm -m o3-mini 'write a poem about a pirate and a walrus' \
       |       -o reasoning_effort high
       | 
       | Output (comparing that with the default reasoning effort):
       | https://github.com/simonw/llm/issues/728#issuecomment-262832...
       | 
       | (If anyone has a better demo prompt I'd love to hear about it)
        
         | beklein wrote:
         | Thank you for all the effort you put into this tool and keeping
         | it up to date!
        
         | theturtle32 wrote:
         | A reasoning model is not meant for writing poetry. It's not
         | very useful to evaluate it on such tasks.
        
           | mediaman wrote:
           | It's not clear that writing poetry is a bad use case.
           | Reasoning models seem to actually do pretty well with
           | creative writing and poetry. Deepseek's R1, for example, has
           | much better poem structure than the underlying V3, and
           | writers are saying R1 was the first model where they actually
           | felt like it was a useful writing companion. R1 seems to
           | think at length about word choice, correcting structure,
           | pentameter, and so on.
        
             | 1propionyl wrote:
             | Indeed. I would assume that a reasoning model would do far
             | better at things like actually maintaining meter or rhyme
             | scheme, something that models (even with good attention
             | mechanisms) generally do very poorly at.
        
           | mquander wrote:
           | I tried to tell my English teachers that all through high
           | school but it never worked.
        
           | aprilthird2021 wrote:
           | To be blunt, an AI isn't a good tool for writing poetry
           | either. At least, not the kind people read as a high
           | literature form. For commercials, jingles, Hallmark cards,
           | etc. sure
        
           | DonHopkins wrote:
           | There exists poetry that requires a lot of mathematical
           | understanding! This is "literally" (and I mean literally in
           | the literary sense) from a Stanislaw Lem story about an
           | electronic bard, translated from Polish by Michael Kandel:
           | 
           | https://www.donhopkins.com/home/catalog/lem/WonderfulPoems.h.
           | ..
           | 
           | Prompt:
           | 
           | A love poem, lyrical, pastoral, and expressed in the language
           | of pure mathematics. Tensor algebra mainly, with a little
           | topology and higher calculus, if need be. But with feeling,
           | you understand, and in the cybernetic spirit.
           | 
           | Response:
           | 
           |     Come, let us hasten to a higher plane,
           |     Where dyads tread the fairy fields of Venn,
           |     Their indices bedecked from one to n,
           |     Commingled in an endless Markov chain!
           |     Come, every frustum longs to be a cone,
           |     And every vector dreams of matrices.
           |     Hark to the gentle gradient of the breeze:
           |     It whispers of a more ergodic zone.
           | 
           |     In Riemann, Hilbert or in Banach space
           |     Let superscripts and subscripts go their ways.
           |     Our asymptotes no longer out of phase,
           |     We shall encounter, counting, face to face.
           | 
           |     I'll grant thee random access to my heart,
           |     Thou'lt tell me all the constants of thy love;
           |     And so we two shall all love's lemmas prove,
           |     And in our bound partition never part.
           | 
           |     For what did Cauchy know, or Christoffel,
           |     Or Fourier, or any Boole or Euler,
           |     Wielding their compasses, their pens and rulers,
           |     Of thy supernal sinusoidal spell?
           | 
           |     Cancel me not -- for what then shall remain?
           |     Abscissas, some mantissas, modules, modes,
           |     A root or two, a torus and a node:
           |     The inverse of my verse, a null domain.
           | 
           |     Ellipse of bliss, converse, O lips divine!
           |     The product of our scalars is defined!
           |     Cyberiad draws nigh, and the skew mind
           |     cuts capers like a happy haversine.
           | 
           |     I see the eigenvalue in thine eye,
           |     I hear the tender tensor in thy sigh.
           |     Bernoulli would have been content to die,
           |     Had he but known such a squared cosine 2 phi!
           | 
           | From The Cyberiad, by Stanislaw Lem.
           | 
           | Translated from Polish by Michael Kandel.
           | 
           | Here's a previous discussion of Marcin Wichary's translation
           | of one of Lem's stories from Polish to English. He created
           | the Lem Google Doodle, and he stalked and met Stanislaw Lem
           | when he was a boy. Plus a discussion of Michael Kandel's
           | translation of the poetry of the Electric Bard from The First
           | Sally of Cyberiad, comparing it to machine translation:
           | 
           | https://news.ycombinator.com/item?id=28600200
           | 
           | Stanislaw Lem has finally gotten the translations his genius
           | deserves:
           | 
           | https://www.washingtonpost.com/entertainment/books/stanislaw.
           | ..
           | 
           | >Lem's fiction is filled with haunting, prescient landscapes.
           | In these reissued and newly issued translations -- some by
           | the pitch-perfect Lem-o-phile, Michael Kandel -- each
           | sentence is as hard, gleaming and unpredictable as the next
           | marvelous invention or plot twist. It's hard to keep up with
           | Lem's hyper-drive of an imagination but always fun to try.
        
       | hybrid_study wrote:
       | Is this version a prank?
        
         | RivieraKid wrote:
         | No.
        
       | simonw wrote:
       | I used o3-mini to summarize this thread so far. Here's the
       | result:
       | https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae...
       | 
       | For 18,936 input, 2,905 output it cost 3.3612 cents.
       | 
       | Here's the script I used to do it:
       | https://til.simonwillison.net/llms/claude-hacker-news-themes...
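       | 
       | That figure matches the o3-mini API rates as I understand them,
       | $1.10 per million input tokens and $4.40 per million output
       | tokens:
       | 
       |     # Recompute the cost from the token counts above.
       |     input_tokens, output_tokens = 18_936, 2_905
       |     cost = (input_tokens * 1.10 + output_tokens * 4.40) / 1e6
       |     print(f"{cost * 100:.4f} cents")  # 3.3612 cents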
        
         | layman51 wrote:
         | I noticed that it thought that GoatInGrey wrote "openai is no
         | longer relevant." However, they were just quoting a different
         | user (buyucu) who was the person who first wrote that.
        
           | simonw wrote:
           | Good catch. That's likely an artifact of the way I flatten
           | the nested JSON from the comments API.
           | 
           | I originally did that to save on tokens but modern models
           | have much larger input windows so I may not need to do that
           | any more.
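           | 
           | The flattening itself is simple; a minimal sketch, assuming
           | the Algolia-style item JSON with "author", "text" and
           | "children" fields:
           | 
           |     def flatten(item, depth=0, out=None):
           |         # Walk the nested comment tree depth-first, keeping
           |         # each comment's author attached to its own text.
           |         if out is None:
           |             out = []
           |         if item.get("text"):
           |             out.append((depth, item.get("author"), item["text"]))
           |         for child in item.get("children", []):
           |             flatten(child, depth + 1, out)
           |         return out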
        
         | threecheese wrote:
         | I haven't tried o3, but one issue I struggle with in large
         | context analysis tasks is the LLMs are never thorough. In a
         | task like this thread summarization, I typically need to break
         | the document down and loop through chunks to ensure it actually
         | "reads" everything. I might have had to recurse into individual
         | conversations with some small max-depth and leaf count and run
         | inference on each, and then have some aggregation at the end,
         | otherwise it would miss a lot (or appear to, based on the
         | output).
         | 
         | Is this a case of PEBKAC?
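         | 
         | By "break the document down and loop" I mean roughly this
         | sketch, where summarize() stands in for whatever model call you
         | use and the chunk size is arbitrary:
         | 
         |     def chunks(text, max_chars=8000):
         |         return [text[i:i + max_chars]
         |                 for i in range(0, len(text), max_chars)]
         | 
         |     def thorough_summary(document, summarize):
         |         # Summarize each chunk, then aggregate the partials.
         |         partials = [summarize("Summarize this part:\n" + c)
         |                     for c in chunks(document)]
         |         return summarize("Combine these partial summaries:\n"
         |                          + "\n".join(partials))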
        
           | syntaxing wrote:
           | Depending on what you're trying to do, it's worth trying the
           | 1M context Qwen models. They only released 7B and 14B so
           | their "intelligence" is limited, but they should be more than
           | capable of a coherent summary.
        
           | andrewci wrote:
           | Are there any tools you use to do this chunking? Or is this a
           | custom workflow? I've noticed the same thing both on
           | copy/paste text and uploaded documents when using the LLM
           | chat tools.
        
           | sdesol wrote:
           | > I haven't tried o3, but one issue I struggle with in large
           | context analysis tasks is the LLMs are never thorough.
           | 
           | o3 does look very promising with regards to large context
           | analysis. I used the same raw data and ran the same prompt as
           | Simon for GPT-4o, GPT-4o mini and DeepSeek R1 and compared
           | their output. You can find the analysis below:
           | 
           | https://beta.gitsense.com/?chat=46493969-17b2-4806-a99c-5d93.
           | ..
           | 
           | The o3-mini model was quite thorough. With reasoning models,
           | it looks like dealing with long context might have gotten a
           | lot better.
           | 
           | Edit:
           | 
           | I was curious if I could get R1 to be more thorough and got
           | the following interesting tidbits.
           | 
           | - Depth Variance: R1 analysis provides more technical
           | infrastructure insights, while o3-mini focuses on developer
           | experience
           | 
           | - Geopolitical Focus: Only R1 analysis addresses China-West
           | tensions explicitly
           | 
           | - Philosophical Scope: R1 contains broader industry meta-
           | commentary absent in o3-mini
           | 
           | - Contrarian Views: o3-mini dedicates specific section to
           | minority opinions
           | 
           | - Temporal Aspects: R1 emphasizes future-looking questions,
           | o3-mini focuses on current implementation
           | 
           | You can find the full analysis at
           | 
           | https://beta.gitsense.com/?chat=95741f4f-b11f-4f0b-8239-83c7.
           | ..
        
           | mvkel wrote:
           | o1-pro is incredibly good at this. You'll be amazed
        
           | scarface_74 wrote:
           | Try Google's NotebookLM
        
         | breakingcups wrote:
         | It's definitely making some errors
        
           | kandesbunzler wrote:
           | Like?
        
             | aprilthird2021 wrote:
             | Even though it was told that it MUST quote users directly,
             | it still outputs:
             | 
             | > It's already a game changer for many people. But to have
             | so many names like o1, o3-mini, GPT-4o, & GPT-4o-mini
             | suggests there may be too much focus on internal tech
             | details rather than clear communication." (paraphrase based
             | on multiple similar sentiments)
             | 
             | It also hallucinates quotes.
             | 
             | For example:
             | 
             | > "I'm pretty sure 'o3-mini' works better for that purpose
             | than 'GPT 4.1.3'." - TeMPOraL
             | 
             | But that comment is not in the user TeMPOraL's comment
             | history.
             | 
             | Sentiment analysis is also faulty.
             | 
             | For example:
             | 
             | > "I'd bet most users just 50/50 it, which actually makes
             | it more remarkable that there was a 56% selection rate." -
             | jackbrookes - This quip injects humor into an otherwise
             | technical discussion about evaluation metrics.
             | 
             | It's not a quip though. That comment was meant in earnest
        
               | romanhn wrote:
               | That's funny, the quote exists, but it got the user
               | wrong.
        
         | Eduard wrote:
         | 3.3612 cents (I guess USD cents) is expensive!
        
           | BoorishBears wrote:
           | Same immediate thought: the free option I provide on my
           | production site is a model that runs on 2xA40. That's 96GB of
           | VRAM for 78 cents an hour serving at least 4 or 5 concurrent
           | requests at any given time.
           | 
           | O3 Mini is probably not a very large model and OpenAI has
           | layers upon layers of efficiencies, so they must be making an
           | absolute _killing_ charging 3.3 cents for a few seconds of
           | compute.
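           | 
           | (Back-of-envelope, in Python - every number is an
           | assumption from this comment plus a guessed ~1 minute per
           | request, not a measurement:)
           | 
           |     gpu_cost_per_hr = 0.78   # 2x A40, 96 GB VRAM
           |     concurrent = 4           # simultaneous requests
           |     secs_per_request = 60    # assumed average latency
           |     req_per_hr = concurrent * 3600 / secs_per_request
           |     self_hosted = gpu_cost_per_hr / req_per_hr
           |     print(f"${self_hosted:.4f} vs $0.033 per request")
           |     # -> $0.0033 vs $0.033, roughly a 10x margin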
        
         | largbae wrote:
         | Thanks for sharing this! And no apparent self-awareness! OpenAI
         | has come a long way from the Sydney days:
         | https://answers.microsoft.com/en-us/bing/forum/all/this-ai-c...
        
         | Valakas_ wrote:
         | For those who like simpler ways (although dependent on
         | Google), NotebookLM does all that in 2 clicks. And you can
         | ask it questions about the content; references are provided.
        
           | simonw wrote:
            | After you've run my hn-summary.sh script you can ask
            | follow-up questions like this:
            | 
            |     llm -c "did anyone talk about pricing?"
        
         | tkgally wrote:
         | Borrowing most of Simon's prompt, I tried the following with
         | o3-mini-high in the chat interface with Search turned on:
         | 
         | "Summarize the themes of the opinions expressed in discussions
         | on Hacker News on January 31 and February 1, 2025, about
         | OpenAI's release od [sic] ChatGPT o3-mini. For each theme,
         | output a header. Include direct "quotations" (with author
         | attribution) where appropriate. You MUST quote directly from
         | users when crediting them, with double quotes. Fix HTML
         | entities. Go long. Include a section of quotes that illustrate
         | opinions uncommon in the rest of the piece"
         | 
         | The result is here:
         | 
         | https://chatgpt.com/share/679d790d-df6c-8011-ad78-3695c2e254...
         | 
         | Most of the cited quotations seem to be accurate, but at least
         | one (by uncomplexity_) does not appear in the named commenter's
         | comment history.
         | 
         | I haven't attempted to judge how accurate the summary is. Since
         | the discussions here are continuing at this moment, this
         | summary will be gradually falling out of date in any case.
        
         | thousand_nights wrote:
         | good morning!
        
       | jajko wrote:
       | A random idea - train one of those models on _you_, keep it
       | aside, let it somehow work out your intricacies, moods, details,
       | childhood memories, personality, flaws, strengths. Methods could
       | vary - an initial dump of social networks, personal photos and
       | videos, maybe some intense conversation to grok a rough you,
       | then polish over time.
       | 
       | A first step to digital immortality; could be a nice startup
       | selling a personalized product for the rich, and later even
       | regular folks. Immortality not in ourselves as meat bags of
       | course, we die regardless, but a digital copy and memento that
       | our children can use if feeling lonely and carry with them
       | anywhere, or that later descendants consult out of curiosity.
       | One could even 'invite' long lost ancestors to massive events
       | like weddings. Maybe your great-great-grandfather would be a
       | cool guy you could easily click with these days via verbal
       | input. Heck, even a 3D detailed model.
       | 
       | An additional, 'perpetually' paid service - keeping your data
       | model safe, taking care of it, backups, maybe even giving it a
       | bit of computing power to receive current news in some light
       | fashion and evolve - could be an extra. Different tiers for
       | different levels of service and care.
       | 
       | Or am I a decade or two ahead? I can see this as universally
       | interesting across many if not all cultures.
        
       | silverlake wrote:
       | O3-mini solved this prompt. DeepSeek R1 had a mental breakdown.
       | The prompt: "Bob is facing forward. To his left is Ann, to his
       | right is Cathy. Ann and Cathy are facing backwards. Who is on
       | Ann's left?"
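       | 
       | A minimal sketch for replaying this prompt yourself, using the
       | OpenAI Python SDK (the model name is from the announcement;
       | everything else here is an assumption about your setup):
       | 
       |     # pip install openai; assumes OPENAI_API_KEY is set
       |     from openai import OpenAI
       | 
       |     client = OpenAI()
       |     prompt = ("Bob is facing forward. To his left is Ann, to "
       |               "his right is Cathy. Ann and Cathy are facing "
       |               "backwards. Who is on Ann's left?")
       |     resp = client.chat.completions.create(
       |         model="o3-mini",
       |         messages=[{"role": "user", "content": prompt}],
       |     )
       |     # per the thread, the expected answer is Bob
       |     print(resp.choices[0].message.content)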
        
         | modeless wrote:
         | R1 or R1-Distill? They are not the same thing. I think DeepSeek
         | made a mistake releasing them at the same time and calling them
         | all R1.
         | 
         | Full R1 solves this prompt easily for me.
        
           | silverlake wrote:
           | I used R1 hosted at NVidia here:
           | https://build.nvidia.com/deepseek-ai/deepseek-r1/modelcard
        
             | SparkyMcUnicorn wrote:
             | Same. With the recommended settings, it got it right. I
             | regenerated a bunch of times, and it did suggest Cathy once
             | or twice.
             | 
             | R1 70b also got it right just as many times for me.
        
             | modeless wrote:
             | Huh, that one got it wrong for me too. I don't have
             | patience to try it 10 times each to see if it was a
             | coincidence, but it is absolutely true that not all
             | implementations of LLMs produce the same outputs. It is in
             | fact common for subtle bugs to happen that cause the
             | outputs to be worse but not catastrophically bad, and
             | therefore go unnoticed. So I wouldn't trust any
             | implementation but the original for benchmarking or even
             | general use unless I tested it extensively.
        
         | mark_l_watson wrote:
         | Thanks for that example! Using 'ollama run' these models all
         | failed: deepseek-r1:32b, mistral-small:latest, qwq:latest, and
         | qwen2.5:14b.
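         | 
         | (A sketch of scripting that same test against a local Ollama
         | server - endpoint and payload shape per Ollama's standard
         | REST API, model tags as listed above:)
         | 
         |     import requests
         | 
         |     PROMPT = ("Bob is facing forward. To his left is Ann, "
         |               "to his right is Cathy. Ann and Cathy are "
         |               "facing backwards. Who is on Ann's left?")
         |     MODELS = ["deepseek-r1:32b", "mistral-small:latest",
         |               "qwq:latest", "qwen2.5:14b"]
         |     for m in MODELS:
         |         r = requests.post(
         |             "http://localhost:11434/api/generate",
         |             json={"model": m, "prompt": PROMPT,
         |                   "stream": False},
         |             timeout=600,
         |         )
         |         print(m, "->", r.json()["response"])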
        
         | lynguist wrote:
         | Really interesting puzzle. I tried the different Chatgpts and
         | they all solved it. I tried Deepseek and Deepseek R1, and
         | Deepseek did have this meltdown:
         | 
         | Let's break down the problem step by step to understand the
         | relationships and positions of Bob, Ann, and Cathy.
         | 
         | 1. Understanding the Initial Setup
         | 
         |     Bob is facing forward.
         |         This means Bob's front is oriented in a particular
         |         direction, which we'll consider as the reference
         |         point for "forward."
         |     To his left is Ann, to his right is Cathy.
         |         If Bob is facing forward, then:
         |             Ann is positioned to Bob's left.
         |             Cathy is positioned to Bob's right.
         |     Ann and Cathy are facing backwards.
         |         Both Ann and Cathy are oriented in the opposite
         |         direction to Bob. If Bob is facing forward, then Ann
         |         and Cathy are facing backward.
         | 
         | 2. Visualizing the Positions
         | 
         | To better understand the scenario, let's visualize the
         | positions:
         | 
         |     Forward Direction: |
         | 
         |     Bob (facing forward) | | Ann (facing backward) | / | / |
         |     / | / | / | / |/ |
         | 
         | And then only the character | on a new line, forever.
        
         | Synaesthesia wrote:
         | Deepseek solved it.
        
         | thinkalone wrote:
         | That's a fun, simple test! I tried a few models, and mistral-
         | nemo gets it every time, even when run locally without any
         | system prompt! https://build.nvidia.com/nv-mistralai/mistral-
         | nemo-12b-instr...
        
         | sivakon wrote:
         | deepseek answered it.
         | 
         | https://leaflet.pub/63ae4881-d726-4388-9ba3-8a5f86947443
        
       | EternalFury wrote:
       | o1-preview, o1, o1-mini, o3-mini, o3-mini (low), o3-mini
       | (medium), o3-mini (high)...
       | 
       | What's next?
       | 
       | o4-mini (wet socks), o5-Eeny-meeny-miny-moe?
       | 
       | I thought they had a product manager over there.
       | 
       | They only need 2 names, right? ChatGPT and o.
       | 
       | ChatGPT-5 and o4 would be next.
       | 
       | This multiplication of the LLM loaves and fishes is kind of
       | silly.
        
       | mark_l_watson wrote:
       | Oh, sweet: both o3-mini low and high support integrated web
       | search. No integrated web search with o1.
       | 
       | I prefer, for philosophical reasons, open weight and open
       | process/science models, but OpenAI has done a very good job at
       | productizing ChatGPT. I also use their 4o-mini API because it is
       | cheap and compares well to using open models on Groq Cloud. I
       | really love running local models with Ollama, but the API
       | vendors keep the price so low that I understand most people not
       | wanting the hassle of running DeepSeek-R1, etc., locally.
        
       | swyx wrote:
       | for those interested, updated my o3-mini price chart to compare
       | the cost-intelligence frontier with deepseek:
       | https://x.com/swyx/status/1885432031896887335
        
       | bix6 wrote:
       | They use the word reasoning a lot in the post. Is this reasoning
       | or statistical prediction?
        
       | jokoon wrote:
       | Does that mean I can use this on my recent gaming AMD gpu?
        
       | rednafi wrote:
       | The most important detail for me was that in coding, it's weaker
       | than 4o and stronger than o1-mini. So I don't have much use for
       | it.
        
         | usaar333 wrote:
         | Where are you reading that?
        
       | xmichael909 wrote:
       | So can I ditch the $200 a month o1 pro account, and go back to
       | the $20 account with o3-mini?
        
         | genidoi wrote:
         | With o1 pro you're paying for unlimited compute that you don't
         | get with $20 + capped o1.
        
       | modeless wrote:
       | Initial vibes are not living up to the hype. It fails my pet
       | prompt, and the Cursor devs say they still prefer Sonnet[1]. I'm
       | sure it will have its uses but it is not going to dominate.
       | 
       | [1] https://x.com/cursor_ai/status/1885415392677675337
        
       | diegocg wrote:
       | I hope ChatGPT reconsiders the naming of their models some
       | time. I have trouble deciding which model is the one I should
       | use.
        
         | esafak wrote:
         | They release models too often for a new one to be better at
         | everything, so you have to pick the right one for your task.
        
           | niek_pas wrote:
           | And that's exactly where good, recognizable branding comes
           | in.
        
       | badgersnake wrote:
       | 56% is pretty close to 'don't give a toss'
        
       | mvdtnz wrote:
       | Wake up honey a new lie generator just dropped.
        
       | sourcecodeplz wrote:
       | Even for free users, that is nice
        
       | vok wrote:
       | Well, o3-mini-high just successfully found the root cause of a
       | seg fault that o1 missed: mistakenly using _mm512_store_si512
       | (which requires 64-byte-aligned addresses) for an unaligned
       | store that should have been _mm512_storeu_si512.
        
         | nextworddev wrote:
         | rip development jobs /s.. or not /s
        
         | throw83288 wrote:
         | How do I avoid the angst about this stuff as a student in
         | computer science? I love this field but frankly I've been at a
         | loss since the rapid development of these models.
        
           | jumploops wrote:
           | LLMs are the new compilers.
           | 
           | As a student, you should continue to focus on fundamentals,
           | but also adapt LLMs into your workflow where you can.
           | 
           | Skip writing the assembly (now curly braces and semicolons),
           | and focus on what the software you're building actually does,
           | who it serves, and how it works.
           | 
           | Programming is both changing a lot, and not at all. The
           | mechanics may look different, but the purpose is still the
           | same: effectively telling computers what to do.
        
           | abdullahkhalids wrote:
           | As a former prof. What you should be learning from any STEM
           | degree (and many other degrees as well) is to think clearly,
           | rigorously, creatively, and with discipline, etc. You also
           | need to learn the skill of learning content and skills
           | quickly.
           | 
           | The specific contents or skills of your degree don't matter
           | that much. In pretty much any STEM field, over the last
           | 100ish years, whatever you learned in your undergraduate was
           | mostly irrelevant by the time you retired.
           | 
           | Everyone got by, by staying on top of the new developments in
           | the field and doing them. With AI, the particular skills
           | needed to use the power of computers to do things in the
           | world have changed. Just learn those skills.
        
           | danparsonson wrote:
           | For all the value that they bring, there is still a good dose
           | of parlour tricks and toy examples around, and they need an
           | intelligent guiding hand to get the best out of them. As a
           | meat brain, you can bring big picture design skills that the
           | bots don't have, keeping them on track to deliver a coherent
           | codebase, and fixing the inevitable hallucinations. Think of
           | it like having a team of optimistic code monkeys with
           | terrible memory, and you as the coordinator. I would focus on
           | building skills in things like software design/architecture,
           | requirements gathering (what do people want and how do you
           | design software to deliver it?), in-depth hardware knowledge
           | (how to get the best out of your platform), good API design,
           | debugging, etc. Leave the CRUD to the robots and be the
           | brain.
        
           | mhh__ wrote:
           | It's either over, or giving a lot of idiots false confidence
           | -- I meet people somewhat regularly who believe they don't
           | really need to know what they're doing any more. This is
           | probably an arbitrage.
        
           | raincole wrote:
           | Angst?
           | 
            | It just means you're less likely to be fixing someone
            | else's "mistakenly used _mm512_store_si512 instead of
            | _mm512_storeu_si512" error, because AI fixed it for you
            | and you can focus on other parts of computer science.
            | Computer science surely isn't just fixing
            | _mm512_store_si512.
        
         | jiocrag wrote:
         | why is this impressive at all? It effectively amounts to
         | correcting a typo.
        
       | sandos wrote:
       | How many benchmarks for LLMs are there out there?
       | 
       | Is there any evidence of over-fitting on benchmarks, or are
       | there truly hidden parts to them?
        
       | ern wrote:
       | I haven't bothered with o3 mini, because who wants an "inferior"
       | product? I was using 4o as a "smarter Google" until DeepSeek
       | appeared (although its web search is being hammered now and I'm
       | just using Google).
       | 
       | o1 seems to have been neutered in the last week: lots of
       | disclaimers and butt-covering in its responses.
       | 
       | I also had an annoying discussion with o1 about the DC plane
       | crash... it doesn't have web access and its cutoff is 2024, so I
       | don't expect it to know about the crash. However, after saying such
       | an event is extremely unlikely and being almost patronisingly
       | reassuring, it treated pasted news articles and links (which to
       | be sure, it can't access) as "fictionalized", instead of
       | acknowledging its own cut-off date, and that it could have been
       | wrong. In contrast DeepSeek (with web search turned off) was less
       | dismissive of the risks in DC airspace, and more aware of its own
       | knowledge cut-off.
       | 
       | Coupled with the limited number of o1 responses for ChatGPT Plus,
       | I've cancelled my subscription for now.
        
       | danielovichdk wrote:
       | I read this as a full on marketing note targeted towards software
       | developers.
        
       | profsummergig wrote:
       | Can someone please share the logic behind their version naming
       | convention?
        
       | cyounkins wrote:
       | I switched an agent from Sonnet V2 to o3-mini (default medium
       | mode) and got strangely poor results: only calling 1 tool at a
       | time despite being asked to call multiple, not actually doing any
       | work, and reporting that it did things it didn't
        
       | binary132 wrote:
       | Not really impressed by the answers I just got.
        
       | sshh12 wrote:
       | I built a silly political simulation game with this:
       | https://state.sshh.io/
       | 
       | https://github.com/sshh12/state-sandbox
        
       | n0id34 wrote:
       | Is AI fizzling out, or is it just me? I feel like they're
       | trying to smash out new models as fast as they can but in
       | reality they're barely
       | any different, it's turning into the smartphone market. New
       | iPhone with a slightly better camera and slightly differently
       | bevelled edges, get it NOW! But doesn't actually do anything
       | better than the iPhone 6.
       | 
       | Claude, GPT 4 onwards, and DeepSeek all feel the same to me. Okay
       | to a point, then kinda useless. More like a more convenient
       | specialised Google that you need to double check the results of.
        
         | nextworddev wrote:
         | on the contrary, it's accelerating since they unlocked a new
         | paradigm of scaling
        
         | lordofgibbons wrote:
         | Boiling frog. The advances are happening so rapidly, but
         | incrementally, that it's not being registered. It just seems
         | like the normal state.
         | 
         | Compare LLMs from a year or two ago with the ones out today
         | on practically any task. It's a night-and-day difference.
         | 
         | This is especially so when you start taking into account
         | these "reasoning" models. It's mind-blowing how much better
         | they are than "non-reasoning" models for tasks like planning
         | and coding.
         | 
         | https://aider.chat/docs/leaderboards/#aider-polyglot-benchma...
        
       | Alifatisk wrote:
       | Any comparison with other models yet?
        
       | antirez wrote:
       | Just tested two complicated coding tasks, and surprisingly
       | o3-mini-high nailed them while Sonnet 3.5 failed. Will do more
       | tests tomorrow.
        
       | lenerdenator wrote:
       | No self-host, no care.
        
       | AutistiCoder wrote:
       | the o3-mini model would be useful to me if coding's the only
       | thing I need to do in a chat log.
       | 
       | When I use ChatGPT these days, it's to help me write coding
       | videos and then the social media posts around those videos. So
       | that's two specialties in one chat log.
        
       | anotherpaulg wrote:
       | For AI coding, o3-mini scored similarly to o1 at 10X less cost on
       | the aider polyglot benchmark [0]. This comparison was with both
       | models using high reasoning effort. o3-mini with medium effort
       | scored in between R1 and Sonnet.
       | 
       |     62%  $186  o1 high
       |     60%   $18  o3-mini high
       |     57%    $5  DeepSeek R1
       |     54%    $9  o3-mini medium
       |     52%   $14  Sonnet
       |     48%    $0  DeepSeek V3
       | 
       | [0] https://aider.chat/docs/leaderboards/
        
         | throw83288 wrote:
         | What do you expect to come from full o3 in terms of automating
         | software engineering?
        
           | nextworddev wrote:
           | o3 (high) might score 80%+
        
         | stavros wrote:
         | Do you have plans to try o3-mini-high as the architect and
         | Sonnet as the model?
        
       | mvkel wrote:
       | I've been using cursor since it launched, sticking almost
       | exclusively to claude-3.5-sonnet because it is incredibly
       | consistent, and rarely loses the plot.
       | 
       | As subsequent models have been released, most of which claim to
       | be better at coding, I've switched cursor to it to give them a
       | try.
       | 
       | o1, o1-pro, deepseek-r1, and the now o3-mini. All of these models
       | suffer from the exact same "adhd." As an example, in a NextJS
       | app, if I do a composer prompt like "on page.tsx [15 LOC], using
       | shadcn components wherever possible, update this page to have a
       | better visual hierarchy."
       | 
       | sonnet nails it almost perfectly every time, but suffers from
       | some date cutoff issues like thinking that shadcn-ui@latest is
       | the repo name.
       | 
       | Every single other model, doesn't matter which, does the
       | following: it starts writing (from scratch), radix-ui components.
       | I will interrupt it and say "DO NOT use radix-ui, use shadcn!" --
       | it will respond with "ok!" then begin writing its own components
       | from scratch, again not using shadcn.
       | 
       | This is still problematic with o3-mini.
       | 
       | I can't believe it's the models. It must be the instruction-set
       | that cursor is giving it behind the scenes, right? No amount of
       | .cursorrules, or other instruction, seems to get cursor "locked
       | in" the way sonnet just seems to be naturally.
       | 
       | It sucks being stuck on the (now ancient) sonnet, but
       | inexplicably, it remains the only viable coding option for me.
       | 
       | Has anyone found a workaround?
        
         | jvanderbot wrote:
         | I've found cursor to be too thin a wrapper. Aider is somehow
         | significantly more functional. Try that.
        
           | dhc02 wrote:
           | Aider, with o1 or R1 as the architect and Claude 3.5 as the
           | implementer, is so much better than anything you can
           | accomplish with a single model. It's pretty amazing. Aider is
           | at least one order of magnitude more effective for me than
           | using the chat interface in Cursor. (I still use Cursor for
           | quick edits and tab completions, to be clear).
        
             | ChadNauseam wrote:
             | I normally use aider by just typing in what I want and it
             | magically does it. How do I use o1 or R1 to play the role
             | of the "architect"?
        
               | macNchz wrote:
                | You can start it with something like:
                | 
                |     aider --architect --model o1 --editor-model sonnet
               | 
               | Then you'll be in "architect" mode, which first prompts
               | o1 to design the solution, then you can accept it and
               | allow sonnet to actually create the diffs.
               | 
               | Most of the time your way works well--I use sonnet alone
               | 90% of the time, but the architect mode is really great
               | at getting it unstuck when it can't seem to implement
               | what I want correctly, or keeps fixing its mistakes by
               | making things worse.
        
               | cruffle_duffle wrote:
               | I really want to see how apps created this way scale to
               | large codebases. I'm very skeptical they don't turn into
               | spaghetti messes.
               | 
               | Coding is basically just about the most precise way to
               | encapsulate a problem as a solution possible. Taking a
               | loose English description and expanding it into piles of
               | code is always going to be pretty leaky no matter how
               | much these models spit out working code.
               | 
               | In my experience you have to pay a lot of attention to
               | every single line these things write because they'll
               | often change stuff or more often make wrong assumptions
               | that you didn't articulate. And in my experience they
               | _never_ ask you questions unless you specifically prompt
               | them to (and keep reminding them to), which means they
               | are doing a hell of a lot of design and implementation
               | that unless carefully looked over will ultimately be
               | wrong.
               | 
               | It really reminds me a bit of when Ruby on Rails came out
               | and the blogosphere was full of gushing "I've never been
               | more productive in my life" posts. And then you find out
               | they were basically writing a TODO app and their previous
               | development experience was doing enterprise Java for some
               | massive non-tech company. Of course RoR will be a breath
               | of fresh air for those people.
               | 
               | Don't get me wrong I use cursor as my daily driver but I
               | am starting to find the limits for what these things can
               | do. And the idea of having two of these LLM's taking some
               | paragraph long feature description and somehow chatting
               | with each other to create a scalable bit of code that
               | fits into a large or growing codebase... well I find that
               | kind of impossible. Sure the code compiles and conforms
               | to whatever best practices are out there but there will
                | be absolutely no consistency across the app--especially at
               | the UX level. These things simply cannot hold that kind
               | of complexity in their head and even if they could part
               | of a developers job is to translate loose English into
               | code. And there is much, much, much, much more to that
               | than simply writing code.
        
               | macNchz wrote:
               | I see what you're saying and I think that terming this
               | "architect" mode has an implication that it's more
               | capable than it really is, but ultimately this two model
               | pairing is mostly about combining disparate abilities to
               | separate the "thinking" from the diff generation. It's
               | very effective in producing better results for a single
               | prompt, but it's not especially helpful for
               | "architecting" a large scale app.
               | 
               | That said, in the hands of someone who is competent at
               | assembling a large app, I think these tools can be
               | incredibly powerful. I have a business helping companies
               | figure out how/if to leverage AI and have built a bunch
               | of different production LLM-backed applications _using_
               | LLMs to write the code over the past year, and my
               | impression is that there is very much something there.
               | Taking it step by step, file by file, like you might if
               | you wrote the code yourself, describing your concept of
               | the abstractions, having a few files describing the
               | overall architecture that you can add to the chat as
               | needed--little details make a big difference in the
               | results.
        
               | tribeca18 wrote:
               | I use Cursor and Composer in agent mode on a daily basis,
               | and this is basically exactly what happened to me.
               | 
               | After about 3 weeks, things were looking great - but lots
                | of spaghetti code was put together, and it never told me
               | what I didn't know. The data & state management
               | architecture I had written was simply just not
               | maintainable (tons of prop drilling, etc). Over time, I
               | basically learned common practices/etc and I'm finding
               | that I have to deal with these problems myself. (how it
               | used to be!)
               | 
               | We're getting close - the best thing I've done is create
               | documentation files with lots of descriptions about the
               | architecture/file structure/state
               | management/packages/etc, but it only goes so far.
               | 
               | We're getting closer, but for right now - we're not there
               | and you have to be really careful with looking over all
               | the changes.
        
             | dwaltrip wrote:
             | I haven't tried aider in quite a while, what does it mean
             | to use one model as an architect and another as the
             | implementer?
        
               | Terretta wrote:
               | _Aider now has experimental support for using two models
               | to complete each coding task:_
               | 
               |  _- An Architect model is asked to describe how to solve
               | the coding problem._
               | 
               |  _- An Editor model is given the Architect's solution and
               | asked to produce specific code editing instructions to
               | apply those changes to existing source files._
               | 
               |  _Splitting up "code reasoning" and "code editing" in
               | this manner has produced SOTA results on aider's code
               | editing benchmark. Using o1-preview as the Architect with
               | either DeepSeek or o1-mini as the Editor produced the
                | SOTA score of 85%. Using the Architect/Editor approach
               | also significantly improved the benchmark scores of many
               | models, compared to their previous "solo" baseline scores
               | (striped bars)._
               | 
               | https://aider.chat/2024/09/26/architect.html
        
               | lukas099 wrote:
               | Probably gonna show a lot of ignorance here, but isn't
               | that a big part of the difference between our brains and
               | AI? That instead of one system, we are many systems that
               | are kind of sewn together? I secretly think AGI will just
               | be a bunch of different specialized AIs working together.
        
               | Terretta wrote:
               | You're in good company in that secret thought.
               | 
               | Have a look at this:
               | https://en.wikipedia.org/wiki/Society_of_Mind
        
             | aledalgrande wrote:
             | Same with Cline
        
         | zackproser wrote:
         | Not trying to be snarky, but the example prompt you provided is
         | about 1/15th the length and detail of prompts I usually send
         | when working with Cursor.
         | 
         | I tend to exhaustively detail what I want, including package
         | names and versions because I've been to that movie before...
        
           | jwpapi wrote:
            | Yes, this works well for me too - better to take your
            | time and get the first prompt right.
        
           | inerte wrote:
           | What works nice also is the text to speech. I find it easier
           | and faster to give more context by talking rather than
           | typing, and the extra content helps the AI to do its job.
           | 
           | And even though the speech recognition fails a lot on some of
           | the technical terms or weirdly named packages, software, etc,
           | it still does a good job overall (if I don't feel like
           | correcting the wrong stuff).
           | 
           | It's great and has become somewhat of a party trick at work.
           | Some people don't even use AI to code that often, and when I
           | show them "hey have you tried this?" And just _tell_ the
           | computer what I want? Most folks are blown away.
        
             | cadence- wrote:
             | Does the Cursor have text-to-speech functionality?
        
             | fud101 wrote:
             | you mean speech to text right?
        
               | chefandy wrote:
               | Not for me. I first ask Advanced Voice to read me some
               | code and have Siri listen and email it to an API I wrote
               | which uses Claude to estimate the best cloud provider to
                | run that code based on its requirements, and then an
                | n8n script deploys it and sends me the results via
                | twilio.
        
           | mvkel wrote:
           | My point was that a prompt that simple could be held and
           | executed very well by sonnet, but all other models
           | (especially reasoning models) crash and burn.
           | 
           | It's a 15 line tsx file so context shouldn't be an issue.
           | 
           | Makes me wonder if reasoning models are really proper models
           | for coding in existing codebases
        
             | liamwire wrote:
             | Your last point matches what I've seen some people
             | (simonw?) say they're doing currently: using aider to work
             | with two models--one reasoning model as an architect, and
             | one standard LLM as the actual coder. Surprisingly, the
             | results seem pretty good vs. putting everything on one
             | model.
        
               | mvkel wrote:
               | This is probably the right way to think about it. O1-pro
               | is an absolute monster when it comes to architecture. It
               | is staggering the breadth and depth that it sees. Ask it
               | to actually implement though, and it trips over its
               | shoelaces almost immediately.
        
               | goosejuice wrote:
               | Can you give an example of this monstrous capability you
               | speak of? What have you used it for professionally w.r.t.
                | architecture?
        
           | crooked-v wrote:
            | If you have to write a prompt that long, it'll be faster
            | to just write the code.
        
             | aprilthird2021 wrote:
             | Shocking to see this because this was essentially the
             | reason most of the previous no code solutions never took
             | off...
        
           | esperent wrote:
            | That sounds exhausting. Wouldn't it be faster to include
            | your package.json in the context?
           | 
           | I sometimes do this (using Cline), plus create a .cline file
           | at project root which I refine over time and which describes
           | both the high level project overview, details of the stack
           | I'm using, and technical details I want each prompt to
           | follow.
           | 
           | Then each actual prompt can be quite short: _read files x, y,
           | and z, and make the following changes..._ where I keep the
           | changes concise and logically connected - basically what I
           | might do for a single pull request.
        
           | hombre_fatal wrote:
           | You're basically saying you write 15x the prompt for the same
           | result they get with sonnet.
        
         | kace91 wrote:
         | My experience with cursor and sonnet is that it is relatively
         | good at first tries, but completely misses the plot during
         | corrections.
         | 
         | "My attempt at solving the problem contains a test that fails?
         | No problem, let me mock the function I'm testing, so that,
         | rather than actually run, it returns the expected value!"
         | 
         | It keeps doing that kind of shenanigans, applying modifications
         | that solve the newly appearing problem while screwing the
         | original attempt's goal.
         | 
         | I usually get much better results from regular chatgpt copying
         | and pasting, the trouble being that it is a major pain to
         | handle the context window manually by pasting relevant info and
         | reminding what I think is being forgotten.
        
           | jwpapi wrote:
           | Yes it's usually worth it to try to write a really good first
           | prompt
        
             | earleybird wrote:
             | More than once I've found myself going down this 'little
             | maze of twisty passages, all alike'. At some point I stop,
             | collect up the chain of prompts in the conversation, and
             | curate them into a net new prompt that should be a bit
             | better. Usually I make better progress - at least for a
             | while.
        
               | dr_dshiv wrote:
               | Why is it so hard to share/find prompts or distill my own
               | damn prompts? There must be good solutions for this --
        
               | whall6 wrote:
               | Don't outsource the only thing left for our brains to do
               | themselves :/
        
               | garfij wrote:
               | What do you find difficult about distilling your own
               | prompts?
               | 
               | After any back and forth session I have reasonably good
               | results asking something like "Given this workflow, how
               | could I have prompted this better from the start to get
               | the same results?"
        
               | SamPatt wrote:
               | This becomes second nature after a while. I've developed
               | an intuition about when a model loses the plot and when
               | to start a new thread. I have a base prompt I keep for
               | the current project I'm working on, and then I ask the
               | model to summarize what we've done in the thread and
               | combine them to start anew.
               | 
               | I can't wait until this is a solved problem because it
               | does slow me down.
        
           | hahajk wrote:
           | Can't you select Chatgpt as the model in cursor?
        
             | kace91 wrote:
             | Yes, but for some reason it seems to perform worse there.
             | 
             | Perhaps whatever algorithms Cursor uses to prepare the
             | context it feeds the model are a good fit for Claude but
             | not so much for the others (?). It's a random guess, but
             | whatever the reason, there's a weird worsening of
             | performance vs pure chat.
        
             | electroly wrote:
             | Yes but every model besides claude-3.5-sonnet sucks in
             | Cursor, for whatever reason. They might as well not even
             | offer the other models. The other models, even "smarter"
             | models, perform vastly poorer or don't support agent
             | capability or both.
        
           | delichon wrote:
           | Claude makes a lot of crappy change suggestions, but when you
           | ask "is that a good suggestion?" it's pretty good at judging
           | when it isn't. So that's become standard operating procedure
           | for me.
           | 
           | It's difficult to avoid Claude's strong bias for being
           | agreeable. It needs more HAL 9000.
        
             | 4b11b4 wrote:
              | I'm always asking Claude to propose a variety of
              | suggestions for the problem at hand and their trade-offs,
              | then to evaluate them and explain why it picked the top
              | three proposals. Then I'll pick one of them and further
              | vet the idea.
        
             | esperent wrote:
             | > when you ask "is that a good suggestion?" it's pretty
             | good at judging when it isn't
             | 
             | Basically a poor man's COT.
        
           | mathieuh wrote:
           | Hah, I was trying it the other day in a Go project and it did
           | exactly the same thing. I couldn't believe my eyes, it
           | basically rewrote all the functions back out in the test file
           | but modified slightly so the thing that was failing wouldn't
           | even run.
        
           | sheepscreek wrote:
           | For my advanced use case involving Python and knowledge of
           | finance, Sonnet fared poorly. Contrary to what I am reading
           | here, my favorite approach has been to use o1 in agent mode.
           | It's an absolute delight to work with. It is like I'm working
           | with a capable peer, someone at my level.
           | 
           | Sadly there are some hard limits on o1 with Cursor and I
           | cannot use it anymore. I do pay for their $20/month
           | subscription.
        
             | electroly wrote:
             | > o1 in agent mode
             | 
             | How? It specifically tells me this is unsupported: "Agent
             | composer is currently only supported using Anthropic models
             | or GPT-4o, please reselect the model and try again."
        
         | energy123 wrote:
         | Context length possibly. Prompt adherence drops off with
         | context, and anything above 20k tokens is pushing it. I get the
         | best results by presenting the smallest amount of context
         | possible, including removing comments and main methods and
         | functions that it doesn't need to see. It's a bit more work
         | (not _that_ much if you have a script that does it for you),
         | but the results are worth it. You could test in the chatgpt app
         | (or lmarena direct chat) where you ask the same question but
         | with minimal hand curated context, and see if it makes the same
         | mistake.
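         | 
         | (The script can be trivial - e.g. a hypothetical helper that
         | drops blank lines and full-line comments before files get
         | pasted in as context:)
         | 
         |     import sys
         |     from pathlib import Path
         | 
         |     def shrink(source: str) -> str:
         |         """Drop blank lines and full-line # comments."""
         |         kept = []
         |         for line in source.splitlines():
         |             s = line.strip()
         |             if s and not s.startswith("#"):
         |                 kept.append(line)
         |         return "\n".join(kept)
         | 
         |     for path in sys.argv[1:]:
         |         print(f"# --- {path} ---")
         |         print(shrink(Path(path).read_text()))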
        
           | mvkel wrote:
           | If it's a context issue, it's an issue with how cursor itself
           | sends the context to these reasoning LLMs.
           | 
           | Context alone shouldn't be the reason that sonnet succeeds
           | consistently, but others (some which have even bigger context
           | windows) fail.
        
             | energy123 wrote:
             | Yes, that's what I'm suggesting. Cursor is spamming the
             | models with too much context, which harms reasoning models
             | more than it harms non-reasoning models (hypothesis, but
             | one that aligns with my experience). That's why I
             | recommended testing reasoning models outside of Cursor with
             | a hand curated context.
             | 
             | The advertised context length being longer doesn't
             | necessarily map 1:1 with the actual ability the models have
             | to perform difficult tasks over that full context. See for
             | example the plots of performance on ARC vs context length
             | for o-series models.
        
         | esperent wrote:
         | Claude uses Shadcn-ui extensively in the web interface, to the
         | point where I think it's been trained to use it over other UI
         | components.
         | 
         | So I think you got lucky and you're asking it to write using a
         | very specific code library that it's good at, because it
         | happens to use it for its main userbase on the web chat
         | interface.
         | 
         | I wonder if you were using a different component library, or
         | using Svelte instead of React, would you still find Claude the
         | best?
        
         | MaxLeiter wrote:
         | We've been working on solving a lot of these issues with v0.dev
         | (disclaimer: shadcn and I work on it). We do a lot of pre and
         | post-processing to ensure LLMs output valid shadcn code.
         | 
         | We're also talking to the cursor/windsurf/zed folks on how we
         | can improve Next.js and shadcn in the editors (maybe something
         | like llms.txt?)
        
           | mvkel wrote:
           | Thanks for all the work you do! v0 is magical. I absolutely
           | love the feature where I can add a chunky component that v0
           | made to my repo with npx
        
         | kristopolous wrote:
         | "not" and other function words; _usually_ work fine today but
         | if I 'm having trouble, the best thing to do is probably be
         | inclusive, not exclusive.
        
         | eagleinparadise wrote:
         | Cursor is also very user-unfriendly in providing alternative
         | models to use in composer (agent). There's a heavy reliance on
         | Anthropic for Cursor.
         | 
         | Try using Gemini thinking with Cursor. It barely works. Cmd-k
         | outputs the thinking into the code. It's unusable in chat
         | because the formatting sucks.
         | 
         | Is there some relationship between Cursor and Anthropic, i
         | wonder. Plenty of other platforms seem very eager to give users
         | model flexibility, but Cursor seems to be lacking.
         | 
         | I could be wrong, just an observation.
        
         | OkGoDoIt wrote:
         | I also have been less impressed by o1 in cursor compared to
         | sonnet 3.5. Usually what I will do for a very complicated
         | change is ask o1 to architect it, specifically asking it to
         | give me a detailed plan for how it would be implemented, but
         | not to actually implement anything. I then change the model to
         | Sonnet 3.5 to have it actually do the implementation.
         | 
         | And on the side of not being able to get models to understand
         | something specific, there's a place in a current project where
         | I use a special Unicode apostrophe during some string parsing
         | because a third-party API needs it. But any code modifications
         | by the AI to that file always replace it with a standard ascii
         | apostrophe. I even added a comment on that line to the effect
         | of "never replaced this apostrophe, it's important to leave it
         | exactly as it is!" And also put that in my cursor rules, and
         | sometimes directly in the prompt as well, but it always
         | replaces it even for completely unrelated changes. I've had to
         | manually fix it like 10 times in the last day, it's
         | infuriating.
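         | 
         | (For the curious, the two characters most likely involved -
         | the exact code point is my guess, but U+2019 is the usual
         | culprit:)
         | 
         |     fancy = "\u2019"  # ’ RIGHT SINGLE QUOTATION MARK
         |     plain = "'"       # U+0027, the ASCII apostrophe
         |     print(fancy == plain)          # False
         |     print(ord(fancy), ord(plain))  # 8217 39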
        
         | foobiekr wrote:
         | Have you tried any of the specialty services like Augment? I am
         | curious if they are any better or just snake oil.
        
         | chrismsimpson wrote:
         | I've coded in many languages over the years but reasonably new
         | to the TS/JS/Next world.
         | 
          | I've found if you give your prompts a kind of long-form
          | "stream of consciousness", where you outline snippets of
          | code in markdown along with contextual notes and then
          | summarise/outline at the end what you actually wish to
          | achieve, you can get great results.
         | 
          | Think long-form, single-page "documentation" type prompts
         | that alternate between written copy/contextual
         | intent/description and code blocks. Annotating code blocks with
         | file names above the blocks I'm sure helps too. Don't waste
         | your context window on redundant/irrelevant information or
         | code, stating a code sample is abridged or adding commented
         | ellipses seems to do the job.
        
           | twilightfringe wrote:
           | ha! good to confirm! I tend to do this, just kind of as a
           | double-check thing, but never sure if it actually worked or
           | if it was a placebo, lol.
           | 
           | Or end with "from the user's perspective: all the "B"
           | elements should light up in excitement when you click "C""
        
           | mvkel wrote:
           | Going to try this! Thanks for the tip
        
           | d357r0y3r wrote:
           | By the time I've fully documented and explained what I want
           | to be done, and then review the result, usually finding that
           | it's worse than what I would have written myself, I end up
           | questioning my instinct to even reach for this tool.
           | 
           | I like it for general refactoring and day to day small tasks,
           | but anything that's relatively domain-specific, I just can't
           | seem to get anything that's worth using.
        
             | noahbp wrote:
             | Like most AI tools, great for beginners, time-savers for
             | intermediate users, and frequently a waste of time in
             | domains where you're an expert.
             | 
             | I've used Cursor for shipping better frontend slop, and
             | it's great. I skip a lot of trial and error, but not all of
             | it.
        
               | d357r0y3r wrote:
               | I agree that it's amazing as a learning tool. I think the
               | "time to ramp" on a new technology or programming
               | language has probably been cut in half or more.
        
               | epolanski wrote:
                | > and frequently a waste of time in domains where you're
               | an expert.
               | 
               | I'm a domain expert and I disagree.
               | 
               | There's many scenarios where using LLMs pays off.
               | 
                | E.g. a long file or very long function is just that,
                | and an LLM is faster at understanding it whole, not
                | being limited in how many things you can track in your
                | mind at once (between 4 and 6). It's still gonna be
                | faster at refactoring it and testing it than you are.
        
         | Abishek_Muthian wrote:
         | Just curious, did you try a code model like Codestral instead
         | of a MoE?
        
         | bugglebeetle wrote:
         | o3 mini's date cut-off is 2023, so it's unfortunately not gonna
         | be useful for anything that requires knowledge of recent
         | framework updates, which includes probably all big frontend
         | stuff.
        
         | hombre_fatal wrote:
         | I have the same experience. Just today I was integrating a new
         | logging system with my kubernetes cluster.
         | 
         | I tried out the OP model to make changes to my yaml files. It
         | would give short snippets and I'd have to keep trial and
         | erroring its suggestions.
         | 
         | Eventually I pasted the original prompt to Claude and it one-
         | shot the dang thing with perfect config. Made me wonder why I
         | even try new models.
        
         | pknerd wrote:
         | OT: How many tokens are being consumed? How much are you paying
         | for Claude APIs?
        
         | harshitaneja wrote:
         | So I think I finally understood recently why we have these
         | divergent groups, with one thinking Claude 3.5 Sonnet is the
         | best model for coding and another that follows the OpenAI
         | SOTA at any given moment. I have been a heavy user of
         | ChatGPT, jumping on to Pro without even thinking for more
         | than a second once it was released. Recently though, I took a
         | pause from my usual work on statistical modelling, heuristics
         | and other things in certain deep domains to focus on building
         | client APIs and frontends, decided to again give Claude a
         | try, and it is just so great to work with for this use case.
         | 
         | My hypothesis is it's a difference of what you are doing.
         | OpenAI o models are much better than others at mathematical
         | modelling and such tasks, and Claude at more general purpose
         | programming.
        
           | mycall wrote:
           | Have you used multi-agent chat sessions, with each fielding
           | their own specialities, to see if that improves your use
           | cases (aka MoE)?
        
         | digitcatphd wrote:
         | The reality is I suspect one will use different models for
         | different things. Think of it like having different modes of
         | transportation.
         | 
         | You might use your scooter, bike, car, jet - depending on the
         | circumstances. A bike was invented 100 years ago? But it may be
         | the best in the right use case. We'd still be using DaVinci
         | for some things because we haven't bothered swapping it out
         | and it works fine.
         | 
         | For me - the value of R1/o3 is visible logic that provides an
         | analysis that can be critiqued by Sonnet 3.5
        
       | energy123 wrote:
       | How do I disable the LLM-summarized thought traces that get
       | spammed into my chat window with o3-mini-high?
       | 
       | Very annoying having to manually press the "^" to hide the
       | verbose thought traces _every single question I ask_; it
       | totally breaks flow.
        
       | simonw wrote:
       | Now that the dust is settling a little bit, I have published my
       | notes so far on o3-mini here:
       | https://simonwillison.net/2025/Jan/31/o3-mini/
       | 
       | To save you the click: I think the most interesting things about
       | this model are the price - less than half that of GPT-4o while
       | being better for many things, most notably code - and the
       | increased length limits.
       | 
       | 200,000 tokens input and 100,000 output (compared to 128k/16k for
       | GPT-4o and just 8k for DeepSeek R1 and Claude 3.5 on output)
       | could open up some interesting new applications, especially at
       | that low price.
        
       | zora_goron wrote:
       | Does anyone know how "reasoning effort" is implemented
       | technically - does this involve differences in the pre-training,
       | RL, or prompting phases (or all)?
        
       | Havoc wrote:
       | 5 hours in, 500-odd comments. Definitely feels like this has
       | less wow factor than previous OAI releases.
        
       | jiocrag wrote:
       | This is.... underwhelming.
        
       | mohsen1 wrote:
       | It's funny because I asked it to fix my script to show
       | _deepseek_'s chain of thought in the script but it refuses to
       | answer hahaha
        
       | catigula wrote:
       | It's actually a bit comforting that it isn't very good.
        
       | waynecochran wrote:
       | I just had it convert Swift code to Kotlin and was surprised at
       | how the comment was translated. It "knew" the author of the
       | paper and what it was doing!? That is wild.
       | 
       | Swift:
       | 
       |     //
       |     // Double Reflection Algorithm from Table I (page 7)
       |     // in Section 4 of https://tinyurl.com/yft2674p
       |     //
       |     for i in 1 ..< N {
       |         let X1 = spine[i]
       |         ...
       | 
       | Kotlin:
       | 
       |     // Use the Double Reflection Algorithm (from Wang et al.)
       |     // to compute subsequent frames.
       |     for (i in 1 until N) {
       |         val X1 = Vector3f(spine[i])
       |         ...
        
         | smallerize wrote:
         | Wow, haven't seen a viglink in a while.
        
       | prompt_overflow wrote:
       | Plot twist:
       | 
       | 1. they are trying to obfuscate deepseek's success
       | 
       | 2. they are trying to confuse you. the benchmark margins are
       | minimal (and meaningless)
       | 
       | 3. they are trying to buy time (with investors), releasing
       | nothing-special models on a predictable schedule (jan -> o3,
       | feb -> o3-pro-max, march -> o7-ultra, and in 2026 -> OMG! we've
       | reached singularity! (after spending $500B))
       | 
       | -
       | 
       | And at the end of the day, nothing changes for me and neither
       | for you. enjoy your time away from this sick ai hype. bruh!
        
       | zone411 wrote:
       | It scores 72.4 on NYT Connections, a significant improvement over
       | the o1-mini (42.2) and surpassing DeepSeek R1 (54.4), but it
       | falls short of the o1 (90.7).
       | 
       | (https://github.com/lechmazur/nyt-connections/)
        
       | Mr_Bees69 wrote:
       | Can't wait till deepseek gets their hands on this
        
       | aussieguy1234 wrote:
       | Just gave it a go using open-webui.
       | 
       | One immediate difference I noticed is that o3-mini actually
       | observes the system prompt you set. So if I say it's a Staff
       | Engineer at Google, it'll stay in character.
       | 
       | That was not possible with o1-mini, it ignored system prompts
       | completely.
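       | 
       | (Roughly what open-webui issues under the hood - a sketch with
       | the standard chat completions shape; for o-series models the
       | system message is reportedly mapped to the newer "developer"
       | role:)
       | 
       |     from openai import OpenAI
       | 
       |     client = OpenAI()
       |     resp = client.chat.completions.create(
       |         model="o3-mini",
       |         messages=[
       |             {"role": "system",
       |              "content": "You are a Staff Engineer at Google."},
       |             {"role": "user",
       |              "content": "Review this service design."},
       |         ],
       |     )
       |     print(resp.choices[0].message.content)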
        
       | revskill wrote:
       | Models should be better at clarifying the prompt before actually
       | spamming with bad answers.
        
       | energy123 wrote:
       | This might be the best publicly available model for coding:
       | 
       | https://livebench.ai/#/?Coding=as
        
       ___________________________________________________________________
       (page generated 2025-02-01 08:00 UTC)