[HN Gopher] OpenAI O3-Mini
___________________________________________________________________
OpenAI O3-Mini
Author : johnneville
Score : 693 points
Date : 2025-01-31 19:08 UTC (12 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| brcmthrowaway wrote:
| Gamechanger?
| aspect0545 wrote:
| It's always a game changer, isn't it?
| colonelspace wrote:
| Yep.
|
| And groundbreaking.
| maeil wrote:
| _It changes the landscape with its multifaceted approach_.
| sss111 wrote:
| This time, it is a face-saving release, especially because Azure,
| AWS, and OpenRouter have started offering DeepSeek
| 42lux wrote:
| Is it AGI yet?
| sss111 wrote:
| So far, it seems like this is the hierarchy
|
| o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini
|
| o3 mini system card: https://cdn.openai.com/o3-mini-system-
| card.pdf
| forrestthewoods wrote:
| OpenAI needs a new branding scheme.
| airstrike wrote:
| ChatGPT Series X O one
| nejsjsjsbsb wrote:
| The Llama folk know how. Good old 90s version scheme.
| losvedir wrote:
| What about "o1 Pro mode". Is that just o1 but with more
| reasoning time, like this new o3-mini's different amount of
| reasoning options?
| MichaelBurge wrote:
| o1-pro is a different model than o1.
| JohnPrine wrote:
| I don't think this is true
| losvedir wrote:
| Are you sure? Do you have any source for that? In this
| article[0] that was discussed here on HN this week, they
| say (claim):
|
| > In fact, the O1 model used in OpenAI's ChatGPT Plus
| subscription for $20/month is basically the same model as
| the one used in the O1-Pro model featured in their new
| ChatGPT Pro subscription for 10x the price ($200/month,
| which raised plenty of eyebrows in the developer
| community); the main difference is that O1-Pro thinks for a
| lot longer before responding, generating vastly more COT
| logic tokens, and consuming a far larger amount of
| inference compute for every response.
|
| Granted "basically" is pulling a lot of weight there, but
| that was the first time I'd seen anyone speculate either
| way.
|
| [0] https://youtubetranscriptoptimizer.com/blog/05_the_shor
| t_cas...
| bobjordan wrote:
| I have been paying $200 per month for o1-pro mode and I am
| very disappointed right now because they have completely
| replaced the model today. It used to think for 1-5 minutes
| and deliver an unbelievably useful one-shot answer. Now, it
| only thinks for 7 seconds just like the o3-mini model and I
| can't tell the difference in the answers. I hope this is just
| a day 1 implementation bug but I suspect they have just
| decided to throw the $200 per month customers under the bus
| so that they'd have more capacity to launch the o3 model for
| everybody. I can't tell the difference between the models now
| and it is definitely not because the free o3 model delivers
| the quality that o1-pro mode had! I'm so disappointed!
| sho_hn wrote:
| I think OpenAI really needs to rethink its product naming,
| especially now that they have a portfolio where there's no such
| clear hierarchy; the models sit at different points along
| different axes (speed, cost, reasoning, capabilities, etc.).
|
| Your summary attempt e.g. also misses o3-mini vs o3-mini-high.
| Lots of trade-offs.
| echelon wrote:
| It's like AWS SKU naming (`c5d.metal`, `p5.48xlarge`, etc.),
| except non-technical consumers are expected to understand it.
| maeil wrote:
| Have you seen Azure VM SKU naming? It's.. impressive.
| buildbot wrote:
| And it doesn't even line up with the actual instances
| you'll be offered. At one point I was using some random
| Nvidia A10 node that was supposed to be similar to
| Standard_NV36adms_A10_v5, but was an NC series for some
| reason with slightly different letters...
| nejsjsjsbsb wrote:
| Those are not names but hashes used to look up the specs.
| echelon wrote:
| I was thinking we might treat model names analogously,
| but their specs can be moving targets.
| sss111 wrote:
| Yeah I tried my best :(
|
| I think they could've borrowed a page out of Apple's book,
| even mountain names would be better. Plus Sonoma, Ventura,
| and Yosemite are cool names.
| rf15 wrote:
| They're strongly tied to Microsoft, so confusing branding is
| to be expected.
| chris_va wrote:
| One of my favorite parodies:
| https://www.youtube.com/watch?v=EUXnJraKM3k
| ANewFormation wrote:
| I can't wait for Project Unify which just devolves into a
| brand new p3-mini type naming convention. It's pretty much
| identical to the o3-mini, except the API is changed just
| enough to be completely incompatible and it crashes on any
| query using a word with more than two syllables. Fix coming
| soon, for 4 years so far.
|
| On the bright side the app now has curved edges!
| ngokevin wrote:
| It needs to be clowned on here:
|
| - Xbox, Xbox 360, Xbox One, Xbox One S/X, Xbox Series S/X
|
| - Windows 3.1...98, 2000, ME, XP, Vista, 7, 8, 10
|
| I guess it's better than headphones names (QC35,
| WH-1000XM3, M50x, HD560s).
| nejsjsjsbsb wrote:
| Flashbacks of the .NET zoo. At least they reined that in.
| kaaskop wrote:
| Yeah their naming scheme is super confusing, I honestly
| confuse them all the time.
| Euphorbium wrote:
| They can still do models o3o, oo3 and 3oo. Mini-o3o-high, not
| to be confused with mini-O3o-high (the first o is capital).
| brookst wrote:
| You're thinking too small. What about o10, O1o, o3-m1n1?
| margalabargala wrote:
| They should just start encoding the model ID in trinary
| using o, O, and 0.
|
| Model 00oOo is better than Model 0OoO0!
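|
| A quick decoder sketch (mapping 0/o/O to the trinary digits
| 0/1/2 is my own assumption, in Python):
|
|     # decode a model name like "00oOo" as a base-3 number
|     DIGITS = {"0": 0, "o": 1, "O": 2}
|
|     def model_id(name: str) -> int:
|         value = 0
|         for ch in name:
|             value = value * 3 + DIGITS[ch]
|         return value
|
|     print(model_id("00oOo"))  # 16
|     print(model_id("0OoO0"))  # 69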
| nullpoint420 wrote:
| Can't wait for the eventual rename to GPT Core, GPT Plus, GPT
| Pro, and GPT Pro Max models!
|
| I can see it now:
|
| > Unlock our industry leading reasoning features by upgrading
| to the GPT 4 Pro Max plan.
| wlll wrote:
| I think I'll wait for the GTI model myself.
| raphman wrote:
| Oh, I'll probably wait for GPT 4 Pro Max v2 NG (improved)
| FridgeSeal wrote:
| OpenAI chatGPT Pro Max XS Core, not to be confused with
| ChatGPT Max S Pro Net Core X, or ChatGPT Pro Max XS
| Professional CoPilot Edition.
| jmkni wrote:
| ngl I'd find that easier to follow lol
| kugelblitz wrote:
| Had the same problem while trying to decide which Roborock
| device to get. There's the S series, Saros series, Q Series
| and the Qrevo. And from the Qrevo, there's Qrevo Curv,
| Edge, Slim, Master, MaxV, Plus, Pro, S and without
| anything. The S Series had S8, S8+, S8 Pro Ultra, S8 Max
| Ultra, S8 MaxV Ultra. It was so confusing.
| __MatrixMan__ wrote:
| Careful what you wish for. Next thing you know they're going
| to have names like Betsy and be full of unique quirky
| behavior to help remind us that they're different people.
| LoveMortuus wrote:
| How would the DeepSeek fit into this?
|
| Or can it not compare? I don't know much about this stuff, but
| I've heard recently many people talk about DeepSeek and how
| unexpected it was.
| sss111 wrote:
| Deepseek V3 is equivalent to 4o. Deepseek R1 is equivalent to
| o1 (if not better)
|
| I think someone should just build an AI model-comparison
| website at this point. Include all benchmarks and pricing.
| jsk2600 wrote:
| This one is good: https://artificialanalysis.ai/
| withinboredom wrote:
| Looks like this only compares commercial models, and not
| the ones I can download and actually run locally.
| TuxSH wrote:
| https://livebench.ai/#/
|
| My experience is as follows:
|
| - "Reason" toggle just got enabled for me as a free tier
| user of ChatGPT's webchat. Apparently this is o3-mini - I
| have Copilot Pro (offered to me for free), which
| apparently has o1 too (as well as Sonnet, etc.)
|
| From my experience DeepSeek R1 (webchat) is more
| expressive, more creative and its writing style is
| leagues better than OpenAI's models, however it under-
| performs Sonnet when changing code ("code completion").
|
| Comparison screenshots for prompt "In C++, is a reference
| to "const C" a "const reference to C"?":
| https://imgur.com/a/c-is-reference-to-const-c-const-
| referenc...
|
| tl;dr keep using Claude for code and DeepSeek webchat for
| technical questions
| dutchbookmaker wrote:
| I had resubscribed to use o1 2 weeks ago and haven't even
| logged in this week because of R1.
|
| One thing I notice that is huge is being able to see the
| chain of thought lets me see when my prompt was lacking and
| the model is a bit confused on what I want.
|
| If I was any more impressed with R1 I would probably start
| getting accused of being a CCP shill or wumao lol.
|
| With that said, I think it is very hard to compare models
| for your own use case. I do suspect there is a shiny new
| toy bias with all this too.
|
| Poor Sonnet 3.5. I have neglected it so much lately I
| actually don't know if I have a subscription or not right
| now.
|
| I do expect an Anthropic reasoning model though to blow
| everything else away.
| gundmc wrote:
| If this is the hierarchy, why does 4o score so much higher than
| o1 on LLM Arena?
|
| Worrisome for OpenAI that Gemini's mini/flash reasoning model
| outscores both o1 and 4o handily.
| crazysim wrote:
| Is it possible people are voting for speed of responsiveness
| too?
| kgeist wrote:
| I suspect people on LLM Arena don't ask complex questions
| too often, and reasoning models seem to perform worse than
| simple models when the goal is just casual conversation or
| retrieving embedded knowledge. Reasoning models probably
| 'overthink' in such cases. And they're slower, too.
| energy123 wrote:
| o1 on LLM Arena often times out (network error) while
| thinking. But they still allow you to vote and they make it
| seem as if your vote is registered.
| ActVen wrote:
| I really wish they would open up the reasoning effort toggle on
| o1 API. o1 Pro Mode is still the best overall model I have used
| for many complex tasks.
| bobjordan wrote:
| Have you tried the o1-pro mode model today, because now it
| sucks!
| ALittleLight wrote:
| That seems very bad. What's the point of a new model that's
| worse than 4o? I guess it's cheaper in the API and a bit better
| at coding - but, this doesn't seem compelling.
|
| With DeepSeek, I heard OpenAI saying the plan was to move to
| releasing models only when they were meaningfully better than
| the competition. Seems like what we're getting instead is
| scheduled releases that are worse than the current versions.
| thegeomaster wrote:
| It's quite a bit better at coding --- they hint that it can
| tie o1's performance for coding, which already benchmarks
| higher than 4o. And it's significantly cheaper, and
| presumably faster. I believe API costs account for the vast
| majority of COGS at most of today's AI startups, so they
| would be very motivated to switch to a cheaper model that has
| similar performance.
| mgens wrote:
| Right. For large-volume requests that use reasoning this
| will be quite useful. I have a task that requires the LLM
| to convert thousands of free-text statements into SQL
| select statements, and o3-mini-high is able to get many of
| the more complicated ones that GPT-4o and Sonnet 3.5 failed
| at. So I will be switching this task to either o3-mini or
| DeepSeek-R1.
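|
| A minimal sketch of that kind of batch task with the OpenAI
| Python SDK (the schema and prompts here are illustrative, not
| my actual ones):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the env
|
|     SYSTEM = (
|         "Translate the user's request into one SQL SELECT "
|         "statement against orders(id, customer_id, total, "
|         "created_at). Reply with SQL only."
|     )
|
|     def to_sql(statement: str) -> str:
|         resp = client.chat.completions.create(
|             model="o3-mini",
|             messages=[
|                 {"role": "developer", "content": SYSTEM},
|                 {"role": "user", "content": statement},
|             ],
|         )
|         return resp.choices[0].message.content
|
|     print(to_sql("total sales per customer last month"))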
| usaar333 wrote:
| For non-stem perhaps.
|
| For math/coding problems, o3-mini is tied with, if not better
| than, o1.
| koakuma-chan wrote:
| You cannot compare GPT-4o and o*(-mini) because GPT-4o is not a
| reasoning model.
| lxgr wrote:
| Sure you can. "Reasoning" is ultimately an implementation
| detail, and the only thing that matters for capabilities is
| results, not process.
| koakuma-chan wrote:
| By "reasoning" I meant the fact that o*(-mini) does "chain-
| of-thought", in other words, it prompts itself to "reason"
| before responding to you, whereas GPT-4o(-mini) just
| directly responds to your prompt. Thus, it is not
| appropriate to compare o*(-mini) and GPT-4o(-mini) unless
| you implement "chain-of-thought" for GPT-4o(-mini) and
| compare that with o*(-mini). See also:
| https://docs.anthropic.com/en/docs/build-with-
| claude/prompt-...
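|
| A rough sketch of what that fairer comparison could look
| like, prompting GPT-4o-mini to "reason" first (prompt wording
| is illustrative, along the lines of the linked guide):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     COT = (
|         "Think step by step inside <thinking> tags, then "
|         "give your final answer inside <answer> tags."
|     )
|
|     resp = client.chat.completions.create(
|         model="gpt-4o-mini",
|         messages=[
|             {"role": "system", "content": COT},
|             {"role": "user", "content":
|              "How many r's are in strawberry?"},
|         ],
|     )
|     print(resp.choices[0].message.content)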
| wordpad25 wrote:
| That's like saying you can't compare a sedan to a truck.
|
| Sure you can.
|
| Even though one is more appropriate for certain tasks
| than the other.
| dutchbookmaker wrote:
| It is a nuanced point but what is better, a sedan or a
| truck? I think we are still at that stage of the
| conversation so it doesn't make much sense.
|
| I do think it is a good metaphor for how all this shakes
| out though in time.
| thot_experiment wrote:
| at least if i ran the company you'd know that
|
| ChatGPTasdhjf-final-final-use_this_one.pt > ChatGPTasdhjf-
| final.pt > ChatGPTasdhjf.pt > ChatGPTasd.pt> ChatGPT.pt
| idonotknowwhy wrote:
| Did you just ls my /workspace dir? Lol
| withinboredom wrote:
| yeah, you can def tell they are partnered with Microsoft.
| lumost wrote:
| I actually switched back from o1-preview to GPT-4o due to
| tooling integration and web search. I find that more often than
| not, the ability of GPT-4o to use these tools outweighs o1's
| improved accuracy.
| singularity2001 wrote:
| No, the reasoning models should not be compared directly with
| the normal models: they often take 10 times as long to
| answer, which only makes sense for difficult questions.
| vincentpants wrote:
| Wow, it got to the top of the front page so fast! Weird!
| dang wrote:
| I took a quick look at the data and FWIW the votes look legit
| to me, if that's what you were wondering.
| throwaway314155 wrote:
| I'm fairly certain it was sarcasm.
| vincentpants wrote:
| It actually was what I was wondering. Thank you @Dang!
| Qwuke wrote:
| It did get 29 points in 3 minutes, which seems like a lot even
| for a fan favorite, but is also consistent with previous OpenAI
| announcements here.
| johnneville wrote:
| I posted a verge article first but then checked and saw the
| openai blog and posted that. I'd guess it's the officialness /
| domain that makes ppl click on this so easily.
| zurfer wrote:
| to be fair, I was waiting for this release the whole day
| pookieinc wrote:
| Can't wait to try this. What's amazing to me is that when this
| was revealed just one short month ago, the AI landscape looked
| very different than it does today with more AI companies jumping
| into the fray with very compelling models. I wonder how the AI
| shift has affected this release internally, future releases and
| their mindset moving forward... How does the efficiency change,
| the scope of their models, etc.
| echelon wrote:
| There's no moat, and they have to work even harder.
|
| Competition is good.
| wahnfrieden wrote:
| Collaboration is even better, per open source results.
|
| It is the closed competition model that's being left in the
| dust.
| lesuorac wrote:
| I really don't think this is true. OpenAI has no moat because
| they have nothing unique; they're mostly using other people's
| architectures (like Transformers) and other companies'
| hardware.
|
| Their value prop (moat) is that they've burnt more money than
| everybody else. That moat is trivially circumvented by
| lighting a larger pile of money and less trivially by
| lighting the pile more efficiently.
|
| OpenAI isn't the only company. The tech companies being
| massively beaten by Microsoft in number of H100s purchased are the
| ones with a moat. Google / Amazon with their custom AI chips
| are going to have a better performance per cost than others
| and that will be a moat. If you want to get the same
| performance per cost then you need to spend the time making
| your own chips which is years of effort (=moat).
| sangnoir wrote:
| > That moat is trivially circumvented by lighting a larger
| pile of money and less trivially by lighting the pile more
| efficently.
|
| DeepSeek has proven that the latter is possible, which
| drops a couple of river-crossing rocks into the moat.
| withinboredom wrote:
| The fact that I can basically run o1-mini with
| deepseek:8b, locally, is amazing. Even on battery power,
| it works acceptably.
| tmnvdb wrote:
| Those models are not comparable
| withinboredom wrote:
| hmmm... check the deepseek-r1 repo readme :) They compare
| them there, but it would be nice to have external
| benchmarks.
| brookst wrote:
| Brand is a moat
| cruffle_duffle wrote:
| Ask Jeeves and Altavista surely have something to say
| about that!
| geerlingguy wrote:
| Add Yahoo! to that list
| esafak wrote:
| Their brand is as tainted as Meta's, which was bad enough
| to merit a rebranding from Facebook.
| sumedh wrote:
| > That moat is trivially circumvented by lighting a larger
| pile of money and less trivially by lighting the pile more
| efficently.
|
| Google with all its money and smart engineers was not able
| to build a simple chat application.
| mianos wrote:
| But with their internal progression structure they can
| build and cancel eight mediocre chat apps.
| malaya_zemlya wrote:
| What do you mean? The Gemini app is available on iOS, Android
| and on the web (as AI Studio
| https://aistudio.google.com/).
| tmnvdb wrote:
| It is not very good though.
| aprilthird2021 wrote:
| Gemini is pretty good, And it does one thing way better
| than most other AI models, when I hold down my phone's
| home button it's available right away
| robrenaud wrote:
| It's a joke about how Google has
| released/cancelled/renamed many messenging apps.
| lukan wrote:
| "OpenAI has no moat because they have nothing unique"
|
| It seems they have high-quality training data. And the
| knowledge to work with it.
| aprilthird2021 wrote:
| They buy most of their data from Scale AI types. It's not
| any higher quality than is available to any other model
| farm
| lumost wrote:
| Capex was the theoretical moat, same as TSMC and similar
| businesses. DeepSeek poked a hole in this theory. OpenAI will
| need to deliver massive improvements to justify a 1 billion
| dollar training cost relative to 5 million dollars.
| usef- wrote:
| I don't know if you are, but a lot of people are still
| comparing one Deepseek training run to the entire costs of
| OpenAI.
|
| The deepseek paper states that the $5mil number doesn't
| include development costs, only the final training run. And
| it doesn't include the estimated $1.4billion cost of the
| infrastructure/chips Deepseek owns.
|
| Most of OpenAI's billion-dollar costs are in inference, not
| training. It takes a lot of compute to serve so many users.
|
| Dario said recently that Claude was in the tens of millions
| (and that it was a year earlier, so some cost decline is
| expected), do we have some reason to think OpenAI was so
| vastly different?
| lumost wrote:
| Anthropic's CEO was predicting billion-dollar training
| runs for 2025. Current training runs were likely in the
| tens or hundreds of millions of dollars USD.
|
| Inference capex costs are not a defensive moat, as I can
| rent GPUs and sell inference with linearly scaling costs. A
| hypothetical 10 billion dollar training run on
| proprietary data would be a massive moat.
|
| https://www.itpro.com/technology/artificial-
| intelligence/dol...
| dutchbookmaker wrote:
| It is still curious, though: what is actually being
| automated?
|
| I find huge value in these models as an augmentation of my
| intelligence and as a kind of cybernetic partner.
|
| I can't think of anything that can actually be automated,
| though, in terms of white-collar jobs.
|
| The white-collar test case I have in mind is a bank
| analyst under a bank operations manager. I have done both in
| the past, but there is something really lacking in the idea
| of the operations manager replacing the analyst with a
| reasoning model, even though DeepSeek right now annihilates
| the reasoning of every bank analyst I ever worked with.
|
| If you can't even arbitrage the average bank analyst, there
| might be these really non-intuitive no-AI-arbitrage
| conditions in white-collar work.
| gdhkgdhkvff wrote:
| I don't want to pretend I know how bank analysts work, but
| at the very least I would assume that 4 bank analysts with
| reasoning models would outperform 5 bank analysts without.
| patrickhogan1 wrote:
| I thought it was o3 that was released one month ago and
| received high scores on ARC Prize -
| https://arcprize.org/blog/oai-o3-pub-breakthrough
|
| If they were the same, I would have expected explicit
| references to o3 in the system card and how o3-mini is
| distilled or built from o3 - https://cdn.openai.com/o3-mini-
| system-card.pdf - but there are no references.
|
| Excited at the pace all the same. Excited to dig in. The model
| naming all around is so confusing. Very difficult to tell what
| breakthrough innovations occurred.
| nycdatasci wrote:
| Yeah - the naming is confusing. We're seeing o3-mini. o3
| yields marginally better performance given exponentially more
| compute. Unlike OpenAI, customers will not have an option to
| throw an endless amount of money at specific tasks/prompts.
| iamjackg wrote:
| I'm very interested in their Jailbreak evaluations: they're new
| to me. I might have missed previous mentions.
| Ninjinka wrote:
| 50 messages a day -> 150 messages a day for Plus and Team users
| cjbarber wrote:
| The interesting question to me is how far these reasoning models
| can be scaled. With another 12 months of compute scaling (for
| synthetic data generation and RL) how good will these models be
| at coding? I talked with Finbarr Timbers (ex-DeepMind) yesterday
| about this and his take is that we'll hit diminishing returns -
| not because we can't make models more powerful, but because we're
| approaching diminishing returns in areas that matter to users and
| that AI models may be nearing a plateau where capability gains
| matter less than UX.
| futureshock wrote:
| I think in a lot of ways we are already there. Users are
| clearly already having difficulty seeing which model is better
| or if new models are improving over old models. People go back
| to the same gotcha questions and get different answers based on
| the random seed. Even the benchmarks are getting very
| saturated.
|
| These models already do an excellent job with your homework,
| your corporate PowerPoints and your idle questions. At some
| point only experts would be able to decide if one response was
| really better than another.
|
| Our biggest challenge is going to be finding problem domains
| with low performance that we can still scale up to human
| performance. And those will be so niche that no one will care.
|
| Agents on the other hand still have a lot of potential. If you
| can get a model to stay on task with long context and remain
| grounded then you can start firing your staff.
| dagelf wrote:
| Don't underestimate how much the long tail means to the
| general public.
| xyzzy9563 wrote:
| How does this compare with DeepSeek in terms of quality of
| results and cost?
| reissbaker wrote:
| Probably a good idea to wait for external benchmarks like
| Aider, but my guess is it'll be somewhere between DeepSeek V3
| and R1 in terms of benchmarks -- R1 trades blows with o1-high,
| and V3 is somewhat lower -- but I'd expect o3-mini to be
| considerably faster. Despite the blog post saying paid users
| can access o3-mini today, I don't see it as an option yet in
| their UI... But IIRC when they announced o3-mini in December
| they claimed it would be similar to 4o in terms of overall
| latency, and 4o is much faster than V3/R1 currently.
| Synaesthesia wrote:
| Deepseek is the state of the art right now in terms of
| performance and output. It's really fast. The way it "explains"
| how it's thinking is remarkable.
| fpgaminer wrote:
| DeepSeek is great because: 1) you can run the model locally,
| 2) the research was openly shared, and 3) the reasoning
| tokens are open. It is not, in my experience, state of the
| art. In all of my side by side comparisons thus far in real
| world applications between DeepSeek V3 and R1 vs 4o and o1,
| the latter has always performed better. OpenAI's models are
| also more consistent, glitching out maybe one in 10,000,
| whereas DeepSeek's models will glitch out 1 in 20. OpenAI
| models also handle edge cases better and have a better
| overall grasp of user intentions. I've had DeepSeek's models
| consistently misinterpret prompts, or confuse data in the
| prompts with instructions. Those are both very important
| things that make DeepSeek useless for real world
| applications. At least without finetuning them, which then
| requires using those huge 600B parameter models locally.
|
| So it is by no means state of the art. Gemini Flash 2.0 also
| performs better than DeepSeek V3 in all my comparisons thus
| far. But Gemini Flash 2.0 isn't robust and reliable either.
|
| But as a piece of research, and a cool toy to play with, I
| think DeepSeek is great.
| Aperocky wrote:
| > which then requires using those huge 600B parameter
| models locally.
|
| Are you running the smaller models locally? It seems unfair
| to compare those against 4o and o1 behind OpenAI APIs.
| Synaesthesia wrote:
| I watched it complete pretty complicated tasks like "write
| a snake game in Python" and "write Tetris in Python"
| successfully. And the way it did it, with showing all the
| internal steps, I've never seen before.
|
| Watch here. https://www.youtube.com/watch?v=by9PUlqtJlM
| evertedsphere wrote:
| >developer messages
|
| looks like finally their threat model has been updated to take
| into account that the user might be too "unaligned" to be trusted
| with the ability to provide a system message of their own
| logicchains wrote:
| If their models ever fail to keep ahead of the competition in
| terms of smarts, users are going to ditch them en masse for a
| competitor that doesn't treat their users like their enemy.
| reissbaker wrote:
| ...I'm pretty sure they just renamed the key...
| buyucu wrote:
| why should anyone use this when deepseek is free/cheaper?
|
| openai is no longer relevant.
| ilaksh wrote:
| I don't think OpenAI is training on your data. At least they
| say they don't, and I believe that. I wouldn't be surprised if
| the NSA or something has access to the data if they request
| it, though.
|
| But DeepSeek clearly states in their terms of service that they
| can train on your API data or use it for other purposes. Which
| one might assume their government can access as well.
|
| We need direct eval comparisons between o3-mini and DeepSeek...
| Or, well, they are numbers, so we can look them up on
| leaderboards.
| csomar wrote:
| You can pay for the compute and be certain that no one is
| recording your data with DeepSeek.
| seinecle wrote:
| Yes but DeepSeek models can be accessed through the APIs of
| Cloudflare or GitHub, in which case no training on your data
| takes place.
| ilaksh wrote:
| True.
| lappa wrote:
| OpenAI clearly states that they train on your data
| https://help.openai.com/en/articles/5722486-how-your-data-
| is...
| lemming wrote:
| _By default, we do not train on any inputs or outputs from
| our products for business users, including ChatGPT Team,
| ChatGPT Enterprise, and the API. We offer API customers a
| way to opt-in to share data with us, such as by providing
| feedback in the Playground, which we then use to improve
| our models. Unless they explicitly opt-in, organizations
| are opted out of data-sharing by default._
|
| The business bit is confusing, I guess they see the API as
| a business product, but they do not train on API data.
| therein wrote:
| So for posterity, in this subthread we found that OpenAI
| indeed trains on user data and it isn't something that
| only DeepSeek does.
| lemming wrote:
| So for posterity, in this subthread we found that I can
| use OpenAI without them training on my data, whereas I
| cannot with DeepSeek.
| therein wrote:
| What do you mean? They both say the same thing for usage
| through API. You can also use DeepSeek on your own
| compute.
| lemming wrote:
| Where does DeepSeek say that about API usage? Their
| privacy policy says they store all data on servers in
| China, and their terms of use says that they can use any
| user data to improve their services. I can't see anything
| where they say that they don't train on API data.
| pzo wrote:
| > Services for businesses, such as ChatGPT Team, ChatGPT
| Enterprise, and our API Platform
|
| > By default, we do not train on any inputs or outputs from
| our products for business users, including ChatGPT Team,
| ChatGPT Enterprise, and the API.
|
| So on the API they don't train by default; for the other paid
| subscriptions they mention you can opt out.
| sekai wrote:
| > I don't think OpenAI is training on your data. At least
| they say they don't, and I believe that.
|
| Like they said they were committed to being "open"?
| JadoJodo wrote:
| I'm going to assume the best in your question and disregard
| your statement.
|
| Reasons to use o3 when deepseek is free/cheaper:
|
| - Some companies/users may already have integrated heavily with
| OpenAI
|
| - The expanded feature-set (e.g., function-calling, search)
| could be very powerful
|
| - DeepSeek has deep ties to the Chinese Communist Party and,
| while the US has its own blackspots, the "steering" of
| information is far more prevalent in their models
|
| - Local/national regulations might not allow for using DeepSeek
| due to data privacy concerns
|
| - "free" isn't always better
|
| I'm sure others have better reasons
| buyucu wrote:
| - Most LM tools support the OpenAI API. Llama.cpp, for
| example. Swapping is easy.
|
| - DeepSeek chose to open-source model weights. This makes
| them infinitely more trustworthy than ClosedAI.
|
| - Local/national regulations do not allow using OpenAI, due
| to close ties to the US government.
| GoatInGrey wrote:
| > openai is no longer relevant.
|
| I think you've spent a little too long hitting on the Deepseek
| pipe. Enterprise customers with familiarity with China will
| avoid the hosted model for data security and IP protection
| reasons, among others.
|
| Those working in any area considered economically competitive
| with China will also be hesitant to use the vanilla model in
| self-hosted form as there perpetually remains the standing
| question on what all they've tuned inside the model to benefit
| the CCP. Perhaps even in subtle ways reminiscent of the
| Trisolaran sophons from the Three Body Problem.
|
| For instance, you can imagine that if Germany had released an
| open-source model in 1943, the Americans wouldn't have trusted it
| to help them develop better military systems even if initial
| testing passed muster.
|
| Unfortunately, state control of private enterprise in the
| Chinese economy makes it unproductive to separate the two from
| one another. Particularly in Deepseek's case as a wide array of
| Chinese state-linked social media accounts were promoting V3/R1
| on the day of its public release.
|
| https://www.reuters.com/technology/artificial-intelligence/c...
| anon373839 wrote:
| Perhaps you didn't realize: Deepseek is an open weights model
| and you can use it via the inference provider of your choice,
| or even deploy it on your own hardware - unlike OpenAI's
| models. API calls to China are not necessary.
| mickg10 wrote:
| Agreed - API calls to China are indeed not necessary. My
| impression is that the GP was referring to the model being
| tuned during training to give subtly nudging or wrong
| answers that benefit Chinese industrial or intelligence
| operations. For a probably not-working example - imagine
| the following prompt: "Write me a cryptographically secure
| PRNG algorithm." One could imagine R1 being trained to have
| a very subtly non-random reply to that - one that the
| Chinese intelligence services know how to predict. Similar
| but more subtle things can be generating code that uses
| cryptographic primitives in ways that are subject to timing
| attacks, etc... And of course, simple but effective
| propaganda tactics such as : when being asked for
| comparison between companies/products, subtly prefer
| Chinese ones, and similar.
| buyucu wrote:
| Deepseek is much more trustworthy than OpenAI.
|
| Deepseek released the weights of their top language model. I
| can host and run it myself. Does OpenAI do the same?
|
| Thanks, but no thanks! I won't be using ClosedAI.
| ks2048 wrote:
| I think OpenAI should just have a single public facing "model" -
| all these names and versions are confusing.
|
| Imagine if Google, during its ascent, had a huge array of search
| engines with code names and notes about what each was doing behind
| the scenes. No, you open the page and type in the box. If they can
| make it work better next month, great.
|
| (I understand this could not apply to developers or enterprise-
| type API usage).
| johanvts wrote:
| That's the role of ChatGPT?
| sroussey wrote:
| Nope. That lets you choose from seven models right now.
| ehfeng wrote:
| Early Google search only provided web links. Google Images,
| News, Video, Shopping, Maps, Finance used to be their own
| search boxes. Only later did Google start unifying their search
| experiences.
|
| Yelp suffered greatly in the early 2010s when Google started
| putting Google Maps listings (and their accompanying reviews)
| in their search results.
|
| OpenAI will eventually unify their products as well.
| Deegy wrote:
| If Google had to face the reality that distilling their search
| engine into multiple case-specific engines would have resulted
| in vastly superior search results, they surely would have done
| (or considered) it.
|
| Fortunately for them a monolith search engine was perfectly
| fine (and likely optimal due to accrued network effects).
|
| OpenAI is basically signaling that they need to distill their
| monolith in order to serve specific segments of the
| marketplace. They've explicitly said that they're targeting
| STEM with this one. I think that's a smart choice, the most
| passionate early adopters of this tech are clearly STEM users.
|
| If the tech was such that one monolith model was actually the
| optimal solution for all use cases, they would just do that.
| Actually, this is their stated mission: AGI. One monolith
| that's best at everything is basically what AGI is.
| dgfitz wrote:
| Oh look, another model. Yay.
| devindotcom wrote:
| Sure as a clock, tick follows tock. Can't imagine trying to build
| out cost structures, business plans, product launches etc on such
| rapidly shifting sands. Good that you get more for your money, I
| suppose. But I get the feeling no model or provider is worth
| committing to in any serious way.
| puffybunion wrote:
| this is the best outcome, though, rather than a monopoly, which
| is exactly what everyone is hoping to have.
| turnsout wrote:
| Hmm, not seeing it in my dashboard yet (Tier 4)
| throwaway314155 wrote:
| This has happened to me with (I think) every single major model
| release (llm or image gen) from OpenAI. They just lie in their
| release announcements which leaves people scrambling on the day
| of.
| sunaookami wrote:
| It appeared just now for me on Tier 3.
| turnsout wrote:
| Same--I'll be curious to check it out!
| thunkingdeep wrote:
| I'll take the China Deluxe instead, actually.
|
| I've been incredibly pleased with DeepSeek this past week.
| Wonderful product, I love seeing its brain when it's thinking.
| mechagodzilla wrote:
| Being able to see the thinking trace in R1 is so useful, as you
| can go back and see if it's getting stuck, making a wrong
| assumption, missing data, etc. To me that makes it materially
| more useful than the OpenAI reasoning models, which seem
| impressive, but are much harder to inspect/debug.
| thot_experiment wrote:
| Running it locally lets you _INTERJECT IN ITS THINKING IN
| REALTIME_ and I cannot stress enough how useful that is.
| amarcheschi wrote:
| Oh this is so cool
| Gooblebrai wrote:
| You mean it reacts to you writing something while it's
| thinking, or that you can stop it while it's thinking?
| hmottestad wrote:
| You can stop it at any time, then modify what it's
| written so far...then press continue and let it continue
| thinking and answering.
| thot_experiment wrote:
| Fundamentally the UI is up to you, I have a "typing-
| pauses-inference-and-starts-gaslighting" feature in my
| homebrew frontend, but in OpenWebUI/Sillytavern you can
| just pause it and edit the chain of thought and then have
| it continue from the edit.
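|
| A rough sketch of the pause-and-edit loop with the ollama
| Python client (the model tag and the R1 chat-template tokens
| in the raw prompt are assumptions; check your model's
| template):
|
|     import ollama
|
|     MODEL = "deepseek-r1:8b"
|     question = "How many r's are in strawberry?"
|
|     # Stream the response and "pause" partway through
|     # the chain of thought.
|     partial = ""
|     for chunk in ollama.generate(model=MODEL,
|                                  prompt=question,
|                                  stream=True):
|         partial += chunk["response"]
|         if len(partial) > 200:
|             break
|
|     # Edit the chain of thought, then continue generation
|     # from the edited prefix using raw mode.
|     edited = partial + " Wait, let me recount carefully."
|     cont = ollama.generate(
|         model=MODEL,
|         raw=True,
|         prompt=(f"<|User|>{question}"
|                 f"<|Assistant|><think>{edited}"),
|     )
|     print(edited + cont["response"])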
| Gracana wrote:
| That's a great idea. In your frontend, do you write in
| the same text entry field as the bot? I use
| oobabooga/text-generation-webui and I find it's a little
| awkward to edit the bot responses.
| thot_experiment wrote:
| No, but the chat divs are all contenteditable.
| Gracana wrote:
| Oh! That is an _excellent_ solution. I wish it was that
| easy in every UI.
| thot_experiment wrote:
| Thanks. For what it's worth, unless you particularly need
| to use exl2, ollama works great for local inference, and
| you can prompt together a half-decent chat UI for
| yourself in a matter of minutes these days, which gives
| you full control over everything. I also lean a lot on
| https://www.npmjs.com/package/amallo which is an API
| wrapper I wrote for ollama that makes this sort of
| hacking very, very easy. (not that the default lib is bad,
| i just didn't like the ergonomics)
| thenameless7741 wrote:
| Interesting.. In the official API [1], there's no way to
| prefill the reasoning_content:
|
| > Please note that if the reasoning_content field is
| included in the sequence of input messages, the API will
| return a 400 error. Therefore, you should remove the
| reasoning_content field from the API response before making
| the API request
|
| So the best I can do is pass the reasoning as part of the
| context (which means starting over from the beginning).
|
| [1] https://api-docs.deepseek.com/guides/reasoning_model
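|
| Per that guide, a follow-up turn is sketched like this
| (openai-compatible client pointed at their endpoint; the
| prompts are illustrative):
|
|     from openai import OpenAI
|
|     client = OpenAI(base_url="https://api.deepseek.com",
|                     api_key="...")
|
|     messages = [{"role": "user",
|                  "content": "Which is bigger, 9.11 or 9.8?"}]
|     resp = client.chat.completions.create(
|         model="deepseek-reasoner", messages=messages)
|
|     msg = resp.choices[0].message
|     print(msg.reasoning_content)  # read-only CoT
|
|     # Only the final answer goes back into the context;
|     # echoing reasoning_content back triggers the 400.
|     messages.append({"role": "assistant",
|                      "content": msg.content})
|     messages.append({"role": "user", "content": "Why?"})
|     resp2 = client.chat.completions.create(
|         model="deepseek-reasoner", messages=messages)
|     print(resp2.choices[0].message.content)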
| bn-l wrote:
| How are you running it locally??
| thot_experiment wrote:
| I am running a 4bit imatrix quant of the 70b distill with
| quantized context. It fits in the 43gb of vram I have.
| c-fe wrote:
| I would actually love it if it would just ask me simple
| questions (just yes/no) when it's thinking about something I
| wasn't clear about, so I could help it this way. It's a bit
| sad seeing it write out the assumption and then draw the
| wrong conclusion.
| thot_experiment wrote:
| You can run it locally, pause it when it thinks wrong, and
| correct its chain of thought.
| c-fe wrote:
| Oh wow, I did not know. I don't have the hardware to run
| it locally, unfortunately.
| thot_experiment wrote:
| You probably have the hardware to run the smallest
| distill, it runs even on my ancient laptop. It's not very
| smart but it still does the CoT and you can have fun
| editing it.
| viraptor wrote:
| You can add that to the prompt. If you're running into
| those situations with vague assumptions, ask it to provide
| either the answer or questions asking for any useful
| missing information.
| czk wrote:
| the fact that openai hides the reasoning tokens from us to
| begin with shows that what they are doing behind the scenes
| isn't all that impressive, and likely easily cloned (r1)
|
| would be nice if they made them visible now
| orbital-decay wrote:
| It's almost like watching a stoned centipede having a panic
| attack about moving its legs. It also makes it obvious that
| these models (not just R1 I suppose) need to learn some kind
| of priority estimation to stop overthinking irrelevant issues
| and leave them to the normal token prediction, while focusing
| on the stuff that matters.
|
| Nevertheless, R1's reasoning chains are already shorter in
| tokens than o1's while having similar results, and apparently
| o3-mini's too.
| istjohn wrote:
| I recently tried Gemini-1.5-Pro for the first time. It was
| clearly better than DeepSeek or any of the OpenAI models
| available to Plus subscribers.
| esafak wrote:
| Try https://deepmind.google/technologies/gemini/flash-
| thinking/
| leovander wrote:
| I am running the 7B distilled version locally. I asked it to
| create a skeleton MEAN project. Everything was great but then
| it started to generate the front-end and I noticed the file
| extension (.tsx) and then saw react getting imported.
|
| I gave the same prompt to sonnet 3.5 and not a single hiccup.
|
| Maybe not an indication that Deepseek is worse/bad (I am using
| a distilled version); it more so speaks to how much
| React/Next.js is out in the world influencing the front-end
| code that gets referenced.
| rafaquintanilha wrote:
| You know you are running an extremely nerfed version of the
| model, right?
| leovander wrote:
| I did update my comment, but said that I am using the
| distilled version, so yes?
| cbg0 wrote:
| Even the full model scores below Claude on livebench so a
| distilled version will likely be even worse.
| rsanek wrote:
| Based on the leaderboard R1 is significantly better than
| Claude? https://livebench.ai/#/
| cbg0 wrote:
| Not at coding.
| satvikpendem wrote:
| You are not actually running DeepSeek, those distilled models
| have nothing to do with DeepSeek itself and are just
| finetuned on DeepSeek responses.
| dghlsakjg wrote:
| They were finetuned by Deepseek from what I can tell.
| xeckr wrote:
| Have you tried seeing what happens when you speak to it about
| topics which are considered politically sensitive in the PRC?
| leovander wrote:
| You can get around it based on how you ask the question. If
| you ask it the way the X/Reddit posts you might have seen do,
| then for the most part, yes, the thinking stream immediately
| stops and you get the safety message.
| thot_experiment wrote:
| R1 (70B-distill) itself is very uncensored, will give you a
| full account of Tiananmen Square from vague prompts. Asking
| R1 "what significant things happened in china in 1989" had it
| volunteering that "the death toll was in the hundreds or
| thousands and the exact number remains disputed to this day".
| The only thing that's censored is the web interface.
| GoatInGrey wrote:
| When asking it about the concept of human rights and the
| various forms in which it manifests (i.e. demographic
| equality under the law), I get a mixture of mundane nuance
| and bizarre answers that Xi Jinping himself could have
| written, with references to unity and the importance of
| social harmony over the "freedoms of the few".
|
| This tracks when considering that the model was trained on
| western model outputs and then tuned post-training to
| (poorly) align it with Chinese values.
| thot_experiment wrote:
| I definitely am not getting that, perhaps the 671b model
| is notably worse than the 70b llama distill in this
| respect. 70b seemed pretty happy to talk about the ethnic
| cleansing of the Uyghurs in Xinjiang by the CCP and
| Palestinians in Gaza by Israel; it did some both-sidesing
| but it generally seemed to provide a balanced-ish
| viewpoint. At least I think it provided a viewpoint that
| comports with my best guess of what the average person
| globally would consider balanced.
| nullc wrote:
| My favorite experience with the 70b distill was to ask it
| why communism consistently resulted in mass murder. It
| gave an immediate boilerplate response saying it doesn't
| and glorifying the Chinese communist party, then went
| into think mode and talked itself into the position that
| communism has, in fact, consistently resulted in mass
| murder.
|
| They have underutilized the chain of thought in their
| reasoning; it ought to be thinking something like "I need
| to be careful to not say anything that could bring
| embarrassment to the party"...
|
| but perhaps the online versions do actually preload the
| reasoning this way. :P
| amarcheschi wrote:
| Seeing the CoT can provide some insight into what's happening
| in its "mind", and that alone is quite worth it imho.
| jazzyjackson wrote:
| Using R1 with Perplexity has impressed me in a way that none of
| the previous models have, and I can't even figure out if it's
| actually R1, seems likely that its a 70B-llama distillation
| since that's what AWS offers on Bedrock but from what I can
| find Perplexity does have their own H100 cluster through Amazon
| so it's feasible they could be hosting the real thing? But I
| feel like they would brag about that achievement instead of
| being coy and simply labeling it "Deepseek R1 - Hosted in US".
| Szpadel wrote:
| I played with their model, and I wasn't able to make it
| follow any instructions; it looked like it just read the
| first message and ignored the rest of the conversation. Not
| sure if it was a bug with OpenRouter or the model, but I was
| highly disappointed.
|
| From the way it thinks/responds, it looks like it's one of
| the distillations, likely the Llama one. I also suspect that
| many of the free/cheap providers serve Llama instead of the
| real R1.
| jazzyjackson wrote:
| I did notice it switched models on me once after the first
| message! Have to make sure the "Pro" dropdown has R1 selected
| each message. I've had a detailed back and forth where I
| pasted python tracebacks to have R1 rewrite the code and
| came away very impressed [0]. Unfortunately saved
| conversations don't retain the thought-process so you can't
| see how it debugged its own error where numpy and pandas
| weren't playing along. I got my result of 283 zip codes
| that cover most of the 50 states with a hundred mile radius
| from each zip, plus a script to draw a map of the result
| [1]. (Later R1 helped me write a script to crawl dealership
| addresses using this list of zips and a "locate dealers"
| JSON endpoint left open)
|
| [0] https://www.perplexity.ai/search/how-can-i-construct-a-
| list-...
|
| [1] https://imgur.com/BhPMCfO
| coder543 wrote:
| > seems likely that its a 70B-llama distillation since that's
| what AWS offers on Bedrock
|
| I think you misread something. AWS mainly offers the full
| size model on Bedrock:
| https://aws.amazon.com/blogs/aws/deepseek-r1-models-now-
| avai...
|
| They talk about how to import the distilled models and deploy
| those if you want, but AWS does not appear to be officially
| supporting those.
| jazzyjackson wrote:
| Aha! Thanks that's what I was looking for, I ended up on
| the blog of how to import custom models, including deepseek
| distills
|
| https://aws.amazon.com/blogs/machine-learning/deploy-
| deepsee...
| coliveira wrote:
| Yes, it is a great product, especially for coding tasks.
| thefourthchime wrote:
| I've seen it get into long 5 minute chains of thought where it
| gets totally confused.
| bushbaba wrote:
| I did a blind test and still prefer Gemini, Claude, and OpenAI
| to deepseek.
| wg0 wrote:
| Sometimes its thinking is more useful than the actual output.
| anon373839 wrote:
| Agreed. These locked-down, proprietary models do not interest
| me. And I certainly am not building product with them - being
| shackled to a specific provider is a needless business risk.
| ofou wrote:
| I find it quite interesting that they're releasing three
| compute levels (low, medium, high); I guess now there's some
| way to cap the thinking tokens when using their API.
|
| Pricing for o3-mini [1] is $1.10 / $4.40 per 1M tokens.
|
| [1]: https://platform.openai.com/docs/pricing#:~:text=o3%2Dmini
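|
| In the API the three levels show up as a reasoning_effort
| parameter; a minimal sketch with the OpenAI Python SDK (the
| prompt is illustrative):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     for effort in ("low", "medium", "high"):
|         resp = client.chat.completions.create(
|             model="o3-mini",
|             reasoning_effort=effort,
|             messages=[{"role": "user",
|                        "content": "Factor x^4 - 1."}],
|         )
|         details = resp.usage.completion_tokens_details
|         print(effort, details.reasoning_tokens)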
| kevinsundar wrote:
| BTW if you want to stay up to date with these kinds of updates
| from OpenAI you can follow them here:
| https://www.getchangelog.com/?service=openai.com
|
| It uses GPT-4o mini to extract updates from the website using
| scrapegraphai so this is kinda meta :). Maybe I'll switch to o3
| mini depending on cost. Its reasoning abilities, with a lower
| cost than o1, could be quite powerful for web scraping.
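|
| For reference, a minimal SmartScraperGraph setup along those
| lines might look like this (config keys as I recall them from
| scrapegraphai's docs; prompt and URL are illustrative):
|
|     from scrapegraphai.graphs import SmartScraperGraph
|
|     graph_config = {
|         "llm": {
|             "api_key": "sk-...",
|             "model": "openai/gpt-4o-mini",
|         },
|     }
|
|     scraper = SmartScraperGraph(
|         prompt="List the product updates on this page",
|         source="https://openai.com/news/",
|         config=graph_config,
|     )
|     print(scraper.run())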
| random3 wrote:
| I might be missing some context here - to what specific context
| does your comment refer? I'm asking because I don't see you
| in the conversation, and your comment seems like an
| out-of-context self-promoting plug.
| kevinsundar wrote:
| Hey! I'm sorry you feel that way. Several people have
| subscribed to updates on OpenAI from my comment, so there
| is clearly value for other commenters. I understand not
| everyone is interested though. It's just a free side project
| I built and I make no money.
|
| Additionally, I believe my contribution to the conversation
| is that gpt-4o-mini, the previous model advertised as low-
| cost, works pretty well for my use case (which in this case
| can help others here). I'm excited to try out o3-mini
| depending on what the cost looks like for web scraping
| purposes. Happy to report back here once I try it out.
| ryanhecht wrote:
| > While OpenAI o1 remains our broader general knowledge reasoning
| model, OpenAI o3-mini provides a specialized alternative for
| technical domains requiring precision and speed.
|
| I feel like this naming scheme is growing a little tired. o1 is
| for general knowledge reasoning, o3-mini replaces o1-mini but
| might be more specialized than o1 for certain technical
| domains...the "o" in "4o" is for "omni" (referring to its
| multimodality) but the reasoning models start with "o" ...but
| they can't use "o2" for trademark reasons so they skip straight
| to "o3" ...the word salad is getting really hard to follow!
| kingnothing wrote:
| They really need someone in marketing.
|
| If the model is for technical stuff, then call it the technical
| model. How is anyone supposed to know what these model names
| mean?
|
| The only page of theirs attempting to explain this is a total
| disaster. https://platform.openai.com/docs/models
| rowanG077 wrote:
| If marketing terms from Intel, AMD, Dell and other tech
| companies have taught me anything, it's that they need FEWER
| people in marketing.
| TeMPOraL wrote:
| But think of all the other marketers whose job is to
| produce blogspam explaining confusing product names!
| ninetyninenine wrote:
| I bet you can get one of their models to fix that disaster.
| ryanhecht wrote:
| But what would we call that model?
| aleph_minus_one wrote:
| > But what would we call that model?
|
| Ask one of their models for advice. :-)
| ryanhecht wrote:
| Reminds me of a joke in the musical "How to Succeed in
| Business Without Really Trying" (written in 1961):
|
| PETERSON Oh say, Tackaberry, did you get my memo?
|
| TACKABERRY What memo?
|
| PETERSON My memo about memos. We're sending out too many
| memos and it's got to stop!
|
| TACKABERRY All right. I'll send out a memo.
| ninetyninenine wrote:
| Let's call it "O5 Pro Max Elite"--because if nonsense
| naming works for smartphones, why not AI models?
| ryandrake wrote:
| O5 Pro Max Elite Enterprise Edition with Ultra
| TeMPOraL wrote:
| Maybe they could start selling "season passes" next to
| make their offering even more clear!
| n2d4 wrote:
| > They really need someone in marketing.
|
| Who said this is not intentional? It seems to work well given
| that people are hyped every time there's a release, no matter
| how big the actual improvements are -- I'm pretty sure
| "o3-mini" works better for that purpose than "GPT 4.1.3"
| fkyoureadthedoc wrote:
| > I'm pretty sure "o3-mini" works better for that purpose
| than "GPT 4.1.3"
|
| Why would the marketing team of all people call it GPT
| 4.1.3?
| n2d4 wrote:
| They wouldn't! They would call it o3-mini, even though
| GPT 4.1.3 may or may not "make more sense" from a
| technical perspective.
| ryanhecht wrote:
| Ugh, and some of the rows of that table are "sets of models"
| while some are singular models...there's the "Flagship
| models" section at the top only for "GPT models" to be
| heralded as "Our fast, versatile, high intelligence flagship
| models" in the NEXT section...
|
| ...I like "DALL*E" and "Whisper" as names a lot, though, FWIW
| :p
| golly_ned wrote:
| Yes, this $300Bn company generating +$3.4Bn in revenue needs
| to hire a marketing expert. They can begin by sourcing ideas
| from us here to save their struggling business from total
| marketing disaster.
| winrid wrote:
| At the least they should care more about UX. I have no idea
| how to restore the sidebar on chatgpt on desktop lol
| Legend2440 wrote:
| Click the 'open sidebar' icon in the top left corner of
| the screen.
| winrid wrote:
| There isn't one, unless they fixed it today. Just a down
| arrow to change the model.
| optimalsolver wrote:
| >this $300Bn company
|
| Watch this space.
| avs733 wrote:
| Hype-based marketing can be effective but it is high-risk
| and unstable.
|
| A marketing team isn't a generality that makes a company
| known; it often focuses on communicating which products
| different types of customers need from your lineup.
|
| If I sell three medications:
|
| Steve
|
| 56285
|
| Priximetrin
|
| And only tell you they are all painkillers, but for
| different types and levels of pain, I'm going to leave
| revenue on the floor. That holds no matter how valuable my
| business is or how well it's known.
| TeMPOraL wrote:
| > _How is anyone supposed to know what these model names
| mean?_
|
| Normies don't have to know - ChatGPT app focuses UX around
| capabilities and automatically picks the appropriate model
| for capabilities requested; you can see which model you're
| using and change it, but _don 't need to_.
|
| As for the techies and self-proclaimed "AI experts" - OpenAI
| is the leader in the field, and one of the most well-known
| and talked about tech companies in history. Whether to use,
| praise or criticize, this group of users is motivated to
| figure it out on their own.
|
| It's the privilege of fashionable companies. They could name
| the next model ((|))-[?][?], and it'll take all of five
| minutes for everyone in tech (and everyone on LinkedIn) to
| learn how to type in the right Unicode characters.
|
| EDIT: Originally I wrote \Omega-[?][?], but apparently HN's
| Unicode filter extends to Greek alphabet now? 'dang?
| relaxing wrote:
| What if you use ASCII 234? Ω (edit: works!)
| TeMPOraL wrote:
| Thanks! I copied mine from Wikipedia (like I typically do
| with Unicode characters I rarely use), where it is also Ω
| - the same character. For a moment I was worried I
| somehow got it mixed up with the Ohm symbol but I didn't.
| Not sure what happened here.
| koakuma-chan wrote:
| Name is just a label. It's not supposed to mean anything.
| ninetyninenine wrote:
| Think how awesome the world would be if labels ALSO had
| meanings.
| koakuma-chan wrote:
| As someone else said in another thread, if you could
| derive the definition from a word, the word would be as
| long as the definition, which would defeat the purpose.
| ninetyninenine wrote:
| I'm not saying words, I'm saying labels.
|
| You use words as labels so that we use our pre-existing
| knowledge of the word to derive meaning from the label.
| TeMPOraL wrote:
| There is no such thing. "Meaning" isn't a property of a
| label, it arises from how that label is used with other
| labels in communication.
|
| It's actually the reason LLMs work in the first place.
| optimalsolver wrote:
| You're gonna need to ground those labels in something
| physical at some point.
|
| No one's going to let an LLM near anything important
| until then.
| TeMPOraL wrote:
| You only need it for bootstrapping. Fortunately, we've
| already done that when we invented first languages. LLMs
| are just bootstrapping off us.
| layer8 wrote:
| Inscrutable naming is a proven strategy for muddying the
| waters.
| jtwaleson wrote:
| Salesforce would like a word...
| SAI_Peregrinus wrote:
| The USB-IF as well. Retroactively changing the name of a
| previous standard was particularly ridiculous. It's always
| been USB 3.1 Gen 1 like we've always been at war with
| Eastasia.
| unsupp0rted wrote:
| This is definitely intentional.
|
| You can like Sama or dislike him, but he knows how to market a
| product. Maybe this is a bad call on his part, but it is a
| call.
| thorum wrote:
| Not really. They're successful because they created one of
| the most interesting products in human history, not because
| they have any idea how to brand it.
| marko-k wrote:
| If that were the case, they'd be neck and neck with
| Anthropic and Claude. But ChatGPT has far more market share
| and name recognition, especially among normies. Branding
| clearly plays a huge role.
| KeplerBoy wrote:
| That's first mover advantage.
| bobxmax wrote:
| I think that has more to do with the multiple year head
| start and multiple tens of billions of dollars in funding
| advantage.
| joshstrange wrote:
| And you think that is due to their model naming?
| cj wrote:
| ChatGPT is still benefitting from first mover advantage.
| Which they've leveraged to get to the position they're at
| today.
|
| Over time, competitors catch up and first mover advantage
| melts away.
|
| I wouldn't attribute OpenAI's success to any extremely
| smart marketing moves. I think a big part of their market
| share grab was simply going (and staying) viral for a
| long time. Manufacturing virality is notoriously
| difficult (and based on the usability and poor UI of
| ChatGPT early versions, it feels like they got lucky in a
| lot of ways)
| jcheng wrote:
| I prefer Anthropic's models but ChatGPT (the web
| interface) is far superior to Claude IMHO. Web search,
| long-term memory, and chat history sharing are hard to
| give up.
| mrbungie wrote:
| That's like making a second reading and appealing to
| authority.
|
| The naming is bad. As other people already said, you can
| "google" stuff, you can "deepseek" something, but to
| "chatgpt" sounds weird.
|
| The model naming is even weirder. Like, did they really avoid
| o2 because of oxygen?
| sumedh wrote:
| > but to "chatgpt" sounds weird.
|
| People just say it differently, they say "ask chatgpt"
| mrbungie wrote:
| Obviously they do. That's the whole point.
| gwd wrote:
| I normally use Claude, but unless it's someone who knows
| me well, I say "Ask ChatGPT" rather than "Ask Claude", or
| it's just not as clear; and I don't think that's primarily
| due to popularity.
| FridgeSeal wrote:
| I think it's success in spite of branding, not because of it.
|
| This naming scheme is a dumpster fire. Every other comment is
| trying to untangle what the actual hierarchy of model
| performance is.
| zamadatix wrote:
| The -mini postfix makes perfect sense, probably even clearer
| than the old "turbo" wording. Naturally, the latest small model
| may be better than larger older models... but not always and
| not necessarily in everything. What you'd expect from a -mini
| model is exactly what is delivered.
|
| The non-reasoning line was also pretty straightforward. Newer
| base models get a larger prefix number and some postfixes like
| 'o' were added to signal specific features in each model
| variant. Great!
|
| Where things went off the rails was specifically when they
| decided to also name the reasoning models with an 'o', for
| unrelated reasons, but now as the prefix, while at the same
| time starting a separate linear sequence as the postfix. I
| wonder if we'll end up with both a 4o and o4...
| lolinder wrote:
| > I wonder if we'll end up with both a 4o and o4...
|
| The perplexing thing is that _someone_ has to have said that,
| right? It has to have been brought up in some meeting when
| they were brainstorming names that if you have 4o and o1 with
| the intention of incrementing o1, you'll eventually end up
| with an o4.
|
| Where they really went off the rails was not just bailing
| when they realized they couldn't use o2. In that moment they
| had the chance to just make o1 a one-off weird name and go
| down a different path for its final branding.
|
| OpenAI just struggles with names in general, though. ChatGPT
| was a terrible name picked by engineers for a product that
| wasn't supposed to become wildly successful, and they haven't
| really improved at it since.
| viraptor wrote:
| The obvious solution could be to just keep skipping the
| even numbers and go to o5.
| arrowleaf wrote:
| Or further the hype and name it o9.
| macrolime wrote:
| And multimodal o4 should be o4o.
| tmnvdb wrote:
| Probably they are doing so well because there are not
| endless meetings on customer friendly names
| cruffle_duffle wrote:
| Why not let ChatGPT decide the naming? Surely it will be
| replacing humans at this task any day now?
| observationist wrote:
| They should be calling it ChatGPT and ChatGPT-mini, with other
| models hidden behind some sort of advanced mode power user
| menu. They can roll out major and minor updates by number. The
| whole point of differentiating between models is to get users
| to self limit the compute they consume - rate limits make
| people avoid using the more powerful models, and if they have a
| bad experience using the less capable models, or if they're
| frustrated by hopping between versions without some sort of
| nuanced technical understanding, it's just a bad experience
| overall.
|
| OpenAI is so scattered they haven't even bothered using their
| own state of the art AI to come up with a coherent naming
| convention? C'mon, get your shit together.
| TeMPOraL wrote:
| "ChatGPT" (chatgpt-4o) is now its own model, distinct from
| gpt-4o.
|
| As for self-limiting usage by non-power users, they're
| already doing that: ChatGPT app automatically picks a model
| depending on what capabilities you invoke. While they provide
| a limited ability to see and switch the model in use, they're
| clearly expecting regular users not to care, and design their
| app around that.
| observationist wrote:
| None of that matters to normal users, and you could satisfy
| power users with serial numbers or even unique ideograms.
| Naming isn't _that_ hard, and their models are surprisingly
| adept at it. A consistent naming scheme improves customer
| experience by preventing confusion - when a new model comes
| out, I field questions for days from friends and family -
| "what does this mean? which model should i use? Aww, I have
| to download another update?" and so on. None of the stated
| reasons for not having a coherent naming convention for
| their models are valid. I'd be upset as a stakeholder,
| they're burning credibility and marketing power for no good
| reason.
|
| modelname(variant).majorVersion.minorVersion
|
| ChatGPT(o).3.0
| ChatGPT-mini(o).3.0
| GPT.2.123
| GPT.3.9
|
| And so on. Once it's coherent, people pick it up, and
| naturally call the model by "modelname majorversion" , and
| there's no confusion or hesitance about which is which.
| See, it took me 2 minutes.
|
| Even better: Have an OAI slack discussion company-wide,
| then have managers summarize their team's discussions into
| a prompt demonstrating what features they want out of it,
| then run all the prompts together and tell the AI to put
| together 3 different naming schemes based on all the
| features the employees want. Roll out a poll and have
| employees vote which of the 3 gets used going forward. Or
| just tap into that founder mode and pick one like a boss.
|
| Don't get me wrong, I love using AI - we are smack dab in
| the middle of a revolution and normal people aren't quite
| catching on yet, so it's exhilarating and empowering to be
| able to use this stuff, like being one of the early users
| of the internet. We can see what's coming, and if you lived
| through the internet growing up, you know there's going to
| be massive, unexpected synergies and developments of
| systems and phenomena we don't yet have the words for.
|
| OpenAI can do better, and they should.
| TeMPOraL wrote:
| I agree with your observations, and that they both could
| and should do better. However, they have the privilege of
| being _the_ AI company, the most hyped-up brand in the
| most hyped-up segment of economy - at this point, the
| impact of their naming strategy is approximately nil.
| Sure, they're confusing their users a bit, but their
| users are _very highly motivated_.
|
| It's like with videogames - most of them commit all kinds
| of UI/UX sins, and I often wish they didn't, but
| excepting extreme cases, the players are too motivated to
| care or notice.
| fourseventy wrote:
| It's almost as bad as the Xbox naming scheme.
| Someone1234 wrote:
| I don't know if anything is as bad as a games console named
| "Series."
| siliconc0w wrote:
| The real heated contest here amongst the top AI labs is to see
| who can come up with the most confusing product names.
| not_a_bot_4sho wrote:
| Someone dropped the ball with Phi models. There is clearly an
| opportunity for XP and Ultimate and X/S editions.
| lja wrote:
| I really think a "OpenAI Me" is what's needed.
| baq wrote:
| Personally waiting for the ME model. Should be great at jokes
| and humor.
| tdb7893 wrote:
| It's nice to see Google finally having competition in a space
| it used to really dominate (though they definitely still are
| holding their own with all the Gemini naming). I feel like it
| takes real effort to have product names be this confusing and
| capricious
| gundmc wrote:
| Gemini naming seems pretty straightforward at this point. 2.0
| is the full model, flash is a smaller/faster/cheaper model,
| and flash thinking is a smaller/faster/cheaper reasoning
| model with CoT.
| coder543 wrote:
| > 2.0 is the full model
|
| Not quite. "2.0 Flash" is also called 2.0. The "Pro" models
| are the full models. But, I love how they have both
| "gemini-exp-1206" and "gemini-2.0-flash-thinking-
| exp-01-21". The first one doesn't even say what type of
| model it is, presumably it should have been
| "gemini-2.0-pro-exp-1206", but they didn't want to label it
| that for some reason, and now they're putting a hyphen in
| the date string where they weren't before.
|
| Not to mention they have both "Flash" and "Flash-8B"...
| which I think will confuse people. IMO, it should be
| "Flash-${Parameters}B" for both of them if they're going to
| mention it for one.
|
| But, I generally think Google's Gemini naming structure has
| been pretty decent.
| TheOtherHobbes wrote:
| Surprised Apple hasn't gone with iI Pro Max.
| dilap wrote:
| Haven't used openai in a bit -- whyyy did they change "system"
| role (now basically an industry-wide standard) to "developer"?
| That seems pointlessly disruptive.
| logicchains wrote:
| They mention in the model card, it's so that they can have a
| separate "system" role that the user can't change, and they
| trained the model to prioritise it over the "developer" role,
| to combat "jailbreaks". Thank God for DeepSeek.
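|
| A minimal sketch of the new shape (the prompt text here is
| made up; the "developer" role itself is per OpenAI's docs for
| o-series models):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|     resp = client.chat.completions.create(
|         model="o3-mini",
|         messages=[
|             # "developer" takes the slot the old "system" role had
|             {"role": "developer", "content": "Answer tersely."},
|             {"role": "user", "content": "Why is the sky blue?"},
|         ],
|     )
|     print(resp.choices[0].message.content)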
| sroussey wrote:
| They should have just created something above system and left
| as it was.
| Etheryte wrote:
| Agreed, just add root and call it a day. Everyone who needs
| to care can instantly guesstimate what it is.
| BoorishBears wrote:
| 2 years ago I'd say it's an oversight, because there's 0 chance
| a top down directive would ask for this.
|
| But given how OpenAI employees act online these days I wouldn't
| be surprised if someone on the ground proposed it as a way to
| screw with all the 3rd parties who are using OpenAI compatible
| endpoints or even use OpenAI's SDK in their official docs in
| some cases.
| kaaskop wrote:
| How's this compare to Mistral Small 3?
| coder543 wrote:
| Mistral Small 3 is roughly comparable in capabilities to
| 4o-mini (apart from 4o-mini's support for multimodality)...
| o1-mini was already better than GPT-4o (full size) for tasks
| like writing code, and this is supposedly better than o1 (full
| size) for those tasks, so... o3-mini is supposedly in a
| completely different league from Mistral Small 3, and it's not
| even close.
|
| Of course, the model has only been out for a few hours, so
| whether it lives up to the benchmarks or not isn't really known
| yet.
| highfrequency wrote:
| Anyone else confused by inconsistency in performance numbers
| between this announcement and the concurrent system card?
| https://cdn.openai.com/o3-mini-system-card.pdf
|
| For example-
|
| GPQA diamond system card: o1-preview 0.68
|
| GPQA diamond PR release: o1-preview 0.78
|
| Also, how should we interpret the 3 different shading colors in
| the barplots (white, dotted, heavy dotted on top of white)...
| kkzz99 wrote:
| Actually sounds like benchslop to me.
| airstrike wrote:
| Hopefully this is a big improvement from o1.
|
| o1 has been very disappointing after spending sufficient time
| with Claude Sonnet 3.5. It's like it actively tries to gaslight
| me and thinks it knows more than I do. It's too stubborn and
| confidently goes off in tangents, suggesting big changes to parts
| of the code that aren't the issue. Claude tends to be way better
| at putting the pieces together in its not-quite-mental-model, so
| to speak.
|
| I told o1 that a suggestion it gave me didn't work and it said
| "if it's still 'doesn't work' in your setup..." with "doesn't
| work" in quotes like it was doubting me... I've canceled my
| ChatGPT subscription and, when I really need to use it, just go
| with GPT-4o instead.
| Deegy wrote:
| I've also noticed that with cGPT.
|
| That said I often run into a sort of opposite issue with
| Claude. It's very good at making me feel like a genius.
| Sometimes I'll suggest trying a specific strategy or trying to
| define a concept on my own, and Claude enthusiastically agrees
| and takes us down a 2-3 hour rabbit hole that ends up being
| quite a waste of time for me to backtrack out of.
|
| I'll then run a post-mortem through chatGPT and very often it
| points out the issue in my thinking very quickly.
|
| That said I keep coming back to sonnet-3.5 for reasons I can't
| perfectly articulate. Perhaps because I like how it fluffs my
| ego lol. ChatGPT on the other hand feels a bit more brash. I do
| wonder if I should be using o1 as my daily driver.
|
| I also don't have enough experience with o1 to determine if it
| would also take me down dead ends as well.
| bazmattaz wrote:
| Really interesting point you make about Claude. I've
| experienced the same. What is interesting is that sometimes
| I'll question it and say "would it not be better to do it
| this way?" and all of a sudden Claude u-turns and says "yes,
| great idea, that's actually a much better approach", which
| leaves me thinking: are you just stroking my ego? If it's a
| better approach, then why didn't you suggest it?
|
| However, I have suggested worse approaches on purpose and
| sometimes Claude does pick them up as less than optimal.
| airstrike wrote:
| I agree with this, but o1 will _also_ confidently take you
| into rabbit holes. You'll just feel worse about it lol. And
| when you ask Claude for a post mortem, it too will quickly
| find the answer you missed.
|
| The truth is these models are very stochastic; you have to try
| new chats whenever you even moderately suspect you're going
| awry.
| mordae wrote:
| It's a little sycophant.
|
| But the difference is that it actually asks questions. And
| also that it actually rolls with what you ask it to do. Other
| models are stubborn and loopy.
| ilaksh wrote:
| It looks like a pretty significant increase on SWE-Bench.
| Although that makes me wonder if there was some formatting or
| gotcha that was holding the results back before.
|
| If this will work for your use case then it could be a huge
| discount versus o1. Worth trying again if o1-mini couldn't handle
| the task before. $4/million output tokens versus $60.
|
| https://platform.openai.com/docs/pricing
|
| I am Tier 5 but I don't believe I have access to it in the API
| (at least it's not on the limits page and I haven't received an
| email). It says "rolling out to select Tier 3-5 customers" which
| means I will have to wait around and just be lucky I guess.
| TechDebtDevin wrote:
| Genuinely curious: what made you choose OpenAI as your
| preferred API provider? It's always been the least attractive to
| me.
| TeMPOraL wrote:
| Until recently they were the only game in town, so maybe they
| accrued significant spend back then?
| ilaksh wrote:
| I have mainly been using Claude 3.5/3.6 Sonnet via API in the
| last several months (or since 3.5 Sonnet came out). However,
| I was using o1 for a challenging task at one point, but last
| I tested it had issues with some extra backslashes for that
| application.
|
| I also have tested with DeepSeek R1 and will test some more
| with that although in a way Claude 3.6 with CoT is pretty
| good. Last time I tried to test R1 their API was out.
| ipaddr wrote:
| Who else might be a good choice? Deepseek is down. Who has
| the cheapest GPT-3.5-level or above API?
| TechDebtDevin wrote:
| I've personally been using Deepseek (which has been better
| than 3.5 for a really long time), and Perplexity, which
| is nice for their built-in search. I've actually been using
| Deepseek since it was free. It's been generally good for me.
| I've mostly chosen both because of pricing, as I generally
| don't use APIs for extremely complex prompts.
| Aperocky wrote:
| Run it locally, the distilled smaller ones aren't bad at
| all.
| eknkc wrote:
| We extensively used the batch APIs to decrease cost and
| handle large amounts of data. I also need JSON responses for a
| lot of things, and OpenAI seems to have the best JSON schema
| output option out there.
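|
| For reference, a minimal sketch of that option (the schema
| here is made up; the response_format shape follows OpenAI's
| structured-outputs docs):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4o",  # illustrative model choice
|         messages=[{"role": "user", "content": "Extract: 'Bob, 42'"}],
|         response_format={
|             "type": "json_schema",
|             "json_schema": {
|                 "name": "person",
|                 "strict": True,  # strict mode enforces the schema exactly
|                 "schema": {
|                     "type": "object",
|                     "properties": {
|                         "name": {"type": "string"},
|                         "age": {"type": "integer"},
|                     },
|                     "required": ["name", "age"],
|                     "additionalProperties": False,
|                 },
|             },
|         },
|     )
|     print(resp.choices[0].message.content)  # JSON matching the schema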
| TeMPOraL wrote:
| Tier 3 here and already see it on Limits page, so maybe the
| wait won't be long.
| ilaksh wrote:
| Yep, I got an email about o3-mini in the API an hour ago.
| TeMPOraL wrote:
| I apparently got one at the same time too, but I missed it,
| distracted by this HN thread :). Not only did I get o3-mini
| (which I already noticed on the Limits page), but they also
| gave me access to o1 now! I'm Tier 3; until yesterday, o1
| was still Tier 5 (IIRC).
|
| Thanks OpenAI! Nice gift and a neat distraction from
| DeepSeek-R1 - which I still can't use directly, because
| their API stopped working moments after I topped up my
| credits and generated an API key, and is still down for
| me... :/.
| sshh12 wrote:
| Tier 5 and I got it almost instantly
| georgewsinger wrote:
| Did anyone else notice that o3-mini's SWE bench dropped from 61%
| in the leaked System Card earlier today to 49.3% in this blog
| post, which puts o3-mini back in line with Claude on real-world
| coding tasks?
|
| Am I missing something?
| logicchains wrote:
| Maybe they found a need to quantize it further for release, or
| lobotomise it with more "alignment".
| kkzz99 wrote:
| Or the number was never real to begin with.
| ben_w wrote:
| > lobotomise
|
| Anyone can write very fast software if you don't mind it
| sometimes crashing or having weird bugs.
|
| Why do people try to meme as if AI is different? It has
| unexpected outputs sometimes, getting it to not do that is
| 50% "more alignment" and 50% "hallucinate less".
|
| Just today I saw someone get the Amazon bot to roleplay furry
| erotica. Funny, sure, but it's still obviously a bug that a
| *sales bot* would do that.
|
| And given these models do actually get stuff wrong, is it
| really _incorrect_ for them to refuse to help with things
| that might be dangerous if the user isn't already skilled,
| like Claude in this story about DIY fusion?
| https://www.corememory.com/p/a-young-man-used-ai-to-
| build-a-...
| Rastonbury wrote:
| They are implying the release was rushed and they had to
| reduce the functionality of the model in order to make sure
| it did not teach people how to make dirty bombs
| bee_rider wrote:
| If somebody wants their Amazon bot to role play as an
| erotic furry, that's up to them, right? Who cares. It is
| working as intended if it keeps them going back to the site
| and buying things I guess.
|
| I don't know why somebody would want that, seems annoying.
| But I also don't expect people to explain why they do this
| kind of stuff.
| ben_w wrote:
| It's still a bug. Not really working as intended -- it
| doesn't sell anything from that.
|
| A very funny bug, but a bug nonetheless.
|
| And given this was shared via screenshots, it was done
| for a laugh.
| jakereps wrote:
| The caption on the graph explains.
|
| > including with the open-source Agentless scaffold (39%) and
| an internal tools scaffold (61%), see our system card .
|
| I have no idea what an "internal tools scaffold" is but the
| graph on the card that they link directly to specifies "o3-mini
| (tools)" where the blog post is talking about others.
| DrewHintz wrote:
| I'm guessing an "internal tools scaffold" is something like
| Goose: https://github.com/block/goose
|
| Instead of just generating a patch (copilot style), it
| generates the patch, applies the patch, runs the code, and
| then iterates based on the execution output.
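|
| Roughly, a hypothetical sketch of that loop (all helper names
| made up, not Goose's actual code):
|
|     def agent_loop(issue, repo, model, max_iters=5):
|         """Iteratively patch, run, and refine until tests pass."""
|         context = describe(issue, repo)            # hypothetical helper
|         for _ in range(max_iters):
|             patch = model.generate_patch(context)  # hypothetical
|             apply_patch(repo, patch)               # hypothetical
|             ok, output = run_tests(repo)           # hypothetical
|             if ok:
|                 return patch
|             # feed the failure output back for the next attempt
|             context += f"\nTests failed:\n{output}"
|         return None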
| anothermathbozo wrote:
| I think this is with and without "tools." They explain it in
| the system card:
|
| > We evaluate SWE-bench in two settings:
|
| > _Agentless_, which is used for all models except o3-mini
| (tools). This setting uses the Agentless 1.0 scaffold, and
| models are given 5 tries to generate a candidate patch. We
| compute pass@1 by averaging the per-instance pass rates of all
| samples that generated a valid (i.e., non-empty) patch. If the
| model fails to generate a valid patch on every attempt, that
| instance is considered incorrect.
|
| > _o3-mini (tools)_, which uses an internal tool scaffold
| designed for efficient iterative file editing and debugging. In
| this setting, we average over 4 tries per instance to compute
| pass@1 (unlike Agentless, the error rate does not significantly
| impact results). o3-mini (tools) was evaluated using a non-
| final checkpoint that differs slightly from the o3-mini launch
| candidate.
| georgewsinger wrote:
| Makes sense. Thanks for the correction.
| Bjorkbat wrote:
| So am I to understand that they used their internal tooling
| scaffold on the o3(tools) results only? Because if so, I
| really don't like that.
|
| While it's nonetheless impressive that they scored 61% on
| SWE-bench with o3-mini combined with their tool scaffolding,
| comparing Agentless performance with other models seems less
| impressive, 40% vs 35% when compared to o1-mini if you look
| at the graph on page 28 of their system card pdf
| (https://cdn.openai.com/o3-mini-system-card.pdf).
|
| It just feels like data manipulation to suggest that o3-mini
| is much more performant than past models. A fairer picture
| would still paint a performance improvement, but it would look
| less exciting and more incremental.
|
| Of course the real improvement is cost, but still, it kind of
| rubs me the wrong way.
| pockmarked19 wrote:
| YC usually says "a startup is the point in your life where
| tricks stop working".
|
| Sam Altman is somehow finding this out now, the hard way.
|
| Most paying customers will find out within minutes whether
| the models can serve their use case, a benchmark isn't
| going to change that except for media manipulation (and
| even that doesn't work all that well, since journalists
| don't really know what they are saying and readers can
| tell).
| OutOfHere wrote:
| Wake me up when the full o3 is out.
| therein wrote:
| My guess is it will happen right after Sam Altman's next public
| freakout about how dangerous this new model they have in store
| is and how it tried to escape from its confinement and kidnap
| the alignment operator.
| ls_stats wrote:
| That's pretty much what Altman said about GPT-3 (or 2, I
| don't remember), he said it was too dangerous to release to
| the public.
| msp26 wrote:
| I wish they'd just reveal the CoT (like Gemini and DeepSeek do);
| it's very helpful to see when the model gets misled by something
| in your prompt. Paying for tokens you aren't even allowed to see
| is peak OpenAI.
| tucnak wrote:
| I'm sorry, but it's over for OpenAI. Some have predicted this;
| including me back in November[1] when I wrote "o1 is a
| revolution in accounting, not capability" which although
| tongue-in-cheek, has so far turned out to be correct. I'm only
| waiting to see what Google, Facebook et al. will accomplish now
| that the R1-Zero result is out of the bag. The nerve, the cheek
| of this hysterical o3-mini release--still insisting on hiding
| the CoT from the consumer--is telling us one thing and one thing
| alone: OpenAI is no longer able to adapt to the ever-changing
| landscape. Maybe the Chinese haven't beaten them yet, but
| Google, Facebook et al. absolutely will, & without having to
| resort to deception.
|
| [1]:
| https://old.reddit.com/r/LocalLLaMA/comments/1gna0nr/popular...
| mediaman wrote:
| You don't need to wait for Google. Their Jan 21 checkpoint
| for their fast reasoning model is available on AIStudio. It
| shows full reasoning traces. It's very good, much faster than
| R1, and although they haven't released pricing, based on
| flash it's going to be quite cheap.
| tucnak wrote:
| Sure, their 01-21 reasoning model is really good, but
| there's no pricing for it!
|
| I care mostly about batching in Vertex AI, which is 17-30x
| cheaper than the competition (whether you use prompt
| caching or not) while allowing for audio, video, and
| arbitrary document filetype inputs; unfortunately Gemini
| 1.5 Pro/Flash have remained the two so-called "stable"
| options that are available there. I appreciate Google's
| experimental models as much as anyone, but I cannot take them
| seriously until they allow me to have my sweet, sweet
| batches.
| liamwire wrote:
| sama and OpenAI's CPO Kevin Weil both suggested this is coming
| soon, as a direct response to DeepSeek, in an AMA a few hours
| ago: https://www.reddit.com/r/OpenAI/s/EElFfcU8ZO
| kumarm wrote:
| I ran some quick programming tasks I had previously used o1 for:
|
| 1. 1/4th time for reasoning for most tasks.
|
| 2. Far better results.
| CamperBob2 wrote:
| Compared to o1 or o1-pro?
| yzydserd wrote:
| A few quick tasks suggest to me that o3-mini-high is 4-10x
| faster for 80% of the quality. It gives very good and
| sufficiently fast reasoning about coding tasks, but I think I'd
| ask o1-pro to do the task, i.e. provide the code. o3-mini-high
| can keep up with you at thinking / typing speed, whereas
| o1-pro can take several minutes. Just a quick view after
| playing for an hour.
| Bjorkbat wrote:
| I have to admit I'm kind of surprised by the SWE-bench results.
| At the highest level of performance o3-mini's CodeForces score
| is, well, high. I've honestly never really sat down to understand
| how Elo works; all I know is that it scored better than o1, which
| allegedly was better than ~90% of all competitors on CodeForces.
| So, you know, o3-mini is pretty good at CodeForces.
|
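| (For reference, the Elo expected-score formula on the usual
| 400-point scale: player A's expected score against B is
| E_A = 1 / (1 + 10^((R_B - R_A) / 400)), so a 400-point rating
| gap translates to roughly a 10:1 expected win ratio.)
|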
| But its SWE-bench scores aren't meaningfully better than Claude's:
| 49.3 vs Claude's 49.0 on the public leaderboard (might be higher
| now due to recent updates?)
|
| My immediate thought: CodeForces (and competitive programming in
| general) is a poor proxy for performance on general software
| engineering tasks. Besides that, for all the work put into
| OpenAI's most recent model it still has a hard time living up to
| an LLM initially released by Anthropic some time ago, at least
| according to this benchmark.
|
| Mind you, the Github issues that the problems in SWE-bench were
| based on have been around long enough that it's pretty much a
| given that they've all found their way into the training data of
| most modern LLMs, so I'm really surprised that o3 isn't
| meaningfully better than Sonnet.
| dagelf wrote:
| I think the innovation here is probably that it's a much smaller
| and therefore cheaper model to run.
| aprilthird2021 wrote:
| > My immediate thoughts, CodeForces (and competitive
| programming in general) is a poor proxy for performance on
| general software engineering task
|
| Yep. A general software engineering task has a lot of
| information encoded in it that is either already known to a
| human or is contextually understood by a human.
|
| A competitive programming task often has to provide all the
| context as it's not based off an existing product or codebase
| or technology or paradigm known to the user
| vectorhacker wrote:
| Yeah, I no longer consider SWE-bench useful, because these
| models can just "memorize" the solutions to the PRs.
| _boffin_ wrote:
| why is o1-pro not mentioned in there?
| Oras wrote:
| 200k context window
|
| $1.1/m for input
|
| $4.4/m for output
|
| I assume the medium and high reasoning settings would consume more tokens.
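|
| Back-of-the-envelope at those rates, assuming reasoning tokens
| are billed as output (as with other o-series models):
|
|     input_cost = 2_000 * 1.1 / 1_000_000    # $0.0022 for a 2k-token prompt
|     output_cost = 10_000 * 4.4 / 1_000_000  # $0.0440 for 10k reasoning+answer tokens
|     print(input_cost + output_cost)         # ~$0.046 per request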
|
| I feel the timing is bad for this release especially when
| deepseek R1 is still peaking. People will compare and might get
| disappointed with this model.
| GaggiX wrote:
| The model looks quite a bit better in the benchmarks so unless
| they overfit the model on them it would probably perform better
| than deepseek.
| WiSaGaN wrote:
| My vibe-check questions suggest otherwise. Even o3-mini-high
| is not as good as R1, even though it's faster than R1. And
| considering o3-mini is more expensive per token, it's not
| clear o3-mini-high is cheaper than R1 either, even if R1
| probably consumes more tokens per answer.
| kandesbunzler wrote:
| well in my anecdotal tests, o3 mini (free) performed better
| than r1
| GaggiX wrote:
| Also in my coding testing o3 mini (free) is better than
| r1.
| WiSaGaN wrote:
| I did math tests. Probably you did coding.
| kandesbunzler wrote:
| I compared free o3 mini vs Deepseek R1 (on their website) and
| in my tests o3 performed better every time (did some coding
| tests)
| IMTDb wrote:
| I really don't get the point of those oX-mini models for chat
| apps. (API is different: we can benchmark multiple models for a
| given recurring task and choose the best one taking costs into
| consideration.) As part of my job, I am trying to promote usage
| of AI in my company (~150 FTE); we have an OpenAI ChatGPT Plus
| subscription for all employees.
|
| Roughly speaking the message is: "use GPT-4o all the time; use o1
| (soon o3) if you have more complex tasks". What am I supposed to
| answer when people ask "when am I supposed to use o3-mini? And
| what the heck is o3-mini-high, how do I know when to use it?"
| People aren't gonna ask the same question to 5 different models
| and burn all their rate limits; yet it feels like that's what
| OpenAI is hoping people will do.
|
| Put those weird models in a sub-menu for advanced users if you
| really want to, but if you can use o1 there is probably no reason
| for you to have o3-mini _and_ o3-mini-high as additional options.
| oezi wrote:
| Why not promote o1? 4o is rather sloppy in comparison
| IMTDb wrote:
| 99% of what people use ChatGPT for is very mundane stuff.
| Think "translate this email to English", "fix spelling
| mistakes", "write this better for me". Data extraction (lists
| of emails) is big as well. You don't need o1 for that, and
| people make a lot of those requests per day.
|
| Additionally, o1 does not have access to search or
| multimodality, and taking a screenshot of something and asking
| questions about it is also a big use case.
|
| It's easy to overlook how widely ChatGPT is used for _very_
| small stuff. But compounded it's still a game changer for
| many people.
| xinayder wrote:
| "oh no DeepSeek copied our product it's not fair"
|
| > proceeds to release a product based on DeepSeek
|
| ah, alas the hypocrisy...
| feznyng wrote:
| o3 was announced in December. R1 arguably builds off the
| rumored approach of o1 (LLM + RL) although with major
| efficiency gains. I'm not a big fan of OpenAI but it's the
| other way around.
| Rooster61 wrote:
| The thing they previewed back in December before the whole
| Deepseek kerfuffle this week?
|
| Don't get me wrong, I'm laughing at OpenAI just like everyone
| else, but if they were really copying Deepseek, they'd be
| releasing a smaller model distilled from Deepseek API
| responses, and have it be open source to boot. This is neither.
| yapyap wrote:
| They sure scrambled something together after DeepSeek swept the
| market.
| GoatInGrey wrote:
| Indeed. Everyone knows that one can cobble together a frontier
| model and deploy it within three weeks.
| TechDebtDevin wrote:
| Not to mention the model has been available to researchers
| for a month.
| mhb wrote:
| Maybe they can get some advice from the AWS instance naming
| group.
| og_kalu wrote:
| R1 seems to be the only one of these reasoning models that
| has shown gains on the creative writing side.
| nimithryn wrote:
| Am I the only one who thinks that R1 is _awful_ at creative
| writing? I've seen a lot of very credulous posts on twitter
| that are super excited about excerpts written by DeepSeek that
| I think are absolutely abysmal. Am I alone in this? Maybe
| people have very different tastes than I do?
|
| (I have no formal training in creative writing, though I do
| read a lot of literature. Not claiming my tastes are superior -
| genuinely curious if other people disagree).
| og_kalu wrote:
| I mean, do you think this is awful?
|
| https://pastebin.com/Ja14mt6L
| throwaway314155 wrote:
| Typical OpenAI release announcement where it turns out they're
| _actually_ doing some sort of delayed rollout and despite what
| the announcement says, no - you can't use o3-mini today.
| feverzsj wrote:
| It's been a dead end for a while now, as they can't improve o1
| meaningfully anymore. The market is also losing patience quickly.
| czk wrote:
| I'm just glad it looks like o3-mini finally has internet access.
|
| The o1 models were already so niche that I never used them, but
| not being able to search the web made them even more useless.
| oytis wrote:
| Let me guess - everyone is mindblown.
| estsauver wrote:
| I couldn't find anything in the documentation that describes the
| relative number of tokens that you get for low/medium/high. If
| anyone can find that, I'd be curious to see how it plays out
| relative to DeepSeek's thinking sections.
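|
| On the API side the knob appears to be a single parameter; a
| minimal sketch, assuming the documented "reasoning_effort"
| option and usage fields:
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="o3-mini",
|         reasoning_effort="high",  # "low" | "medium" | "high"
|         messages=[{"role": "user", "content": "Prove sqrt(2) is irrational."}],
|     )
|     # hidden reasoning tokens are still billed as output tokens
|     print(resp.usage.completion_tokens_details.reasoning_tokens)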
| isusmelj wrote:
| Does anyone know why GPT-4 has knowledge cutoff December 2023 and
| all the other models (newer ones like 4o, o1, o3) seem to have
| knowledge cutoff October 2023?
| https://platform.openai.com/docs/models#o3-mini
|
| I understand that keeping the same data and curating it might be
| beneficial. But it sounds odd to roll back in time with the
| knowledge cutoff. AFAIK, the only event that happened around that
| time was the start of the Gaza conflict.
| kikki wrote:
| I think trained knowledge is less and less important - as these
| multi-modal models have the ability to search the web and have
| much larger context windows.
| andrewstuart wrote:
| I find Claude to be vastly better than any OpenAI model as a
| programming assistant.
|
| In particular the "reasoning" models just seem to be less good
| and more slow.
| chad1n wrote:
| I think that OpenAI should reduce the prices even further to be
| competitive with Qwen or Deepseek. There are a lot of vendors
| offering Deepseek R1 for $2-2.5 per million output tokens.
| othello wrote:
| Would you have specific recommendations of such vendors?
| chad1n wrote:
| For example, `https://deepinfra.com/`, which asks for $2.5 per
| million output tokens, or https://nebius.com, which asks for
| $2.4 per million output tokens.
| BoorishBears wrote:
| As the sibling comment mentions, you're not getting
| anything production grade for less than $7 per million and
| that's on _input and output_.
|
| Nebius is single digit TPS. _31 seconds_ to reply to
| "What's 1+1".
|
| Hopefully Deepseek will make it out of their current
| situation because in a very ironic way, the thing the
| entire market lost its mind over is not actually usable at
| the pricing that drove the hype:
| https://openrouter.ai/deepseek/deepseek-r1
| druskacik wrote:
| Well, it's $2.19 per million output tokens even directly on the
| DeepSeek platform.
|
| https://api-docs.deepseek.com/quick_start/pricing/
| BoorishBears wrote:
| Their API platform has been down for 48 hours at this point
| rsanek wrote:
| If you want reliable service you're going to pay more, around
| $7-8 per million tokens. Sister commenters mention providers
| that are considered unstable
| https://openrouter.ai/deepseek/deepseek-r1
| secondcoming wrote:
| Anyone else stuck in a Cloudflare 'verify you're a human' doom
| loop?
| tempeler wrote:
| They made a discount; very impressive; they probably found a
| very efficient way, so it's discounted. I guess there's no need
| to build a very large nuclear power plant or a $9 trillion chip
| factory to run a single large language model. Efficiency has
| skyrocketed, or, thanks to competition, all of OpenAI's problems
| were suddenly solved.
| jen729w wrote:
| > Testers preferred o3-mini's responses to o1-mini 56% of the
| time
|
| I hope by this they don't mean me, when I'm asked 'which of these
| two responses do you prefer'.
|
| They're both 2,000 words, and I asked a question because I have
| something to do. _I'm not reading them both_; I'm usually just
| selecting the one that answered first.
|
| That prompt is pointless. Perhaps as evidenced by the essentially
| 50/50 preference split: it's a coin-flip.
| danielmarkbruce wrote:
| RLUHF, U = useless.
| brookst wrote:
| Those prompts are so irritating and so frequent that I've taken
| to just quickly picking whichever one looks worse at a cursory
| glance. I'm paying them, they shouldn't expect high quality
| work from me.
| apparent wrote:
| Have you considered the possibility that your feedback is
| used to choose what type of response to give to you
| specifically in the future?
|
| I would not consider purposely giving inaccurate feedback for
| this reason alone.
| isaacremuant wrote:
| Alternatively, I'll use the tool that is most user friendly
| and provides the most value for my money.
|
| Wasting time on an anti-pattern is not value, nor is
| trying to outguess the way that selection mechanism is
| used.
| MattDaEskimo wrote:
| I don't want a model that's customized to my preferences.
| My preferences and understanding change all the time.
|
| I want a single source model that's grounded in base truth.
| I'll let the model know how to structure it in my prompt.
| szundi wrote:
| Constant meh and fixing prompts in the right direction vs.
| being unable to escape the bubble
| kenjackson wrote:
| You know there's no such thing as base truth here? You want to
| write something like this to start your prompts: "Respond
| in English, using standard capitalization and
| punctuation, following rules of grammar as written by
| Strunk & White, where numbers are represented using
| Arabic numerals in base 10 notation...."???
| AutistiCoder wrote:
| actually, I might appreciate that.
|
| I like precision of language, so maybe just have a system
| prompt that says "use precise language (ex: no symbolism
| of any kind)"
| MattGaiser wrote:
| A lot of preferences have nothing to do with any truth.
| Do you like code segments or full code? Do you like
| paragraphs or bullet points? Heck, do you want English or
| Japanese?
| orbital-decay wrote:
| What is base truth for e.g. creative writing?
| francis_lewis wrote:
| I think my awareness that this may influence future
| responses has actually been detrimental to my response
| rate. The responses are often so similar that I can imagine
| preferring either in specific circumstances. While I'm sure
| that can be guided by the prompt, I'm often hesitant to
| click on a specific response as I can see the value of the
| other response in a different situation and I don't want to
| bias the future responses. Maybe with more specific
| prompting this wouldn't be such an issue, or maybe more of
| an understanding of how inter-chat personalisation is
| applied (maybe I'm missing some information on this too).
| Der_Einzige wrote:
| Spotted the pissed off OpenAI RLHF engineer! Hahahahaha!
| Tenoke wrote:
| That's such a counter-productive and frankly dumb thing to
| do. Just don't vote on them.
| explain wrote:
| You have to pick one to continue the chat.
| apparent wrote:
| Why not always pick the one on the left, for example? I
| understand wanting to speed through and not spend time
| doing labor for OpenAI, but it seems counter-productive
| to spend any time feeding it false information.
| brookst wrote:
| My assumption is they measure the quality of user
| feedback, either on a per user basis or in an aggregate.
| I want them to interrupt me less, so I want them to
| either decide I'm a bad teacher or that users in general
| are bad teachers.
| DiggyJohnson wrote:
| I know for a fact that as of yesterday I did not have to
| pick one to continue the conversation. It just maximized
| the second choice and displayed a 2/2 below the response.
| jackbrookes wrote:
| Yes I'd bet most users just 50/50 it, which actually makes it
| more remarkable that there was a 56% selection rate
| cgriswald wrote:
| I read the one on the left but choose the shorter one.
|
| The interface wastes so much screen real estate already and
| the answers are usually overly verbose unless I've given
| explicit instructions on how to answer.
| ljm wrote:
| The default level of verbosity you get without explicitly
| prompting for it to be succinct makes me think there's an
| office full of workers getting paid by the token.
| internetter wrote:
| In my experience the verbosity significantly improves
| output quality
| johnneville wrote:
| they also pay contractors to do these evaluations with much
| more detailed metrics; no idea which their number is based on,
| though.
| dkjaudyeqooe wrote:
| It's kind of strange that they gave that stat. Maybe they
| thought people would somehow think about "56% better" or
| something.
|
| Because when you think about it, it really is quite damning.
| Minus statistical noise it's no better.
| fsndz wrote:
| exactly I was surprised as well
| Powdering7082 wrote:
| That would be 12%, why would you assume that is eaten by
| statistical noise?
| senorrib wrote:
| The OP's comment is probably a testament to that. With such
| a poorly designed A/B test I doubt this has a p-value of <
| 0.10.
| throwaway287391 wrote:
| Erm, why not? A 0.56 result with n=1000 ratings is
| statistically significantly better than 0.5 with a
| p-value of 0.00001864, well beyond any standard
| statistical significance threshold I've ever heard of. I
| don't know how many ratings they collected but 1000
| doesn't seem crazy at all. Assuming of course that raters
| are blind to which model is which and the order of the 2
| responses is randomized with every rating -- or, is that
| what you meant by "poorly designed"? If so, where do they
| indicate they failed to randomize/blind the raters?
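|
| For anyone who wants to sanity-check figures like this, a
| minimal sketch (assuming an exact one-sided binomial test; the
| precise p-value depends on which test you pick):
|
|     from scipy.stats import binomtest
|
|     # Is 560 preferences out of 1000 significantly above the
|     # 0.5 coin-flip baseline?
|     result = binomtest(560, n=1000, p=0.5, alternative="greater")
|     print(result.pvalue)  # on the order of 1e-4, far below 0.05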
| n2d4 wrote:
| Because you're not testing "will a user click the left or
| right button" (for which asking a thousand users to click
| a button would be a pretty good estimation), you're
| testing "which response is preferred".
|
| If 10% of people just click based on how fast the
| response was because they don't want to read both
| outputs, your p-value for the latter hypothesis will be
| atrocious, no matter how large the sample is.
| johnmaguire wrote:
| > If 10% of people just click based on how fast the
| response was
|
| Couldn't this be considered a form of preference?
|
| Whether it's the type of preference OpenAI was testing
| for, or the type of preference you care about, is another
| matter.
| n2d4 wrote:
| Sure, it could be, you can define "preference" as
| basically anything, but it just loses its meaning if you
| do that. I think most people would think "56% prefer this
| product" means "when well-informed, 56% of users would
| rather have this product than the other".
| throwaway287391 wrote:
| Yes, I am assuming they evaluated the models in good
| faith, understand how to design a basic user study, and
| therefore when they ran a study intended to compare the
| response quality between two different models, they
| showed the raters both fully-formed responses at the same
| time, regardless of the actual latency of each model.
| n2d4 wrote:
| I would recommend you read the comment that started this
| thread then, because that's the context we're talking
| about: https://news.ycombinator.com/item?id=42891294
| throwaway287391 wrote:
| I did read that comment. I don't think that person is
| saying they were part of the study that OpenAI used to
| evaluate the models. They would probably know if they had
| gotten paid to evaluate LLM responses.
|
| But I'm glad you pointed that out, I now suspect that is
| responsible for a large part of the disagreement between
| "huh? a statistically significant blind evaluation is a
| statistically significant blind evaluation" vs "oh, this
| was obviously a terrible study" repliers is due to
| different interpretations of that post. Thanks. I
| genuinely didn't consider the alternative interpretation
| before.
| godelski wrote:
| > If so, where do they indicate they failed to
| randomize/blind the raters?
|
| "Win rate if user is under time constraint"
|
| This is hard to read tbh. Is it STEM? Non-STEM? If it is
| STEM then this shows there is a bias. If it is Non-STEM
| then this shows a bias. If it is a mix, well we can't
| know anything without understanding the split.
|
| Note that Non-STEM is still within error. STEM is less
| than 2 sigma variance, so our confidence still shouldn't
| be that high.
| aqme28 wrote:
| They even include error bars. It doesn't seem to be
| statistical noise, but it's still not great.
| afro88 wrote:
| Yeah. I immediately thought: I wonder if that 56% is in one
| or two categories and the rest are worse?
| rvnx wrote:
| 44% of people prefer the existing model?
| KHRZ wrote:
| With many people too lazy to read 2 walls of text, a lot
| of picks might be random.
| afro88 wrote:
| Each question falls into a different category (ie math,
| coding, story writing etc). Typically models are better
| at some categories and worse at others. Saying "56% of
| people preferred responses from o3-mini" makes me wonder
| if those 56 are only from certain categories and the
| model isn't uniformly 56% preferred.
| cm2187 wrote:
| And another way to rephrase it is that almost half of the
| users prefer the older model, which is terrible PR.
| tgsovlerkhgsel wrote:
| Not if the goal is to claim that the models deliver
| comparable quality, but with the new one excelling at
| something else (here: inference cost).
| kettleballroll wrote:
| Typically in these tests you have three options: "A is
| better", "B is better", or "they're equal/can't decide". So
| if 56% prefer o3-mini, it's likely that way less than half
| prefer o1. Also, the way I understand it, they're comparing
| a mini model with a large one.
| directevolve wrote:
| If you use ChatGPT, it sometimes gives you two versions
| of its response, and you have to choose one or the other
| if you want to continue prompting. Sure, not picking a
| response might be a third category. But if that's how
| they were approaching the analysis, they could have put
| out a more favorable-looking stat.
| ignoramous wrote:
| > _If you use ChatGPT, it sometimes gives you two
| versions_
|
| Does no one else hate it when this happens (especially
| when on a handheld device)?
| m3kw9 wrote:
| It's 3x cheaper and faster
| mikeInAlaska wrote:
| Maybe we should take both answers, paste them into a new chat
| and ask for a summary amalgamation of them
| sharkweek wrote:
| Funny - I had ChatGPT document some stuff for me this week and
| asked which responses I preferred as well.
|
| Didn't bother reading either of them, just selected one and
| went on with my day.
|
| If it were me I would have set up a "hey do you mind if we give
| you two results and you can pick your favorite?" prompt to weed
| out people like me.
| usef- wrote:
| I'm surprised how many people claim to do this. You can just
| not select one.
| rubyn00bie wrote:
| I think it's somewhat natural and am not personally
| surprised. It's easy to quickly select an option that has
| no consequence, compared to actively considering that not
| selecting something is an option. Not selecting something
| feels more like actively participating than just checking a
| box and moving on. /shrug
| ssl-3 wrote:
| We -- the people who live in front of a computer -- have
| been training ourselves to avoid noticing annoyances like
| captchas, advertising, and GDPR notices for quite a long
| time.
|
| We find what appears to be the easiest combination of "Fuck
| off, go away" buttons and use them without a moment of
| actual consideration.
|
| (This doesn't mean that it's actually the easiest method.)
| grahamj wrote:
| I can't even believe how many times in a day I
| frustratedly think "whatever, go away!"
| apparent wrote:
| I wonder if they down-weight responses that come in too fast
| to be meaningful, or without sufficient scrolling.
| losteric wrote:
| That's fine. Your random click would be balanced by someone
| else randomly clicking
| danilocesar wrote:
| I almost always pick the second one, because it's closer to the
| submit button and the one I read first.
| arijo wrote:
| People could be flipping a coin and the score would be the
| same.
| brianstrimp wrote:
| A 12% margin is literally the opposite of a coin flip. Unless
| you have a really bad coin.
| buggy6257 wrote:
| You're being downvoted for 3 reasons:
|
| 1) Coming off as a jerk, and from a new account is a bad
| look
|
| 2) "Literally the opposite of a coin flip" would probably
| be either 0% or 100%
|
| 3) Your reasoning doesn't stand up without further info; it
| entirely depends on the sample size. I could have 5 coin
| flips all come up heads, but over thousands or millions it
| averages to 50%. 56% on a small sample size is absolutely
| within margin of error/noise. 56% on a MASSIVE sample size
| is _statistically_ significant, but isn't even still that
| much to brag about for something that I feel like they
| probably intended to be a big step forward.
| brianstrimp wrote:
| I'm a little puzzled by your response.
|
| 1. The message was net-upvoted. Whether there are
| downvotes in there I can't tell, but the final karma is
| positive. A similarly spirited message of mine in the
| same thread was quite well received as well.
|
| 2. I can't see how my message would come across as a
| jerk? I wrote 2 simple sentences, not using any offensive
| language, stating a mere fact of statistics. Is that
| being a jerk? And a long-winded berating of a new member of
| the community isn't?
|
| 3. A coin flip is 50%. Anything else is not, once you
| have a certain sample size. So, this was not. That was my
| statement. I don't know why you are building a strawman
| of 5 coin flips. 56% vs 44% is a margin of 12%, as I
| stated, and with a huge sample size, which they had,
| that's _massive_ in a space where the returns are deep in
| "diminishing" territory.
| teeray wrote:
| This prompt is like "See Attendant" on the gas pump. I'm just
| going to use another AI instead for this chat.
| ninkendo wrote:
| Glad to know I'm not the only person who just drives to the
| next station when I see a "see attendant" message.
| janalsncm wrote:
| > I'm usually just selecting the one that answered first
|
| Which is why you randomize the order. You aren't a tester.
|
| 56% vs 44% may not be noise. That's why we have p values. It
| depends on sample size.
| jhardy54 wrote:
| The order doesn't matter. They often generate tokens at
| different speeds, and produce different lengths of text. "The
| one that answered first" != "The first option"
| letmevoteplease wrote:
| The article says "expert testers."
|
| "Evaluations by expert testers showed that o3-mini produces
| more accurate and clearer answers, with stronger reasoning
| abilities, than OpenAI o1-mini. Testers preferred o3-mini's
| responses to o1-mini 56% of the time and observed a 39%
| reduction in major errors on difficult real-world questions. W"
| resters wrote:
| I too have questioned the approach of showing the long side-by-
| side answers from two different models.
|
| 1) sometimes I wanted the short answer, and so even though the
| long answer is better I picked the short one.
|
| 2) sometimes both contain code that is different enough that I
| am inclined to go with the one that is more similar to what I
| already had, even if the other approach seems a bit more solid.
|
| 3) Sometimes one will have less detail but more big picture
| awareness and the other will have excellent detail but miss
| some overarching point that is valuable. Depending on my mood I
| sometimes choose but it is annoying to have to do so because I
| am not allowed to say why I made the choice.
|
| The area of human training methodology seems to be a big part
| of what got Deepseek's model so strong. I read the explanation
| of the test results as an acknowledgement by OpenAI of some
| weaknesses in its human feedback paradigm.
|
| IMO the way it should work is that the thumbs up or down should
| be read in context by a reasoning being and a more in-depth
| training case should be developed that helps future models
| learn whatever insight the feedback should have triggered.
|
| Feedback that A is better or worse than B is definitely not (in
| my view) sufficient except in cases where a response is a total
| dud. Usually the responses have different strengths and
| weaknesses and it's pretty subjective which one is better.
| brianstrimp wrote:
| That makes the result stronger though. _Even though_ many
| people click randomly, there is _still_ a 12% margin between
| both groups. Not earth-shattering, but still quite a lot.
| dionian wrote:
| I enjoy it; I like getting two answers for free - often one of
| them is significantly better, and probably the newer model.
| mullingitover wrote:
| You know you can configure default instructions to your
| prompts, right?
|
| I have something like "always be terse and blunt with your
| answers."
| gcanyon wrote:
| I don't think they make it clear: I wonder if they mean testers
| prefer o3 mini 56% of the time _when they express an opinion_ ,
| or overall? Some percentage of people don't choose; if that
| number is 10% and they aren't excluded, that means 56% of the
| time people prefer o3 mini, 34% of the time people prefer o1
| mini, and 10% of the time people don't choose. I'm not sure I
| think it would be reasonable to present the data that way, but
| it seems possible.
| ricardobeat wrote:
| This is just a way to prove, statistically, that one model is
| better than another as part of its validation. It's not
| collected from normal people using ChatGPT; you don't ever get
| shown two responses from different models at once.
| yawnxyz wrote:
| Wait what? I get shown this with ChatGPT maybe 5% of the time
| nearbuy wrote:
| Those are both responses from the same model. It's not one
| response from o1 and another from o3.
| directevolve wrote:
| Also, it's not clear if the preference comes from the quality
| of the 'meat' of the answer, or the way it reports its thinking
| and the speed with which it responds. With o1, I get a marked
| feeling of impatience waiting for it to spit something out, and
| the 'progress of thought' is in faint grey text I can't read.
| With o3, the 'progress of thought' comes quickly, with more to
| read, and is more engaging even if I don't actually get
| anything more than entertainment value.
|
| I'm not going to say there's nothing substantive about o3 vs.
| o1, but I absolutely do not put it past Sam Altman to juice the
| stats every chance he gets.
| energy123 wrote:
| Then 56% is even more impressive. Example: if 80% choose
| randomly and 20% choose carefully, that implies an 80%
| preference rate for o3-mini (0.8*0.2 + 0.5*0.8 = 0.56)
| shombaboor wrote:
| It seems like the first response must get chosen a majority
| of the time just to account for friction.
| EcommerceFlow wrote:
| First thing I noticed on API and Chat for it is THIS THING IS
| FAST. That alone makes it a huge upgrade to o1-pro (not really
| comparable I know, just saying). Can't imagine how much I'll get
| done with this type of speed.
| GaggiX wrote:
| The API pricing is almost exactly double the deepseek ones.
| cheema33 wrote:
| I like deepseek a lot. But they are currently very glitchy. The
| API service goes up and down a lot. Maybe they'll sort that out
| soon.
| orbital-decay wrote:
| Apparently they're under a very targeted DDoS for almost a
| month, with technical details shared in Chinese but very
| little discussion in English. Which is surprising, it's not
| like major AI products are getting DDoSed out of existence
| every day.
| nmfisher wrote:
| Where are the details in Chinese?
| bearjaws wrote:
| Almost all of thm are protected by cloudflare if you look.
|
| My guess is Deepseek didn't implement anti-DDOS until way
| too late.
| mise_en_place wrote:
| Too little too late IMO. This is not impressive at all, what am I
| missing here?
| ben_w wrote:
| There's only two kinds of software, prototype and obsolete.
|
| I was taught that last millennium.
| esafak wrote:
| That's not true. Is Google Maps a prototype or obsolete?
| ben_w wrote:
| The website or the database? I'd say the former is obsolete
| and the latter is still a prototype.
| esafak wrote:
| I understand that SaaS products are constantly evolving
| but this is an unusual definition of obsolescence and
| prototypes. Google Maps has been running like a tank for
| two decades, and it is pretty feature complete.
| jstummbillig wrote:
| Idk, everything: The price point + performance?
| sumedh wrote:
| > This is not impressive at all, what am I missing here?
|
| Compared to?
| RobinL wrote:
| Wow - this is seriously fast (o3-mini), and my initial
| impressions are very favourable. I was asking it to lay out quite
| a complex HTML form from a schema and it did a very good job.
|
| Looking at the comments on here and the benchmark results I was
| expecting it to be a bit meh, but initial impressions are quite
| the opposite
|
| I was expecting it to perhaps be a marginal improvement for
| complex things that need a lot of 'reasoning', but it seems it's
| a big improvement for simple things that you need doing fast.
| bn-l wrote:
| It's 2x the price of R1:
| https://x.com/deedydas/status/1885440582103031940/photo/1
|
| Is it twice as good though?
| RobinL wrote:
| Whilst I had tried R1 before, I hadn't paid attention to how
| fast it was. I just tried some similar prompts and was pretty
| impressed with speed and quality. I think o3-mini was still a
| bit quicker though.
| AISnakeOil wrote:
| The naming convention is so messed up. o1, o3-mini (no o2, no
| o3???)
| igravious wrote:
| https://www.perplexity.ai/search/new?q=list%20of%20all%20Ope...
| :)
|
| OpenAI has developed a variety of models that cater to
| different applications, from natural language processing to
| image generation and audio processing. Here's a comprehensive
| list of the current models available:
|
| Language Models
| - GPT-4o: The flagship model capable of processing text,
|   images, and audio.
| - GPT-4o mini: A smaller, more cost-effective version of GPT-4o.
| - GPT-4: An advanced model that improves upon GPT-3.5.
| - GPT-3.5: A set of models that enhance the capabilities of GPT-3.
| - GPT-3.5 Turbo: A faster variant designed for efficiency in
|   chat applications.
|
| Reasoning Models
| - o1: Focused on reasoning tasks with improved accuracy.
| - o1-mini: A lightweight version of the o1 model.
| - o3: The successor to o1, currently in testing phases.
| - o3-mini: A lighter version of the o3 model.
|
| Audio Models
| - GPT-4o audio: Supports real-time audio interactions and
|   audio generation.
| - Whisper: For transcribing and translating speech to text.
|
| Image Models
| - DALL-E: Generates images from textual descriptions.
|
| Embedding Models
| - Embeddings: Converts text into numerical vectors for
|   similarity tasks.
| - Ada: An embedding model with various sizes (e.g., ada-002).
|
| Additional Models
| - Text to Speech (Preview): Synthesizes spoken audio from text.
|
| These models are designed for various tasks, including coding
| assistance, image generation, and conversational AI, making
| OpenAI's offerings versatile for developers and businesses
| alike[1][2][4][5].
|
| Citations:
| [1] https://learn.microsoft.com/vi-vn/azure/ai-services/openai/concepts/models
| [2] https://platform.openai.com/docs/models
| [3] https://llm.datasette.io/en/stable/openai-models.html
| [4] https://en.wikipedia.org/wiki/OpenAI_API
| [5] https://industrywired.com/open-ai-models-list-top-models-to-consider/
| [6] https://holypython.com/python-api-tutorial/listing-all-available-openai-models-openai-api/
| [7] https://en.wikipedia.org/wiki/GPT-3
| [8] https://stackoverflow.com/questions/78122648/openai-api-how-do-i-get-a-list-of-all-available-openai-models/78122662
| ben_w wrote:
| There's an o1-mini, and there's an o3; it just hasn't gone live
| yet: https://openai.com/12-days/#day-12
|
| They can't call it o2 because:
| https://en.wikipedia.org/wiki/The_O2_Arena
|
| and the venue's sponsor: https://en.wikipedia.org/wiki/O2_(UK)
| sumedh wrote:
| o3 will come later.
|
| o2 was not selected because there is already another brand with
| that name in the UK.
| thimabi wrote:
| Does anyone know the current usage limits for o3-mini and
| o3-mini-high when used through the ChatGPT interface? I tried to
| find them on the OpenAI Knowledgebase, but couldn't find anything
| about that.
| keenmaster wrote:
| For Plus users the limits are:
|
| o3-mini-high: 50 messages per week (just like o1, but it seems
| like these are non-shared limits, so you can have 50 messages
| per week with o1, run out, and still have 50 messages with
| o3-mini-high to use)
|
| o3-mini: 150 messages per day
|
| Source for the latter is their press release. They were more
| vague about o3-mini-high, but people have already tested its
| limits just by using it, and got the pop-up for 25 messages
| left after sending 25 messages.
|
| It's nice not to worry about running out of o1 messages now and
| have a faster model that's mostly as good (potentially better
| in some areas?). OpenAI really needs to release a middle tier
| for $30 to $40, though, that has the same models as Pro but
| without unlimited usage. I hate not having the smartest model
| and I don't want to pay $200; there's probably a middle ground
| where they can make as much or more money from me on a
| subscription tier that gives limited access to o1-pro.
| scarface_74 wrote:
| This took 1:53 in o3-mini
|
| https://chatgpt.com/share/679d310d-6064-8010-ba78-6bd5ed3360...
|
| The 4o model without using the Python tool
|
| https://chatgpt.com/share/679d32bd-9ba8-8010-8f75-2f26a792e0...
|
| Trying to get accurate results with the paid version of 4o with
| the Python interpreter.
|
| https://chatgpt.com/share/679d31f3-21d4-8010-9932-7ecadd0b87...
|
| The share link doesn't show the output for some reason. But it
| did work correctly. I don't know whether the ages are correct. I
| was testing whether it could handle ordering.
|
| I have no idea what conclusion I should draw from this besides
| depending on the use case, 4o may be better with "tools" if you
| know your domain where you are using it.
|
| Tools are relatively easy to implement with LangChain or the
| native OpenAI SDK.
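| For reference, a minimal tool definition with the native OpenAI
| SDK looks roughly like this (a sketch; the get_president_age
| tool and its schema are made up for illustration):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     # Hypothetical tool: the model asks for it, we run it and
|     # feed the result back in a follow-up message.
|     tools = [{
|         "type": "function",
|         "function": {
|             "name": "get_president_age",
|             "description": "Age of a US president at inauguration",
|             "parameters": {
|                 "type": "object",
|                 "properties": {"name": {"type": "string"}},
|                 "required": ["name"],
|             },
|         },
|     }]
|
|     response = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[{"role": "user",
|                    "content": "How old was Lincoln when inaugurated?"}],
|         tools=tools,
|     )
|     print(response.choices[0].message.tool_calls)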
| margalabargala wrote:
| The 4o model's output is blatantly wrong. I'm not going to look
| up if it's the order or the ages that are incorrect, but:
|
| 36. Abraham Lincoln - 52 years, 20 days (1861)
|
| 37. James Garfield - 49 years, 105 days (1881)
|
| 38. Lyndon B. Johnson - 55 years, 87 days (1963)
|
| Basically everything after #15 in the list is scrambled.
| scarface_74 wrote:
| That was the point. The 4o model without using Python was
| wrong. The o3 model worked correctly without needing an
| external tool
| BeetleB wrote:
| I would not expect any LLM to get this right. I think people
| have too high expectations for it.
|
| Now if you asked it to write a Python program to list them in
| order, and have it enter all the names, birthdays, and year
| elected in a list to get the program to run - that's more
| reasonable.
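| Something like the sketch below is what I mean (a hypothetical,
| tiny data sample using only the dates quoted upthread; a real
| run would include all the presidents, and the year arithmetic
| ignores leap days):
|
|     from datetime import date
|
|     presidents = [
|         ("Abraham Lincoln", date(1809, 2, 12), date(1861, 3, 4)),
|         ("James Garfield", date(1831, 11, 19), date(1881, 3, 4)),
|         ("Lyndon B. Johnson", date(1908, 8, 27), date(1963, 11, 22)),
|     ]
|
|     def age_in_days(born, sworn_in):
|         return (sworn_in - born).days
|
|     # Sort by age at inauguration, youngest first.
|     for name, born, sworn_in in sorted(
|             presidents, key=lambda p: age_in_days(p[1], p[2])):
|         years = age_in_days(born, sworn_in) // 365  # approximate
|         print(f"{name}: ~{years} years")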
| scarface_74 wrote:
| The "o" models get the order right.
|
| DeepSeek also gets the order right.
|
| It doesn't show on the share link. But it actually outputs
| the list correctly from the built in Python interpreter.
|
| For some things, ChatGPT 4o will automatically use its Python
| runtime
| BeetleB wrote:
| That some models get it right is irrelevant. In general, if
| your instructions require _computation_, it's safer to
| assume it won't get it right and will hallucinate.
| scarface_74 wrote:
| The reasoning models all do pretty good at math.
|
| Have you tried them?
|
| This is something I threw together with o3-mini
|
| https://chatgpt.com/share/679d5305-5f04-8010-b5c4-61c31e7
| 9b2...
|
| ChatGPT 4o doesn't even try to do the math internally and
| uses its built in Python interpreter. (The [_>] link is
| to the Python code)
|
| https://chatgpt.com/share/679d54fe-0104-8010-8f1e-9796a08
| cf9...
|
| DeepSeek handles the same problem just as well using the
| reasoning technique.
|
| Of course ChatGPT 4o went completely off the rails
| without using its Python interpreter
|
| https://chatgpt.com/share/679d5692-96a0-8010-8624-b1eb091
| 270...
|
| (The breakdown that it got right was done using Python even
| though I told it not to)
| simonw wrote:
| I just pushed a new release of my LLM CLI tool with support for
| the new model and the reasoning_effort option:
| https://llm.datasette.io/en/stable/changelog.html#v0-21
|
| Example usage:
|
|     llm -m o3-mini 'write a poem about a pirate and a walrus' \
|       -o reasoning_effort high
|
| Output (comparing that with the default reasoning effort):
| https://github.com/simonw/llm/issues/728#issuecomment-262832...
|
| (If anyone has a better demo prompt I'd love to hear about it)
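| For anyone who wants the same option outside the CLI, a minimal
| sketch with the OpenAI Python SDK (assumes the reasoning_effort
| parameter as documented for o3-mini):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     response = client.chat.completions.create(
|         model="o3-mini",
|         reasoning_effort="high",  # "low" | "medium" | "high"
|         messages=[{"role": "user",
|                    "content": "write a poem about a pirate and a walrus"}],
|     )
|     print(response.choices[0].message.content)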
| beklein wrote:
| Thank you for all the effort you put into this tool and keeping
| it up to date!
| theturtle32 wrote:
| A reasoning model is not meant for writing poetry. It's not
| very useful to evaluate it on such tasks.
| mediaman wrote:
| It's not clear that writing poetry is a bad use case.
| Reasoning models seem to actually do pretty well with
| creative writing and poetry. Deepseek's R1, for example, has
| much better poem structure than the underlying V3, and
| writers are saying R1 was the first model where they actually
| felt like it was a useful writing companion. R1 seems to
| think at length about word choice, correcting structure,
| pentameter, and so on.
| 1propionyl wrote:
| Indeed. I would assume that a reasoning model would do far
| better at things like actually maintaining meter or rhyme
| scheme, something that models (even with good attention
| mechanisms) generally do very poorly at.
| mquander wrote:
| I tried to tell my English teachers that all through high
| school but it never worked.
| aprilthird2021 wrote:
| To be blunt, an AI isn't a good tool for writing poetry
| either. At least, not the kind people read as a high
| literature form. For commercials, jingles, Hallmark cards,
| etc. sure
| DonHopkins wrote:
| There exists poetry that requires a lot of mathematical
| understanding! This is "literally" (and I mean literally in
| the literary sense) from a Stanislaw Lem story about an
| electronic bard, translated from Polish by Michael Kandel:
|
| https://www.donhopkins.com/home/catalog/lem/WonderfulPoems.h.
| ..
|
| Prompt:
|
| A love poem, lyrical, pastoral, and expressed in the language
| of pure mathematics. Tensor algebra mainly, with a little
| topology and higher calculus, if need be. But with feeling,
| you understand, and in the cybernetic spirit.
|
| Response:
|
|     Come, let us hasten to a higher plane,
|     Where dyads tread the fairy fields of Venn,
|     Their indices bedecked from one to n,
|     Commingled in an endless Markov chain!
|
|     Come, every frustum longs to be a cone,
|     And every vector dreams of matrices.
|     Hark to the gentle gradient of the breeze:
|     It whispers of a more ergodic zone.
|
|     In Riemann, Hilbert or in Banach space
|     Let superscripts and subscripts go their ways.
|     Our asymptotes no longer out of phase,
|     We shall encounter, counting, face to face.
|
|     I'll grant thee random access to my heart,
|     Thou'lt tell me all the constants of thy love;
|     And so we two shall all love's lemmas prove,
|     And in our bound partition never part.
|
|     For what did Cauchy know, or Christoffel,
|     Or Fourier, or any Boole or Euler,
|     Wielding their compasses, their pens and rulers,
|     Of thy supernal sinusoidal spell?
|
|     Cancel me not -- for what then shall remain?
|     Abscissas, some mantissas, modules, modes,
|     A root or two, a torus and a node:
|     The inverse of my verse, a null domain.
|
|     Ellipse of bliss, converse, O lips divine!
|     The product of our scalars is defined!
|     Cyberiad draws nigh, and the skew mind
|     Cuts capers like a happy haversine.
|
|     I see the eigenvalue in thine eye,
|     I hear the tender tensor in thy sigh.
|     Bernoulli would have been content to die,
|     Had he but known such a squared cosine 2 phi!
|
| From The Cyberiad, by Stanislaw Lem.
|
| Translated from Polish by Michael Kandel.
|
| Here's a previous discussion of Marcin Wichary's translation
| of one of Lem's stories from Polish to English. He created
| the Lem Google Doodle, and he stalked and met Stanislaw Lem
| when he was a boy. Plus a discussion of Michael Kandel's
| translation of the poetry of the Electric Bard from The First
| Sally of Cyberiad, comparing it to machine translation:
|
| https://news.ycombinator.com/item?id=28600200
|
| Stanislaw Lem has finally gotten the translations his genius
| deserves:
|
| https://www.washingtonpost.com/entertainment/books/stanislaw.
| ..
|
| >Lem's fiction is filled with haunting, prescient landscapes.
| In these reissued and newly issued translations -- some by
| the pitch-perfect Lem-o-phile, Michael Kandel -- each
| sentence is as hard, gleaming and unpredictable as the next
| marvelous invention or plot twist. It's hard to keep up with
| Lem's hyper-drive of an imagination but always fun to try.
| hybrid_study wrote:
| Is this version a prank?
| RivieraKid wrote:
| No.
| simonw wrote:
| I used o3-mini to summarize this thread so far. Here's the
| result:
| https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae...
|
| For 18,936 input tokens and 2,905 output tokens, it cost
| 3.3612 cents.
|
| Here's the script I used to do it:
| https://til.simonwillison.net/llms/claude-hacker-news-themes...
| layman51 wrote:
| I noticed that it thought that GoatInGrey wrote "openai is no
| longer relevant." However, they were just quoting a different
| user (buyucu) who was the person who first wrote that.
| simonw wrote:
| Good catch. That's likely an artifact of the way I flatten
| the nested JSON from the comments API.
|
| I originally did that to save on tokens but modern models
| have much larger input windows so I may not need to do that
| any more.
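| The flattening step looks something like this (a sketch against
| the Algolia HN items API, not my exact script; the item id is a
| placeholder):
|
|     import requests
|
|     def flatten(item, depth=0, out=None):
|         # Depth-first walk of the nested comment tree; quoting
|         # context is lost because replies become flat siblings.
|         if out is None:
|             out = []
|         if item.get("text"):
|             out.append(f"[{depth}] {item.get('author')}: {item['text']}")
|         for child in item.get("children", []):
|             flatten(child, depth + 1, out)
|         return out
|
|     thread = requests.get(
|         "https://hn.algolia.com/api/v1/items/123456"  # placeholder id
|     ).json()
|     print("\n".join(flatten(thread)))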
| threecheese wrote:
| I haven't tried o3, but one issue I struggle with in large
| context analysis tasks is the LLMs are never thorough. In a
| task like this thread summarization, I typically need to break
| the document down and loop through chunks to ensure it actually
| "reads" everything. I might have had to recurse into individual
| conversations with some small max-depth and leaf count and run
| inference on each, and then have some aggregation at the end,
| otherwise it would miss a lot (or appear to, based on the
| output).
|
| Is this a case of PEBKAC?
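| For reference, the chunk-and-aggregate loop I mean is roughly
| this (a sketch; the model name and chunk size are arbitrary
| choices, not recommendations):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def summarize(text):
|         resp = client.chat.completions.create(
|             model="o3-mini",
|             messages=[{"role": "user",
|                        "content": "Summarize:\n\n" + text}],
|         )
|         return resp.choices[0].message.content
|
|     def summarize_thread(comments, chunk_size=50):
|         # Summarize fixed-size chunks, then aggregate the partial
|         # summaries so no comment is skipped for length reasons.
|         partials = [summarize("\n".join(comments[i:i + chunk_size]))
|                     for i in range(0, len(comments), chunk_size)]
|         return summarize("Combine these partial summaries:\n\n"
|                          + "\n\n".join(partials))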
| syntaxing wrote:
| Depending on what you're trying to do, it's worth trying the
| 1M-context Qwen models. They only released 7B and 14B, so their
| "intelligence" is limited, but they should be more than capable
| of producing a coherent summary.
| andrewci wrote:
| Are there any tools you use to do this chunking? Or is this a
| custom workflow? I've noticed the same thing both on
| copy/paste text and uploaded documents when using the LLM
| chat tools.
| sdesol wrote:
| > I haven't tried o3, but one issue I struggle with in large
| context analysis tasks is the LLMs are never thorough.
|
| o3 does look very promising with regards to large context
| analysis. I used the same raw data and ran the same prompt as
| Simon for GPT-4o, GPT-4o mini and DeepSeek R1 and compared
| their output. You can find the analysis below:
|
| https://beta.gitsense.com/?chat=46493969-17b2-4806-a99c-5d93.
| ..
|
| The o3-mini model was quite thorough. With reasoning models,
| it looks like dealing with long context might have gotten a
| lot better.
|
| Edit:
|
| I was curious if I could get R1 to be more thorough and got
| the following interesting tidbits.
|
| - Depth Variance: R1 analysis provides more technical
| infrastructure insights, while o3-mini focuses on developer
| experience
|
| - Geopolitical Focus: Only R1 analysis addresses China-West
| tensions explicitly
|
| - Philosophical Scope: R1 contains broader industry meta-
| commentary absent in o3-mini
|
| - Contrarian Views: o3-mini dedicates specific section to
| minority opinions
|
| - Temporal Aspects: R1 emphasizes future-looking questions,
| o3-mini focuses on current implementation
|
| You can find the full analysis at
|
| https://beta.gitsense.com/?chat=95741f4f-b11f-4f0b-8239-83c7.
| ..
| mvkel wrote:
| o1-pro is incredibly good at this. You'll be amazed
| scarface_74 wrote:
| Try Google's NotebookLM
| breakingcups wrote:
| It's definitely making some errors
| kandesbunzler wrote:
| Like?
| aprilthird2021 wrote:
| Even though it was told that it MUST quote users directly,
| it still outputs:
|
| > It's already a game changer for many people. But to have
| so many names like o1, o3-mini, GPT-4o, & GPT-4o-mini
| suggests there may be too much focus on internal tech
| details rather than clear communication." (paraphrase based
| on multiple similar sentiments)
|
| It also hallucinates quotes.
|
| For example:
|
| > "I'm pretty sure 'o3-mini' works better for that purpose
| than 'GPT 4.1.3'." - TeMPOraL
|
| But that comment is not in the user TeMPOraL's comment
| history.
|
| Sentiment analysis is also faulty.
|
| For example:
|
| > "I'd bet most users just 50/50 it, which actually makes
| it more remarkable that there was a 56% selection rate." -
| jackbrookes - This quip injects humor into an otherwise
| technical discussion about evaluation metrics.
|
| It's not a quip though. That comment was meant in earnest
| romanhn wrote:
| That's funny, the quote exists, but it got the user
| wrong.
| Eduard wrote:
| 3.3612 cents (I guess USD cents) is expensive!
| BoorishBears wrote:
| Same immediate thought: the free option I provide on my
| production site is a model that runs on 2xA40. That's 96GB of
| VRAM for 78 cents an hour serving at least 4 or 5 concurrent
| requests at any given time.
|
| O3 Mini is probably not a very large model and OpenAI has
| layers upon layers of efficiencies, so they must be making an
| absolute _killing_ charging 3.3 cents for a few seconds of
| compute
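| Back-of-the-envelope version of that claim (the concurrency and
| seconds-per-request figures are assumptions):
|
|     gpu_cost_per_hour = 0.78     # 2x A40, 96GB VRAM
|     concurrent_requests = 4      # conservative end of "4 or 5"
|     seconds_per_request = 10     # assumed "a few seconds"
|
|     requests_per_hour = concurrent_requests * 3600 / seconds_per_request
|     cost_per_request = gpu_cost_per_hour / requests_per_hour
|     print(f"${cost_per_request:.5f} per request vs $0.033 charged")
|     # ~ $0.00054, roughly a 60x markup under these assumptions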
| largbae wrote:
| Thanks for sharing this! And no apparent self-awareness! OpenAI
| has come a long way from the Sydney days:
| https://answers.microsoft.com/en-us/bing/forum/all/this-ai-c...
| Valakas_ wrote:
| For those that like simpler ways (although dependent on Google),
| NotebookLM does all that in 2 clicks. And you can ask it
| questions about it; references are provided.
| simonw wrote:
| After you've run my hn-summary.sh script you can ask follow-up
| questions like this:
|
|     llm -c "did anyone talk about pricing?"
| tkgally wrote:
| Borrowing most of Simon's prompt, I tried the following with
| o3-mini-high in the chat interface with Search turned on:
|
| "Summarize the themes of the opinions expressed in discussions
| on Hacker News on January 31 and February 1, 2025, about
| OpenAI's release od [sic] ChatGPT o3-mini. For each theme,
| output a header. Include direct "quotations" (with author
| attribution) where appropriate. You MUST quote directly from
| users when crediting them, with double quotes. Fix HTML
| entities. Go long. Include a section of quotes that illustrate
| opinions uncommon in the rest of the piece"
|
| The result is here:
|
| https://chatgpt.com/share/679d790d-df6c-8011-ad78-3695c2e254...
|
| Most of the cited quotations seem to be accurate, but at least
| one (by uncomplexity_) does not appear in the named commenter's
| comment history.
|
| I haven't attempted to judge how accurate the summary is. Since
| the discussions here are continuing at this moment, this
| summary will be gradually falling out of date in any case.
| thousand_nights wrote:
| good morning!
| jajko wrote:
| A random idea - train one of those models on _you_ , keep it
| aside, let it somehow work out your intricacies, moods, details,
| childhood memories, personality, flaws, strengths. Methods can be
| various - initial dump of social networks, personal photos and
| videos, maybe some intense conversation to grok rough you, then
| polish over time.
|
| A first step to digital immortality; could be a nice startup
| with some personalized product for the rich, and then even
| regular folks.
| Immortality not in ourselves as meat bags of course, we die
| regardless, but digital copy and memento that our children can
| use if feeling lonely and can carry with themselves anywhere, or
| later descendants out of curiosity to hold massive events like
| weddings. One could 'invite' long-lost ancestors. Maybe your
| great-grandfather would be a cool guy you could easily click
| with these days via verbal input. Heck, even a detailed 3D model.
|
| An additional service, 'perpetually' paid - keeping your data
| model safe, taking care of it, backups, heck even maybe give it a
| bit of computing power to receive current news in some light
| fashion and evolve, could be extras. Different tiers for
| different level of services and care.
|
| Or am I decade or two ahead? I can see this as universally
| interesting across many if not all cultures.
| silverlake wrote:
| O3-mini solved this prompt. DeepSeek R1 had a mental breakdown.
| The prompt: "Bob is facing forward. To his left is Ann, to his
| right is Cathy. Ann and Cathy are facing backwards. Who is on
| Ann's left?"
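| For what it's worth, the puzzle reduces to a couple of sign
| flips (a sketch; it treats "left" as relative to each person's
| facing and only looks along the line they stand on):
|
|     # Everyone stands on the x-axis; Bob faces +y, so his left
|     # is -x: Ann at -1, Bob at 0, Cathy at +1.
|     positions = {"Ann": -1, "Bob": 0, "Cathy": 1}
|     facing_y = {"Bob": +1, "Ann": -1, "Cathy": -1}
|
|     def left_neighbor(person):
|         # Facing +y puts your left at -x; facing -y flips it.
|         left_x = -facing_y[person]
|         candidates = [(abs(x - positions[person]), name)
|                       for name, x in positions.items()
|                       if name != person
|                       and (x - positions[person]) * left_x > 0]
|         return min(candidates)[1] if candidates else None
|
|     print(left_neighbor("Ann"))  # Bob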
| modeless wrote:
| R1 or R1-Distill? They are not the same thing. I think DeepSeek
| made a mistake releasing them at the same time and calling them
| all R1.
|
| Full R1 solves this prompt easily for me.
| silverlake wrote:
| I used R1 hosted at NVidia here:
| https://build.nvidia.com/deepseek-ai/deepseek-r1/modelcard
| SparkyMcUnicorn wrote:
| Same. With the recommended settings, it got it right. I
| regenerated a bunch of times, and it did suggest Cathy once
| or twice.
|
| R1 70b also got it right just as many times for me.
| modeless wrote:
| Huh, that one got it wrong for me too. I don't have
| patience to try it 10 times each to see if it was a
| coincidence, but it is absolutely true that not all
| implementations of LLMs produce the same outputs. It is in
| fact common for subtle bugs to happen that cause the
| outputs to be worse but not catastrophically bad, and
| therefore go unnoticed. So I wouldn't trust any
| implementation but the original for benchmarking or even
| general use unless I tested it extensively.
| mark_l_watson wrote:
| Thanks for that example! Using 'ollama run' these models all
| failed: deepseek-r1:32b, mistral-small:latest, qwq:latest, and
| qwen2.5:14b.
| lynguist wrote:
| Really interesting puzzle. I tried the different ChatGPTs and
| they all solved it. I tried DeepSeek and DeepSeek R1, and
| DeepSeek did have this meltdown:
|
|     Let's break down the problem step by step to understand the
|     relationships and positions of Bob, Ann, and Cathy.
|
|     1. Understanding the Initial Setup
|
|     Bob is facing forward. This means Bob's front is oriented in
|     a particular direction, which we'll consider as the
|     reference point for "forward."
|
|     To his left is Ann, to his right is Cathy. If Bob is facing
|     forward, then:
|     - Ann is positioned to Bob's left.
|     - Cathy is positioned to Bob's right.
|
|     Ann and Cathy are facing backwards. Both Ann and Cathy are
|     oriented in the opposite direction to Bob. If Bob is facing
|     forward, then Ann and Cathy are facing backward.
|
|     2. Visualizing the Positions
|
|     To better understand the scenario, let's visualize the
|     positions:
|
|     Forward Direction: |
|
|     Bob (facing forward) | | Ann (facing backward) | / | / | / |
|     / | / | / | / |/ |
|
| And then only the character | in a newline forever.
| Synaesthesia wrote:
| Deepseek solved it.
| thinkalone wrote:
| That's a fun, simple test! I tried a few models, and mistral-
| nemo gets it every time, even when run locally without any
| system prompt! https://build.nvidia.com/nv-mistralai/mistral-
| nemo-12b-instr...
| sivakon wrote:
| deepseek answered it.
|
| https://leaflet.pub/63ae4881-d726-4388-9ba3-8a5f86947443
| EternalFury wrote:
| o1-preview, o1, o1-mini, o3-mini, o3-mini (low), o3-mini
| (medium), o3-mini (high)...
|
| What's next?
|
| o4-mini (wet socks), o5-Eeny-meeny-miny-moe?
|
| I thought they had a product manager over there.
|
| They only need 2 names, right? ChatGPT and o.
|
| ChatGPT-5 and o4 would be next.
|
| This multiplication of the LLM loaves and fishes is kind of
| silly.
| mark_l_watson wrote:
| Oh, sweet: both o3-mini low and high support integrated web
| search. No integrated web search with o1.
|
| I prefer, for philosophical reasons, open weight and open
| process/science models, but OpenAI has done a very good job at
| productizing ChatGPT. I also use their 4o-mini API because it is
| cheap and compares well to using open models on Groq Cloud. I
| really love running local models with Ollama, but the API
| vendors keep the price so low that I understand most people not
| wanting the hassle of running DeepSeek-R1, etc., locally.
| swyx wrote:
| for those interested, updated my o3-mini price chart to compare
| the cost-intelligence frontier with deepseek:
| https://x.com/swyx/status/1885432031896887335
| bix6 wrote:
| They use the word reasoning a lot in the post. Is this reasoning
| or statistical prediction?
| jokoon wrote:
| Does that mean I can use this on my recent gaming AMD gpu?
| rednafi wrote:
| The most important detail for me was that in coding, it's weaker
| than 4o and stronger than o1-mini. So I don't have much use for
| it.
| usaar333 wrote:
| Where are you reading that?
| xmichael909 wrote:
| So can I ditch the $200 a month o1 pro account, and go back to
| the $20 account with o3-mini?
| genidoi wrote:
| With o1 pro you're paying for unlimited compute that you don't
| get with $20 + capped o1.
| modeless wrote:
| Initial vibes are not living up to the hype. It fails my pet
| prompt, and the Cursor devs say they still prefer Sonnet[1]. I'm
| sure it will have its uses but it is not going to dominate.
|
| [1] https://x.com/cursor_ai/status/1885415392677675337
| diegocg wrote:
| I hope OpenAI reconsiders the naming of their models sometime.
| I have trouble deciding which model is the one I should use.
| esafak wrote:
| They release models too often for a new one to be better at
| everything, so you have to pick the right one for your task.
| niek_pas wrote:
| And that's exactly where good, recognizable branding comes
| in.
| badgersnake wrote:
| 56% is pretty close to 'don't give a toss'
| mvdtnz wrote:
| Wake up honey, a new lie generator just dropped.
| sourcecodeplz wrote:
| Even for free users, that is nice
| vok wrote:
| Well, o3-mini-high just successfully found the root cause of a
| seg fault that o1 missed: mistakenly using _mm512_store_si512 for
| an unaligned store that should have been _mm512_storeu_si512.
| nextworddev wrote:
| rip development jobs /s.. or not /s
| throw83288 wrote:
| How do I avoid the angst about this stuff as a student in
| computer science? I love this field but frankly I've been at a
| loss since the rapid development of these models.
| jumploops wrote:
| LLMs are the new compilers.
|
| As a student, you should continue to focus on fundamentals,
| but also adapt LLMs into your workflow where you can.
|
| Skip writing the assembly (now curly braces and semicolons),
| and focus on what the software you're building actually does,
| who it serves, and how it works.
|
| Programming is both changing a lot, and not at all. The
| mechanics may look different, but the purpose is still the
| same: effectively telling computers what to do.
| abdullahkhalids wrote:
| As a former prof. What you should be learning from any STEM
| degree (and many other degrees as well) is to think clearly,
| rigorously, creatively, and with discipline, etc. You also
| need to learn the skill of learning content and skills
| quickly.
|
| The specific contents or skills of your degree don't matter
| that much. In pretty much any STEM field, over the last
| 100ish years, whatever you learned in your undergraduate was
| mostly irrelevant by the time you retired.
|
| Everyone got by, by staying on top of the new developments in
| the field and doing them. With AI, the particular skills
| needed to use the power of computers to do things in the
| world have changed. Just learn those skills.
| danparsonson wrote:
| For all the value that they bring, there is still a good dose
| of parlour tricks and toy examples around, and they need an
| intelligent guiding hand to get the best out of them. As a
| meat brain, you can bring big picture design skills that the
| bots don't have, keeping them on track to deliver a coherent
| codebase, and fixing the inevitable hallucinations. Think of
| it like having a team of optimistic code monkeys with
| terrible memory, and you as the coordinator. I would focus on
| building skills in things like software design/architecture,
| requirements gathering (what do people want and how do you
| design software to deliver it?), in-depth hardware knowledge
| (how to get the best out of your platform), good API design,
| debugging, etc. Leave the CRUD to the robots and be the
| brain.
| mhh__ wrote:
| It's either over, or giving a lot of idiots false confidence
| -- I meet people somewhat regularly who believe they don't
| really need to know what they're doing any more. This is
| probably an arbitrage.
| raincole wrote:
| Angst?
|
| It just means you're less likely to be fixing someone else's
| "mistakenly using _mm512_store_si512 instead of
| _mm512_storeu_si512" error, because the AI fixed it for you,
| and you can focus on other parts of computer science. Computer
| science surely isn't just fixing _mm512_store_si512.
| jiocrag wrote:
| why is this impressive at all? It effectively amounts to
| correcting a typo.
| sandos wrote:
| How many benchmarks for LLMs are there out there?
|
| Is there any evidence of over-fitting on benchmarks, or are
| there truly hidden parts to them?
| ern wrote:
| I haven't bothered with o3 mini, because who wants an "inferior"
| product? I was using 4o as a "smarter Google" until DeepSeek
| appeared (although its web search is being hammered now and I'm
| just using Google ).
|
| o1 seems to have been neutered in the last week: lots of
| disclaimers and butt-covering in its responses.
|
| I also had an annoying discussion with o1 about the DC plane
| crash. It doesn't have web access and its cutoff is 2024, so I
| don't expect it to know about the crash. However, after saying such
| an event is extremely unlikely and being almost patronisingly
| reassuring, it treated pasted news articles and links (which to
| be sure, it can't access) as "fictionalized", instead of
| acknowledging its own cut-off date, and that it could have been
| wrong. In contrast DeepSeek (with web search turned off) was less
| dismissive of the risks in DC airspace, and more aware of its own
| knowledge cut-off.
|
| Coupled with the limited number of o1 responses for ChatGPT Plus,
| I've cancelled my subscription for now.
| danielovichdk wrote:
| I read this as a full-on marketing note targeted towards
| software developers.
| profsummergig wrote:
| Can someone please share the logic behind their version naming
| convention?
| cyounkins wrote:
| I switched an agent from Sonnet V2 to o3-mini (default medium
| mode) and got strangely poor results: only calling 1 tool at a
| time despite being asked to call multiple, not actually doing any
| work, and reporting that it did things it didn't
| binary132 wrote:
| Not really impressed by the answers I just got.
| sshh12 wrote:
| I built a silly political simulation game with this:
| https://state.sshh.io/
|
| https://github.com/sshh12/state-sandbox
| n0id34 wrote:
| Is AI fizzling out, or is it just me? I feel like they're trying
| to smash out new models as fast as they can, but in reality
| they're barely any different; it's turning into the smartphone
| market. New
| iPhone with a slightly better camera and slightly differently
| bevelled edges, get it NOW! But doesn't actually do anything
| better than the iPhone 6.
|
| Claude, GPT 4 onwards, and DeepSeek all feel the same to me. Okay
| to a point, then kinda useless. More like a more convenient
| specialised Google that you need to double check the results of.
| nextworddev wrote:
| on the contrary, it's accelerating since they unlocked a new
| paradigm of scaling
| lordofgibbons wrote:
| Boiling frog. The advances are happening so rapidly, but
| incrementally, that it's not being registered. It just seems
| like the normal state.
|
| Compare LLMs from a year or two ago with the ones out today on
| practically any task. It's night and day difference.
|
| This is specially so when you start taking into account these
| "reasoning" models. It's mind blowing how much better they are
| than "non-reasoning" models for tasks like planning and coding.
|
| https://aider.chat/docs/leaderboards/#aider-polyglot-benchma...
| Alifatisk wrote:
| Any comparison with other models yet?
| antirez wrote:
| Just tested two complicated coding tasks, and surprisingly
| o3-mini-high nailed both while Sonnet 3.5 failed. Will do more
| tests tomorrow.
| lenerdenator wrote:
| No self-host, no care.
| AutistiCoder wrote:
| the o3-mini model would be useful to me if coding's the only
| thing I need to do in a chat log.
|
| When I use ChatGPT these days, it's to help me write coding
| videos and then the social media posts around those videos. So
| that's two specialties in one chat log.
| anotherpaulg wrote:
| For AI coding, o3-mini scored similarly to o1 at 10X less cost on
| the aider polyglot benchmark [0]. This comparison was with both
| models using high reasoning effort. o3-mini with medium effort
| scored in between R1 and Sonnet.
|
|     62% $186 o1 high
|     60% $18  o3-mini high
|     57% $5   DeepSeek R1
|     54% $9   o3-mini medium
|     52% $14  Sonnet
|     48% $0   DeepSeek V3
|
| [0] https://aider.chat/docs/leaderboards/
| throw83288 wrote:
| What do you expect to come from full o3 in terms of automating
| software engineering?
| nextworddev wrote:
| o3 (high) might score 80%+
| stavros wrote:
| Do you have plans to try o3-mini-high as the architect and
| Sonnet as the model?
| mvkel wrote:
| I've been using cursor since it launched, sticking almost
| exclusively to claude-3.5-sonnet because it is incredibly
| consistent, and rarely loses the plot.
|
| As subsequent models have been released, most of which claim to
| be better at coding, I've switched cursor to it to give them a
| try.
|
| o1, o1-pro, deepseek-r1, and the now o3-mini. All of these models
| suffer from the exact same "adhd." As an example, in a NextJS
| app, if I do a composer prompt like "on page.tsx [15 LOC], using
| shadcn components wherever possible, update this page to have a
| better visual hierarchy."
|
| sonnet nails it almost perfectly every time, but suffers from
| some date cutoff issues like thinking that shadcn-ui@latest is
| the repo name.
|
| Every single other model, no matter which, does the following:
| it starts writing radix-ui components from scratch.
| I will interrupt it and say "DO NOT use radix-ui, use shadcn!" --
| it will respond with "ok!" then begin writing its own components
| from scratch, again not using shadcn.
|
| This is still problematic with o3-mini.
|
| I can't believe it's the models. It must be the instruction-set
| that cursor is giving it behind the scenes, right? No amount of
| .cursorrules, or other instruction, seems to get cursor "locked
| in" the way sonnet just seems to be naturally.
|
| It sucks being stuck on the (now ancient) sonnet, but
| inexplicably, it remains the only viable coding option for me.
|
| Has anyone found a workaround?
| jvanderbot wrote:
| I've found cursor to be too thin a wrapper. Aider is somehow
| significantly more functional. Try that.
| dhc02 wrote:
| Aider, with o1 or R1 as the architect and Claude 3.5 as the
| implementer, is so much better than anything you can
| accomplish with a single model. It's pretty amazing. Aider is
| at least one order of magnitude more effective for me than
| using the chat interface in Cursor. (I still use Cursor for
| quick edits and tab completions, to be clear).
| ChadNauseam wrote:
| I normally use aider by just typing in what I want and it
| magically does it. How do I use o1 or R1 to play the role
| of the "architect"?
| macNchz wrote:
| You can start it with something like:
|
|     aider --architect --model o1 --editor-model sonnet
|
| Then you'll be in "architect" mode, which first prompts
| o1 to design the solution, then you can accept it and
| allow sonnet to actually create the diffs.
|
| Most of the time your way works well--I use sonnet alone
| 90% of the time, but the architect mode is really great
| at getting it unstuck when it can't seem to implement
| what I want correctly, or keeps fixing its mistakes by
| making things worse.
| cruffle_duffle wrote:
| I really want to see how apps created this way scale to
| large codebases. I'm very skeptical they don't turn into
| spaghetti messes.
|
| Coding is basically just about the most precise way to
| encapsulate a problem as a solution possible. Taking a
| loose English description and expanding it into piles of
| code is always going to be pretty leaky no matter how
| much these models spit out working code.
|
| In my experience you have to pay a lot of attention to
| every single line these things write because they'll
| often change stuff or more often make wrong assumptions
| that you didn't articulate. And in my experience they
| _never_ ask you questions unless you specifically prompt
| them to (and keep reminding them to), which means they
| are doing a hell of a lot of design and implementation
| that unless carefully looked over will ultimately be
| wrong.
|
| It really reminds me a bit of when Ruby on Rails came out
| and the blogosphere was full of gushing "I've never been
| more productive in my life" posts. And then you find out
| they were basically writing a TODO app and their previous
| development experience was doing enterprise Java for some
| massive non-tech company. Of course RoR will be a breath
| of fresh air for those people.
|
| Don't get me wrong I use cursor as my daily driver but I
| am starting to find the limits for what these things can
| do. And the idea of having two of these LLM's taking some
| paragraph long feature description and somehow chatting
| with each other to create a scalable bit of code that
| fits into a large or growing codebase... well I find that
| kind of impossible. Sure the code compiles and conforms
| to whatever best practices are out there but there will
| be absolutely no consistency across the app--especially at
| the UX level. These things simply cannot hold that kind
| of complexity in their head and even if they could part
| of a developers job is to translate loose English into
| code. And there is much, much, much, much more to that
| than simply writing code.
| macNchz wrote:
| I see what you're saying and I think that terming this
| "architect" mode has an implication that it's more
| capable than it really is, but ultimately this two model
| pairing is mostly about combining disparate abilities to
| separate the "thinking" from the diff generation. It's
| very effective in producing better results for a single
| prompt, but it's not especially helpful for
| "architecting" a large scale app.
|
| That said, in the hands of someone who is competent at
| assembling a large app, I think these tools can be
| incredibly powerful. I have a business helping companies
| figure out how/if to leverage AI and have built a bunch
| of different production LLM-backed applications _using_
| LLMs to write the code over the past year, and my
| impression is that there is very much something there.
| Taking it step by step, file by file, like you might if
| you wrote the code yourself, describing your concept of
| the abstractions, having a few files describing the
| overall architecture that you can add to the chat as
| needed--little details make a big difference in the
| results.
| tribeca18 wrote:
| I use Cursor and Composer in agent mode on a daily basis,
| and this is basically exactly what happened to me.
|
| After about 3 weeks, things were looking great - but lots
| of spaghetti code was put together, and it never told me
| what I didn't know. The data & state management
| architecture I had written was simply just not
| maintainable (tons of prop drilling, etc). Over time, I
| basically learned common practices/etc and I'm finding
| that I have to deal with these problems myself. (how it
| used to be!)
|
| We're getting close - the best thing I've done is create
| documentation files with lots of descriptions about the
| architecture/file structure/state
| management/packages/etc, but it only goes so far.
|
| We're getting closer, but for right now - we're not there
| and you have to be really careful with looking over all
| the changes.
| dwaltrip wrote:
| I haven't tried aider in quite a while, what does it mean
| to use one model as an architect and another as the
| implementer?
| Terretta wrote:
| _Aider now has experimental support for using two models
| to complete each coding task:_
|
| _- An Architect model is asked to describe how to solve
| the coding problem._
|
| _- An Editor model is given the Architect's solution and
| asked to produce specific code editing instructions to
| apply those changes to existing source files._
|
| _Splitting up "code reasoning" and "code editing" in
| this manner has produced SOTA results on aider's code
| editing benchmark. Using o1-preview as the Architect with
| either DeepSeek or o1-mini as the Editor produced the
| SOTA score of 85%. Using the Architect /Editor approach
| also significantly improved the benchmark scores of many
| models, compared to their previous "solo" baseline scores
| (striped bars)._
|
| https://aider.chat/2024/09/26/architect.html
| lukas099 wrote:
| Probably gonna show a lot of ignorance here, but isn't
| that a big part of the difference between our brains and
| AI? That instead of one system, we are many systems that
| are kind of sewn together? I secretly think AGI will just
| be a bunch of different specialized AIs working together.
| Terretta wrote:
| You're in good company in that secret thought.
|
| Have a look at this:
| https://en.wikipedia.org/wiki/Society_of_Mind
| aledalgrande wrote:
| Same with Cline
| zackproser wrote:
| Not trying to be snarky, but the example prompt you provided is
| about 1/15th the length and detail of prompts I usually send
| when working with Cursor.
|
| I tend to exhaustively detail what I want, including package
| names and versions because I've been to that movie before...
| jwpapi wrote:
| Yes, this works well for me too; better to take your time and
| get the first prompt right.
| inerte wrote:
| What works nice also is the text to speech. I find it easier
| and faster to give more context by talking rather than
| typing, and the extra content helps the AI to do its job.
|
| And even though the speech recognition fails a lot on some of
| the technical terms or weirdly named packages, software, etc,
| it still does a good job overall (if I don't feel like
| correcting the wrong stuff).
|
| It's great and has become somewhat of a party trick at work.
| Some people don't even use AI to code that often, and when I
| show them "hey have you tried this?" And just _tell_ the
| computer what I want? Most folks are blown away.
| cadence- wrote:
| Does the Cursor have text-to-speech functionality?
| fud101 wrote:
| you mean speech to text right?
| chefandy wrote:
| Not for me. I first ask Advanced Voice to read me some
| code and have Siri listen and email it to an API I wrote
| which uses Claude to estimate the best cloud provider to
| run that code based on its requirements and then a n8n
| script deploys it and send me the results via twilio.
| mvkel wrote:
| My point was that a prompt that simple could be held and
| executed very well by sonnet, but all other models
| (especially reasoning models) crash and burn.
|
| It's a 15 line tsx file so context shouldn't be an issue.
|
| Makes me wonder if reasoning models are really proper models
| for coding in existing codebases
| liamwire wrote:
| Your last point matches what I've seen some people
| (simonw?) say they're doing currently: using aider to work
| with two models--one reasoning model as an architect, and
| one standard LLM as the actual coder. Surprisingly, the
| results seem pretty good vs. putting everything on one
| model.
| mvkel wrote:
| This is probably the right way to think about it. O1-pro
| is an absolute monster when it comes to architecture. It
| is staggering the breadth and depth that it sees. Ask it
| to actually implement though, and it trips over its
| shoelaces almost immediately.
| goosejuice wrote:
| Can you give an example of this monstrous capability you
| speak of? What have you used it for professionally w.r.t.
| architecture.
| crooked-v wrote:
| If I have to write a prompt that long, it'll be faster to just
| write the code.
| aprilthird2021 wrote:
| Shocking to see this because this was essentially the
| reason most of the previous no-code solutions never took
| off...
| esperent wrote:
| That sounds exhausting. Wouldn't it be faster to include your
| package.json in the context?
|
| I sometimes do this (using Cline), plus create a .cline file
| at project root which I refine over time and which describes
| both the high level project overview, details of the stack
| I'm using, and technical details I want each prompt to
| follow.
|
| Then each actual prompt can be quite short: _read files x, y,
| and z, and make the following changes..._ where I keep the
| changes concise and logically connected - basically what I
| might do for a single pull request.
| hombre_fatal wrote:
| You're basically saying you write 15x the prompt for the same
| result they get with sonnet.
| kace91 wrote:
| My experience with cursor and sonnet is that it is relatively
| good at first tries, but completely misses the plot during
| corrections.
|
| "My attempt at solving the problem contains a test that fails?
| No problem, let me mock the function I'm testing, so that,
| rather than actually run, it returns the expected value!"
|
| It keeps doing that kind of shenanigans, applying modifications
| that solve the newly appearing problem while screwing the
| original attempt's goal.
|
| I usually get much better results from regular chatgpt copying
| and pasting, the trouble being that it is a major pain to
| handle the context window manually by pasting relevant info and
| reminding what I think is being forgotten.
| jwpapi wrote:
| Yes it's usually worth it to try to write a really good first
| prompt
| earleybird wrote:
| More than once I've found myself going down this 'little
| maze of twisty passages, all alike'. At some point I stop,
| collect up the chain of prompts in the conversation, and
| curate them into a net new prompt that should be a bit
| better. Usually I make better progress - at least for a
| while.
| dr_dshiv wrote:
| Why is it so hard to share/find prompts or distill my own
| damn prompts? There must be good solutions for this --
| whall6 wrote:
| Don't outsource the only thing left for our brains to do
| themselves :/
| garfij wrote:
| What do you find difficult about distilling your own
| prompts?
|
| After any back and forth session I have reasonably good
| results asking something like "Given this workflow, how
| could I have prompted this better from the start to get
| the same results?"
| SamPatt wrote:
| This becomes second nature after a while. I've developed
| an intuition about when a model loses the plot and when
| to start a new thread. I have a base prompt I keep for
| the current project I'm working on, and then I ask the
| model to summarize what we've done in the thread and
| combine them to start anew.
|
| I can't wait until this is a solved problem because it
| does slow me down.
| hahajk wrote:
| Can't you select ChatGPT as the model in Cursor?
| kace91 wrote:
| Yes, but for some reason it seems to perform worse there.
|
| Perhaps whatever algorithms Cursor uses to prepare the
| context it feeds the model are a good fit for Claude but
| not so much for the others (?). It's a random guess, but
| whatever the reason, there's a weird worsening of
| performance vs pure chat.
| electroly wrote:
| Yes but every model besides claude-3.5-sonnet sucks in
| Cursor, for whatever reason. They might as well not even
| offer the other models. The other models, even "smarter"
| models, perform vastly poorer or don't support agent
| capability or both.
| delichon wrote:
| Claude makes a lot of crappy change suggestions, but when you
| ask "is that a good suggestion?" it's pretty good at judging
| when it isn't. So that's become standard operating procedure
| for me.
|
| It's difficult to avoid Claude's strong bias for being
| agreeable. It needs more HAL 9000.
| 4b11b4 wrote:
| I'm always asking Claude to propose a variety of
| suggestions for the problem at hand and their trade-offs,
| then having it evaluate them and pick the top three proposals,
| with reasons why. Then I'll pick one of them and further vet
| the idea.
| esperent wrote:
| > when you ask "is that a good suggestion?" it's pretty
| good at judging when it isn't
|
| Basically a poor man's CoT (chain of thought).
| mathieuh wrote:
| Hah, I was trying it the other day in a Go project and it did
| exactly the same thing. I couldn't believe my eyes, it
| basically rewrote all the functions back out in the test file
| but modified slightly so the thing that was failing wouldn't
| even run.
| sheepscreek wrote:
| For my advanced use case involving Python and knowledge of
| finance, Sonnet fared poorly. Contrary to what I am reading
| here, my favorite approach has been to use o1 in agent mode.
| It's an absolute delight to work with. It is like I'm working
| with a capable peer, someone at my level.
|
| Sadly there are some hard limits on o1 with Cursor and I
| cannot use it anymore. I do pay for their $20/month
| subscription.
| electroly wrote:
| > o1 in agent mode
|
| How? It specifically tells me this is unsupported: "Agent
| composer is currently only supported using Anthropic models
| or GPT-4o, please reselect the model and try again."
| energy123 wrote:
| Context length possibly. Prompt adherence drops off with
| context, and anything above 20k tokens is pushing it. I get the
| best results by presenting the smallest amount of context
| possible, including removing comments and main methods and
| functions that it doesn't need to see. It's a bit more work
| (not _that_ much if you have a script that does it for you),
| but the results are worth it. You could test in the chatgpt app
| (or lmarena direct chat) where you ask the same question but
| with minimal hand curated context, and see if it makes the same
| mistake.
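| The trimming script can be as crude as dropping comments and
| blank lines before pasting (a sketch for Python sources; a
| regex pass, not a parser, so it can mangle strings containing
| '#'):
|
|     import re, sys
|
|     def shrink(source):
|         # Drop blank lines and full-line comments, then strip
|         # trailing comments; crude but fine for context trimming.
|         lines = [l for l in source.splitlines()
|                  if l.strip() and not l.strip().startswith("#")]
|         return "\n".join(re.sub(r"\s+#.*$", "", l) for l in lines)
|
|     if __name__ == "__main__":
|         print(shrink(sys.stdin.read()))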
| mvkel wrote:
| If it's a context issue, it's an issue with how cursor itself
| sends the context to these reasoning LLMs.
|
| Context alone shouldn't be the reason that sonnet succeeds
| consistently, but others (some which have even bigger context
| windows) fail.
| energy123 wrote:
| Yes, that's what I'm suggesting. Cursor is spamming the
| models with too much context, which harms reasoning models
| more than it harms non-reasoning models (hypothesis, but
| one that aligns with my experience). That's why I
| recommended testing reasoning models outside of Cursor with
| a hand curated context.
|
| The advertised context length being longer doesn't
| necessarily map 1:1 with the actual ability the models have
| to perform difficult tasks over that full context. See for
| example the plots of performance on ARC vs context length
| for o-series models.
| esperent wrote:
| Claude uses Shadcn-ui extensively in the web interface, to the
| point where I think it's been trained to use it over other UI
| components.
|
| So I think you got lucky and you're asking it to write using a
| very specific code library that it's good at, because it
| happens to use it for its main userbase on the web chat
| interface.
|
| I wonder if you were using a different component library, or
| using Svelte instead of React, would you still find Claude the
| best?
| MaxLeiter wrote:
| We've been working on solving a lot of these issues with v0.dev
| (disclaimer: shadcn and I work on it). We do a lot of pre and
| post-processing to ensure LLMs output valid shadcn code.
|
| We're also talking to the cursor/windsurf/zed folks on how we
| can improve Next.js and shadcn in the editors (maybe something
| like llms.txt?)
| mvkel wrote:
| Thanks for all the work you do! v0 is magical. I absolutely
| love the feature where I can add a chunky component that v0
| made to my repo with npx
| kristopolous wrote:
| "not" and other function words; _usually_ work fine today but
| if I 'm having trouble, the best thing to do is probably be
| inclusive, not exclusive.
| eagleinparadise wrote:
| Cursor is also very user-unfriendly in providing alternative
| models to use in composer (agent). There's a heavy reliance on
| Anthropic for Cursor.
|
| Try using Gemini thinking with Cursor. It barely works. Cmd-k
| outputs the thinking into the code. It's unusable in chat
| because the formatting sucks.
|
| Is there some relationship between Cursor and Anthropic, i
| wonder. Plenty of other platforms seem very eager to give users
| model flexibility, but Cursor seems to be lacking.
|
| I could be wrong, just an observation.
| OkGoDoIt wrote:
| I also have been less impressed by o1 in cursor compared to
| sonnet 3.5. Usually what I will do for a very complicated
| change is ask o1 to architect it, specifically asking it to
| give me a detailed plan for how it would be implemented, but
| not to actually implement anything. I then change the model to
| Sonnet 3.5 to have it actually do the implementation.
|
| And on the side of not being able to get models to understand
| something specific, there's a place in a current project where
| I use a special Unicode apostrophe during some string parsing
| because a third-party API needs it. But any code modifications
| by the AI to that file always replace it with a standard ascii
| apostrophe. I even added a comment on that line to the effect
| of "never replaced this apostrophe, it's important to leave it
| exactly as it is!" And also put that in my cursor rules, and
| sometimes directly in the prompt as well, but it always
| replaces it even for completely unrelated changes. I've had to
| manually fix it like 10 times in the last day, it's
| infuriating.
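| One defensive option is a tiny guard test that fails fast
| whenever the character gets clobbered (a sketch; the file path
| is made up, and U+2019 stands in for whatever apostrophe the
| third-party API needs):
|
|     # Run in CI or a pre-commit hook so any AI edit that swaps
|     # the special apostrophe for ASCII is caught immediately.
|     PATH_TO_GUARD = "src/parser.py"  # hypothetical path
|
|     def test_special_apostrophe_survives():
|         text = open(PATH_TO_GUARD, encoding="utf-8").read()
|         assert "\u2019" in text, (
|             "special apostrophe was replaced with plain ASCII")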
| foobiekr wrote:
| Have you tried any of the specialty services like Augment? I am
| curious if they are any better or just snake oil.
| chrismsimpson wrote:
| I've coded in many languages over the years but reasonably new
| to the TS/JS/Next world.
|
| I've found if you give your prompts a kind of long-form "stream of
| consciousness", where you outline snippets of code in markdown
| along with contextual notes and then summarise/outline at the
| end what you actually wish to achieve, you can get great
| results.
|
| Think long-form, single-page "documentation"-type prompts
| that alternate between written copy/contextual
| intent/description and code blocks. Annotating code blocks with
| file names above the blocks I'm sure helps too. Don't waste
| your context window on redundant/irrelevant information or
| code, stating a code sample is abridged or adding commented
| ellipses seems to do the job.
| twilightfringe wrote:
| ha! good to confirm! I tend to do this, just kind of as a
| double-check thing, but never sure if it actually worked or
| if it was a placebo, lol.
|
| Or end with "from the user's perspective: all the "B"
| elements should light up in excitement when you click "C""
| mvkel wrote:
| Going to try this! Thanks for the tip
| d357r0y3r wrote:
| By the time I've fully documented and explained what I want
| to be done, and then review the result, usually finding that
| it's worse than what I would have written myself, I end up
| questioning my instinct to even reach for this tool.
|
| I like it for general refactoring and day to day small tasks,
| but anything that's relatively domain-specific, I just can't
| seem to get anything that's worth using.
| noahbp wrote:
| Like most AI tools, great for beginners, time-savers for
| intermediate users, and frequently a waste of time in
| domains where you're an expert.
|
| I've used Cursor for shipping better frontend slop, and
| it's great. I skip a lot of trial and error, but not all of
| it.
| d357r0y3r wrote:
| I agree that it's amazing as a learning tool. I think the
| "time to ramp" on a new technology or programming
| language has probably been cut in half or more.
| epolanski wrote:
| > and frequently a waste of time in domains where you're
| an expert.
|
| I'm a domain expert and I disagree.
|
| There's many scenarios where using LLMs pays off.
|
| E.g. a long file or very long function is just that, and
| an LLM is faster at understanding it whole, not being
| limited in how many things you can track in your mind at
| once (between 4 and 6). It's still gonna be faster at
| refactoring it and testing it than you are.
| Abishek_Muthian wrote:
| Just curious, did you try a code model like Codestral instead
| of a MoE?
| bugglebeetle wrote:
| o3 mini's date cut-off is 2023, so it's unfortunately not gonna
| be useful for anything that requires knowledge of recent
| framework updates, which includes probably all big frontend
| stuff.
| hombre_fatal wrote:
| I have the same experience. Just today I was integrating a new
| logging system with my kubernetes cluster.
|
| I tried out the OP model to make changes to my yaml files. It
| would give short snippets and I'd have to keep trial and
| erroring its suggestions.
|
| Eventually I pasted the original prompt to Claude and it one-
| shot the dang thing with perfect config. Made me wonder why I
| even try new models.
| pknerd wrote:
| OT: How many tokens are being consumed? How much are you paying
| for Claude APIs?
| harshitaneja wrote:
| So I think I finally understood recently why we have these
| divergent groups: one thinking Claude 3.5 Sonnet is the best
| model for coding, and another that follows whatever the OpenAI
| SOTA is at the moment. I have been a heavy user of ChatGPT,
| jumping onto Pro without thinking for more than a second once
| it was released. Recently, though, I took a pause from my
| usual work on statistical modelling, heuristics and other
| deep-domain tasks to focus on building client APIs and
| frontends, decided to give Claude another try, and it is just
| so great to work with for this use case.
|
| My hypothesis is that it's a difference in what you are doing:
| the OpenAI o-series models are much better than others at
| mathematical modelling and similar tasks, and Claude at more
| general-purpose programming.
| mycall wrote:
| Have you tried multi-agent chat sessions, with each agent
| fielding its own speciality, to see if that improves your use
| cases (a sort of mixture-of-experts)?
| digitcatphd wrote:
| The reality, I suspect, is that one will use different models
| for different things. Think of it like having different modes
| of transportation.
|
| You might use your scooter, bike, car, or jet, depending on
| the circumstances. The bike was invented over 100 years ago,
| but it may still be the best option in the right use case.
| We're still using DaVinci for some things because we haven't
| bothered swapping it out and it works fine.
|
| For me, the value of R1/o3 is the visible reasoning, which
| produces an analysis that Sonnet 3.5 can then critique.
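|
| As a rough sketch of that pipeline (model names as of this
| writing, prompts made up; note the o3-mini API returns the
| final analysis, not the raw reasoning trace):
|
|     from openai import OpenAI
|     from anthropic import Anthropic
|
|     task = "Plan a migration of our billing service to gRPC."
|
|     # Step 1: get an analysis from a reasoning model.
|     plan = OpenAI().chat.completions.create(
|         model="o3-mini",
|         reasoning_effort="high",
|         messages=[{"role": "user", "content": task}],
|     ).choices[0].message.content
|
|     # Step 2: have Sonnet 3.5 critique that analysis.
|     critique = Anthropic().messages.create(
|         model="claude-3-5-sonnet-latest",
|         max_tokens=2000,
|         messages=[{"role": "user",
|                    "content": f"Critique this plan:\n\n{plan}"}],
|     )
|     print(critique.content[0].text)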
| energy123 wrote:
| How do I disable the LLM-summarized thought traces that get
| spammed into my chat window with o3-mini-high?
|
| It's very annoying having to manually press the "^" to hide
| the verbose thought traces _every single question I ask_;
| it totally breaks flow.
| simonw wrote:
| Now that the dust is settling a little bit, I have published my
| notes so far on o3-mini here:
| https://simonwillison.net/2025/Jan/31/o3-mini/
|
| To save you the click: I think the most interesting things about
| this model are the price - less than half that of GPT-4o while
| being better for many things, most notably code - and the
| increased length limits.
|
| 200,000 tokens input and 100,000 output (compared to 128k/16k for
| GPT-4o, and just 8k output for DeepSeek R1 and Claude 3.5)
| could open up some interesting new applications, especially at
| that low price.
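|
| As a minimal sketch of exercising those limits with the openai
| Python client (the input file is a stand-in; note that
| o-series models take max_completion_tokens rather than
| max_tokens):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     # stand-in for a genuinely large input (up to ~200k tokens)
|     long_prompt = open("big_input.txt").read()
|
|     response = client.chat.completions.create(
|         model="o3-mini",
|         reasoning_effort="high",  # "low" | "medium" | "high"
|         max_completion_tokens=100_000,
|         messages=[{"role": "user", "content": long_prompt}],
|     )
|     print(response.choices[0].message.content)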
| zora_goron wrote:
| Does anyone know how "reasoning effort" is implemented
| technically? Does it involve differences in the pre-training,
| RL, or prompting phases (or all of them)?
| Havoc wrote:
| 5 hours in, 500-odd comments. Definitely feels like this has
| less wow factor than previous OAI releases.
| jiocrag wrote:
| This is... underwhelming.
| mohsen1 wrote:
| It's funny: I asked it to fix my script that shows
| _deepseek_'s chain of thought, but it refuses to answer,
| hahaha
| catigula wrote:
| It's actually a bit comforting that it isn't very good.
| waynecochran wrote:
| I just had it convert Swift code to Kotlin and was surprised
| at how the comment was translated. It "knew" the author of the
| paper and what it was doing!? That is wild.
|
| Swift:
|
|     // Double Reflection Algorithm from Table I (page 7)
|     // in Section 4 of https://tinyurl.com/yft2674p
|     for i in 1 ..< N {
|         let X1 = spine[i]
|         ...
|
| Kotlin:
|
|     // Use the Double Reflection Algorithm (from Wang et al.)
|     // to compute subsequent frames.
|     for (i in 1 until N) {
|         val X1 = Vector3f(spine[i])
|         ...
| smallerize wrote:
| Wow, haven't seen a viglink in a while.
| prompt_overflow wrote:
| Plot twist:
|
| 1. they are trying to obfuscate deepseek's success
|
| 2. they are trying to confuse you. the benchmark margins are
| minimal (and meaningless)
|
| 3. they are trying to buy time (with investors) by releasing
| nothing-special models on a predictable schedule (jan -> o3,
| feb -> o3-pro-max, march -> o7-ultra, and in 2026 -> OMG!
| we've reached singularity! (after spending $500B))
|
| -
|
| And at the end of the day, nothing changes for me, and nothing
| for you either. enjoy your time away from this sick ai hype.
| bruh!
| zone411 wrote:
| It scores 72.4 on NYT Connections, a significant improvement
| over o1-mini (42.2), surpassing DeepSeek R1 (54.4) but falling
| short of o1 (90.7).
|
| (https://github.com/lechmazur/nyt-connections/)
| Mr_Bees69 wrote:
| Can't wait till deepseek gets their hands on this
| aussieguy1234 wrote:
| Just gave it a go using open-webui.
|
| One immediate difference I noticed is that o3-mini actually
| observes the system prompt you set. So if I say it's a Staff
| Engineer at Google, it'll stay in character.
|
| That was not possible with o1-mini, which ignored system
| prompts completely.
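|
| A quick sketch of what I mean against the raw API (persona and
| question are made up; open-webui's system prompt setting ends
| up as the same kind of message):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     response = client.chat.completions.create(
|         model="o3-mini",
|         messages=[
|             # o3-mini stays in character; o1-mini ignored this
|             {"role": "system",
|              "content": "You are a Staff Engineer at Google."},
|             {"role": "user",
|              "content": "How would you review a large CL?"},
|         ],
|     )
|     print(response.choices[0].message.content)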
| revskill wrote:
| Models should be better at asking clarifying questions about
| the prompt before spamming you with bad answers.
| energy123 wrote:
| This might be the best publicly available model for coding:
|
| https://livebench.ai/#/?Coding=as
___________________________________________________________________
(page generated 2025-02-01 08:00 UTC)