[HN Gopher] GPT-4.1 in the API
___________________________________________________________________
GPT-4.1 in the API
Author : maheshrijal
Score : 403 points
Date : 2025-04-14 17:01 UTC (5 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| porphyra wrote:
| pretty wild versioning that GPT 4.1 is newer and better in many
| regards than GPT 4.5.
| mhh__ wrote:
| I think they're doing it deliberately at this point
| hmottestad wrote:
| Tomorrow they are releasing the open source GPT-1.4 model :P
| asdev wrote:
| it's worse on nearly every benchmark
| brokensegue wrote:
| no? it's better on AIME '24, Multilingual MMLU, SWE-bench,
| Aider's polyglot, MMMU, ComplexFuncBench
|
| and it ties on a lot of benchmarks
| asdev wrote:
| look at all the graphs in the article
| brokensegue wrote:
| the data i posted all came from the graphs/charts in the
| article
| exizt88 wrote:
| For conversational AI, the most significant part is GPT-4.1 mini
| being 2x faster than GPT-4o with basically the same reasoning
| capabilities.
| bakugo wrote:
| > We will also begin deprecating GPT-4.5 Preview in the API, as
| GPT-4.1 offers improved or similar performance on many key
| capabilities at much lower cost and latency. GPT-4.5 Preview will
| be turned off in three months, on July 14, 2025, to allow time
| for developers to transition.
|
| Well, that didn't last long.
| WorldPeas wrote:
| so we're going back... .4 of a gpt? make it make sense openai..
| elias_t wrote:
| Does anyone have benchmarks comparing it to other models?
| cbg0 wrote:
| claude 3.7 no thinking (diff) - 60.4%
|
| claude 3.7 32k thinking tokens (diff) - 64.9%
|
| GPT-4.1 (diff) - 52.9% (stat is from the blog post)
|
| https://aider.chat/docs/leaderboards/
| oidar wrote:
| I need an AI to understand the naming conventions that OpenAI is
| using.
| fusionadvocate wrote:
| They envy the USB committee.
| ZeroCool2u wrote:
| The lack of benchmark comparisons to other models, especially
| Gemini 2.5 Pro, is telling.
| dmd wrote:
| Gemini 2.5 Pro gets 64% on SWE-bench Verified. Sonnet 3.7 gets
| 70%.
|
| They are reporting that GPT-4.1 gets 55%.
| hmottestad wrote:
| Are those with <<thinking>> or without?
| energy123 wrote:
| With
| chaos_emergent wrote:
| based on their release cadence, I suspect that o4-mini will
| compete on price, performance, and context length with the
| rest of these models.
| hecticjeff wrote:
| o4-mini, not to be confused with 4o-mini
| sanxiyn wrote:
| Sonnet 3.7's 70% is without thinking, see
| https://www.anthropic.com/news/claude-3-7-sonnet
| egeozcan wrote:
| Very interesting. For my use cases, Gemini's responses beat
| Sonnet 3.7's like 80% of the time (gut feeling, didn't
| collect actual data). It beats Sonnet 100% of the time when
| the context gets above 120k.
| int_19h wrote:
| As usual with LLMs. In my experience, all those metrics are
| useful mainly for telling which models are definitely bad, but
| they don't tell you much about which ones are good, and
| especially not how the good ones stack up against each other
| in real-world use cases.
|
| Andrej Karpathy famously quipped that he only trusts two
| LLM evals: Chatbot Arena (which has humans blindly compare
| and score responses), and the r/LocalLLaMA comment section.
| poormathskills wrote:
| Go look at their past blog posts. OpenAI only ever benchmarks
| against their own models.
|
| This is pretty common across industries. The leader doesn't
| compare themselves to the competition.
| dimitrios1 wrote:
| There is no uniform tactic for this type of marketing. They
| will compare against whomever they need to to suit their
| marketing goals.
| oofbaroomf wrote:
| Leader is debatable, especially given the actual
| comparisons...
| swyx wrote:
| also sometimes if you get it wrong you catch unnecessary flak
| kweingar wrote:
| That would make sense if OAI were the leader.
| christianqchung wrote:
| Okay, it's common across other industries, but not this one.
| Here is Google, Facebook, and Anthropic comparing their
| frontier models to others[1][2][3].
|
| [1] https://blog.google/technology/google-deepmind/gemini-
| model-...
|
| [2] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
|
| [3] https://www.anthropic.com/claude/sonnet
| poormathskills wrote:
| Right. Those labs aren't leading the industry.
| awestroke wrote:
| Except they are far from the lead in model performance
| poormathskills wrote:
| Who has a (publicly released) model that is SOTA is
| constantly changing. It's more interesting to see who is
| driving the innovation in the field, and right now that is
| pretty clearly OpenAI (GPT-3, first multi-modal model,
| first reasoning model, etc.).
| codingwagie wrote:
| GPT-4.1 probably is a distilled version of GPT-4.5
|
| I don't understand the constant complaining about naming
| conventions. The number system differentiates the models based on
| capability; any other method would not do that. After ten models
| with random names like "gemini" or "nebula" you would have no
| idea which is which. It's a low-IQ take. You don't name new
| versions of software as if they were completely different
| software.
|
| Also, yesterday, using v0, I replicated a full Next.js UI copying
| a major SaaS player. No backend integration, but the design and
| UX were stunning, and better than I could do if I tried. I have
| 15 years of backend experience at FAANG. Software will get
| automated, and it already is; people just haven't figured it out
| yet.
| rvz wrote:
| > Yesterday, using v0, I replicated a full Next.js UI copying a
| major SaaS player. No backend integration, but the design and
| UX were stunning, and better than I could do if I tried.
|
| Exactly. Those who do frontend or focus on pretty much anything
| Javascript are, how should I say it? Cooked?
|
| > Software will get automated
|
| The first to go are those who use JavaScript / TypeScript;
| those engineers have already been automated out of a job. It is
| all over for them.
| codingwagie wrote:
| Yeah, it's over for them. Complicated business logic and
| sprawling systems are what's keeping backend safe for now.
| But in the big frontend code bases, individual files (like
| React components) are largely decoupled from the rest of the
| code base, which is why frontend is completely cooked.
| camdenreslink wrote:
| I have a medium-sized typescript personal project I work on.
| It probably has 20k LOC of well organized typescript (react
| frontend, express backend). I also have somewhat
| comprehensive docs and cursor project rules.
|
| In general I use Cursor in manual mode asking it to make very
| well scoped small changes (e.g. "write this function that
| does this in this exact spot"). Yesterday I needed to make a
| largely mechanical change (change a concept in the front end,
| make updates to the corresponding endpoints, update the data
| access methods, update the database schema).
|
| This is something very easy I would expect a junior developer
| to be able to accomplish. It is simple, largely mechanical,
| but touches a lot of files. Cursor agent mode puked all over
| itself using Gemini 2.5. It could summarize what changes
| would need to be made, but it was totally incapable of making
| the changes. It would add weird hard coded conditions, define
| new unrelated files, not follow the conventions of the
| surrounding code at all.
|
| TLDR; I think LLMs right now are good for greenfield
| development (create this front end from scratch following
| common patterns), and small scoped changes to a few files. If
| you have any kind of medium sized refactor on an existing
| code base forget about it.
| codingwagie wrote:
| My personal opinion is that leveraging LLMs on a large code base
| requires skill. How you construct the prompt, what you
| keep in context, and which model you use all have a large
| effect on the output. If you just put it into Cursor and
| throw your hands up, you probably didn't do it right.
| camdenreslink wrote:
| I gave it a list of the changes I needed and pointed it
| to the areas of the different files that needed updating. I
| also have comprehensive Cursor project rules. If I needed
| to hand-hold any more than that, it would take
| considerably less time to just make the changes myself.
| Philpax wrote:
| > Cursor agent mode puked all over itself using Gemini 2.5.
| It could summarize what changes would need to be made, but
| it was totally incapable of making the changes.
|
| Gemini 2.5 is currently broken with the Cursor agent; it
| doesn't seem to be able to issue tool calls correctly. I've
| been using Gemini to write plans, which Claude then
| executes, and this seems to work well as a workaround.
| Still unfortunate that it's like this, though.
| camdenreslink wrote:
| Interesting, I've found Gemini better than Claude so I
| defaulted to that. I'll try another refactor in agent
| mode with Claude.
| jsheard wrote:
| > using v0, I replicated a full Next.js UI copying a major SaaS
| player. No backend integration, but the design and UX were
| stunning
|
| AI is amazing, now all you need to create a stunning UI is for
| someone else to make it first so an AI can rip it off. Not
| beating the "plagiarism machine" allegations here.
| codingwagie wrote:
| Here's a secret: most of the highest-funded VC-backed software
| companies are just copying a competitor with a slight product
| spin or different pricing model.
| umanwizard wrote:
| Got any examples?
| codingwagie wrote:
| Rippling
| florakel wrote:
| Exactly, they like to call it "bringing new energy to an
| old industry".
| singron wrote:
| > Jim Barksdale, used to say there's only two ways to make
| money in business: One is to bundle; the other is unbundle
|
| https://a16z.com/the-future-of-work-cars-and-the-wisdom-
| in-s...
| Philpax wrote:
| > The number system differentiates the models based on
| capability, any other method would not do that.
|
| Please rank GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1-nano,
| GPT-4.1-mini, GPT-4.1, GPT-4.5, o1-mini, o1, o1 pro, o3-mini,
| o3-mini-high, o3, and o4-mini in terms of capability without
| consulting any documentation.
| codingwagie wrote:
| Very easy with the naming system?
| bobxmax wrote:
| Really? Is o3-mini-high better than o1-pro?
| vbezhenar wrote:
| In my experience it's better for value/price, but if you
| just need to solve a problem, o1 pro is the best tool
| available.
| umanwizard wrote:
| Btw, as someone who agrees with your point, what's the actual
| answer to this?
| henlobenlo wrote:
| What's the problem? For the layman it doesn't actually
| matter, and for the experts, it's usually very obvious which
| model to use.
| umanwizard wrote:
| That's not true. I'm a layman and 4.5 is obviously better
| than 4o for me, definitely enough to matter.
| henlobenlo wrote:
| You are definitely not a layman if you know the
| difference between 4.5 and 4o. The average user thinks AI
| = OpenAI = ChatGPT.
| umanwizard wrote:
| Well, okay, but I'm certainly not an expert who knows the
| fine differences between all the models available on
| chat.com. So I'm somewhere between your definition of
| "layman" and your definition of "expert" (as are, I
| suspect, most people on this forum).
| DiscourseFan wrote:
| LLMs fundamentally have the same constraints no matter how
| much juice you give them or how much you toy with the
| models.
| minimaxir wrote:
| It depends on how you define "capability" since that's
| different for reasoning and nonreasoning models.
| n2d4 wrote:
| Of these, some are mostly obsolete: GPT-4 and GPT-4 Turbo
| are worse than GPT-4o in both speed and capabilities. o1 is
| worse than o3-mini-high in most aspects.
|
| Then, some are not available yet: o3 and o4-mini. GPT-4.1 I
| haven't played with enough to give you my opinion on.
|
| Among the rest, it depends on what you're looking for:
|
| Multi-modal: GPT-4o > everything else
|
| Reasoning: o1-pro > o3-mini-high > o3-mini
|
| Speed: GPT-4o > o3-mini > o3-mini-high > o1-pro
|
| (My personal favorite is o3-mini-high for most things, as
| it has a good tradeoff between speed and reasoning.
| Although I use 4o for simpler queries.)
| Y_Y wrote:
| So where was o1-pro in the comparisons in OpenAI's
| article? I just don't trust any of these first party
| benchmarks any more.
| umanwizard wrote:
| Is 4.5 not strictly better than 4o?
| zeroxfe wrote:
| There's no single ordering -- it really depends on what
| you're trying to do, how long you're willing to wait, and
| what kinds of modalities you're interested in.
| chaos_emergent wrote:
| I mean, this is actually straightforward if you've been
| paying even the remotest attention.
|
| Chronologically:
|
| GPT-4, GPT-4 Turbo, GPT-4o, o1-preview/o1-mini,
| o1/o3-mini/o3-mini-high/o1-pro, gpt-4.5, gpt-4.1
|
| Model iterations, by training paradigm:
|
| SGD pretraining with RLHF: GPT-4 -> turbo -> 4o
|
| SGD pretraining w/ RL on verifiable tasks to improve
| reasoning ability: o1-preview/o1-mini -> o1/o3-mini/o3-mini-
| high (technically the same product with a higher reasoning
| token budget) -> o3/o4-mini (not yet released)
|
| reasoning model with some sort of Monte Carlo Search
| algorithm on top of reasoning traces: o1-pro
|
| Some sort of training pipeline that does well with sparser
| data, but doesn't incorporate reasoning (I'm positing here,
| training and architecture paradigms are not that clear for
| this generation): gpt-4.5, gpt-4.1 (likely fine-tuned on 4.5)
|
| By performance: hard to tell! Depends on what your task is,
| just like with humans. There are plenty of benchmarks.
| Roughly, for me, the top 3 by task are:
|
| Creative Writing: gpt-4.5 -> gpt-4o
|
| Business Comms: o1-pro -> o1 -> o3-mini
|
| Coding: o1-pro -> o3-mini (high) -> o1 -> o3-mini (low) ->
| o1-mini-preview
|
| Shooting the shit: gpt-4o -> o1
|
| It's not to dismiss that their marketing nomenclature is bad,
| just to point out that it's not that confusing for people
| who are actively working with these models and have a
| reasonable memory of the past two years.
| newfocogi wrote:
| I recognize this is a somewhat rhetorical question and your
| point is well taken. But something that maps well is car
| makes and models:
|
| - Is Ford Better than Chevy? (Comparison across providers) It
| depends on what you value, but I guarantee there's tribes
| that are sure there's only one answer.
|
| - Is the 6th gen 2025 4Runner better than 5th gen 2024
| 4Runner? (Comparison of same model across new releases) It
| depends on what you value. It is a clear iteration on the
| technology, but there will probably be more plastic parts
| that will annoy you as well.
|
| - Is the 2025 BMW M3 base model better than the 2022 M3
| Competition (Comparing across years and trims)? Starts to
| depend even more on what you value.
|
| Providers need to delineate between releases, and years,
| models, and trims help do this. There are companies that will
| try to eschew this and go the Tesla route without model
| years, but still can't get away from it entirely. To a
| certain person, every character in "2025 M3 Competition
| xDrive Sedan" matters immensely; to another person it's just
| gibberish.
|
| But a pure ranking isn't the point.
| mrandish wrote:
| Yes, point taken.
|
| However, it's _still_ not as bad as Intel CPU naming in some
| generations or USB naming (until very recently). I know,
| that's a _very_ low bar... :-)
| tomrod wrote:
| Just add SemVer with an extra tag:
|
| 4.0.5.worsethan4point5
| whalesalad wrote:
| > I don't understand the constant complaining about naming
| conventions.
|
| Oh man. Unfolding my lawn chair and grabbing a bucket of
| popcorn for this discussion.
| latexr wrote:
| > You don't name new versions of software as if they were
| completely different software
|
| macOS releases would like a word with you.
|
| https://en.wikipedia.org/wiki/MacOS#Timeline_of_releases
|
| Technically they still have numbers, but Apple hides them in
| marketing copy.
|
| https://www.apple.com/macos/
|
| Though they still have "macOS" in the name. I'm being tongue-
| in-cheek.
| SubiculumCode wrote:
| Feel free to lay the naming convention rules out for us man.
| throw1235435 wrote:
| > Software will get automated, and it already is, people just
| havent figured it out yet
|
| To be honest, I think this is most AI labs' (particularly the
| American ones') not-so-secret goal now, for a number of strong
| reasons. You can see it in this announcement, Anthropic's
| recent Claude 3.7 announcement, OpenAI's first planned agent
| (SWE-Agent), etc. They have to justify their worth somehow
| and they see it as a potential path to do that. Remains to be
| seen how far they will get - I hope I'm wrong.
|
| The reasons however for picking this path IMO are:
|
| - Their usage statistics show coding as the main use case:
| Anthropic recently released their stats. It's become the main
| usage of these models, with other uses at best being novelties
| or conveniences in relative terms. Without this market, IMO the
| hype would have already fizzled a while ago, or remained at best
| a novelty given the size of the rest of the user base.
|
| - They "smell blood" to disrupt and fear is very effective to
| promote their product: This IMO is the biggest one. Disrupting
| software looks to be an achievable goal, but it also is a goal
| that has high engagement compared to other use cases. No point
| solving something awesome if people don't care, or only care
| for awhile (e.g. meme image generation). You can see the
| developers on this site and elsewhere in fear. Fear is the best
| marketing tool ever and engagement can last years. It keeps
| people engaged and wanting to know more; and talking about how
| "they are cooked" almost to the exclusion of everything else
| (i.e. focusing on the threat). Nothing motivates you to know a
| product more than not being able to provide for yourself, your
| family, etc., to the point that most other tech
| topics/innovations are being drowned out by AI announcements.
|
| - Many of them are losing money and need a market to disrupt:
| Currently the existing use cases of a chat bot are not yet
| impressive enough (or haven't been till very recently) to
| justify the massive valuations of these companies. It's coding
| that is allowing them to bootstrap into other domains.
|
| - It is a domain they understand: AI devs know models, and they
| understand the software process. It may be a complex domain
| requiring constant study, but they know it back to front. This
| makes it a good first case for disruption, where the data and
| the know-how are already with the teams.
|
| TL;DR: They are coming after you, because it is a big fruit
| that is easier for them to pick than other domains. It's also
| one that people will notice, either out of excitement (CEOs,
| VCs, management, etc.) or out of fear (tech workers, academics,
| other intellectual workers).
| rvz wrote:
| The big change about this announcement is the 1M context window
| on all models.
|
| But the _price_ is what matters.
| croemer wrote:
| Nothing compared to Llama 4's 10M. What matters is how well it
| performs with such a long context, not what the technical
| maximum is.
| polytely wrote:
| It seems that OpenAI is really differentiating itself in the AI
| market by developing the most incomprehensible product names in
| the history of software.
| croes wrote:
| They learned from the best: Microsoft
| pixl97 wrote:
| "Hey buddy, want some .Net, oh I mean dotnet"
| nivertech wrote:
| GPT 4 Workgroups
| amarcheschi wrote:
| GpTeams Classic
| greenavocado wrote:
| Microsoft Neural Language Processing Hyperscale Datacenter
| Enterprise Edition 4.1
|
| A massive transformer-based language model requiring:
|
| - 128 Xeon server-grade CPUs
|
| - 25,000MB RAM minimum (40,000MB recommended)
|
| - 80GB hard disk space for model weights
|
| - Dedicated NVIDIA Quantum Accelerator Cards (minimum 8)
|
| - Enterprise-grade cooling solution
|
| - Dedicated 30-amp power circuit
|
| - Windows NT Advanced Server with Parallel Processing
| Extensions
|
| ~
|
| Features:
|
| - Natural language understanding and generation
|
| - Context window of 8,192 tokens
|
| - Enterprise security compliance module
|
| - Custom prompt engineering interface
|
| - API gateway for third-party applications
|
| *Includes 24/7 on-call Microsoft support team and requires
| dedicated server room with raised floor cooling
| jmount wrote:
| Or Intel.
| jfoster wrote:
| I wonder how they decide whether the o or the digit needs to
| come first. (eg. o3 vs 4o)
| oofbaroomf wrote:
| Reasoning models have the o first, non-reasoners have the
| digit first.
| yberreby wrote:
| > Note that GPT-4.1 will only be available via the API. In
| ChatGPT, many of the improvements in instruction following,
| coding, and intelligence have been gradually incorporated into
| the latest version of GPT-4o, and we will continue to
| incorporate more with future releases.
|
| The lack of availability in ChatGPT is disappointing, and they're
| playing on ambiguity here. They are framing this as if it were
| unnecessary to release 4.1 on ChatGPT, since 4o is apparently
| great, while simultaneously showing how much better 4.1 is
| relative to GPT-4o.
|
| One wager is that the inference cost is significantly higher for
| 4.1 than for 4o, and that they expect most ChatGPT users not to
| notice a marginal difference in output quality. API users,
| however, will notice. Alternatively, 4o might have been
| aggressively tuned to be conversational while 4.1 is more
| "neutral"? I wonder.
| themanmaran wrote:
| I disagree. From the average user perspective, it's quite
| confusing to see half a dozen models to choose from in the UI.
| In an ideal world, ChatGPT would just abstract away the
| decision. So I don't need to be an expert in the relatively
| minor differences between each model to have a good experience.
|
| Vs in the API, I want to have very strict versioning of the
| models I'm using, letting me run my own evals and pick
| the model that works best.
| florakel wrote:
| > it's quite confusing to see half a dozen models to choose
| from in the UI. In an ideal world, ChatGPT would just
| abstract away the decision
|
| Supposedly that's coming with GPT 5.
| yberreby wrote:
| I agree on both naming and stability. However, this wasn't my
| point.
|
| They still have a mess of models in ChatGPT for now, and it
| doesn't look like this is going to get better immediately
| (even though for GPT-5, they ostensibly want to unify them).
| You have to choose among all of them anyway.
|
| I'd like to be able to choose 4.1.
| Tiberium wrote:
| There's a HUGE difference that you are not mentioning: there
| are "gpt-4o" and "chatgpt-4o-latest" in the API. The former is
| the stable version (there are a few snapshots, but the newest
| snapshot has been there for a while), and the latter is the
| fine-tuned version that they often update on ChatGPT. All those
| benchmarks were done for the _API_ stable version of GPT-4o,
| since that's what businesses rely on, not on
| "chatgpt-4o-latest".
| yberreby wrote:
| Good point, but how does that relate to, or explain, the
| decision not to release 4.1 in ChatGPT? If they have a nice
| post-training pipeline to make 4o "nicer" to talk to, why not
| use it to fine-tune the base 4.1 into e.g.
| chatgpt-4.1-latest?
| Tiberium wrote:
| Because chatgpt-4o-latest already has all of those
| improvements, the largest point of this release (IMO) is to
| offer developers a stable snapshot of something that
| compares to the modern 4o-latest. Altman said that they'd offer
| a stable snapshot of chatgpt-4o-latest in the API; perhaps he
| really did mean GPT-4.1.
| yberreby wrote:
| > Because chatgpt-4o-latest already has all of those
| improvements
|
| Does it, though? They said that "many" have already been
| incorporated. I simply don't buy their vague statements
| there. These are different models. They may share some
| training/post-training recipe improvements, but they are
| still different.
| meetpateltech wrote:
| GPT-4.1 Pricing (per 1M tokens):
|
| gpt-4.1
|
| - Input: $2.00
|
| - Cached Input: $0.50
|
| - Output: $8.00
|
| gpt-4.1-mini
|
| - Input: $0.40
|
| - Cached Input: $0.10
|
| - Output: $1.60
|
| gpt-4.1-nano
|
| - Input: $0.10
|
| - Cached Input: $0.025
|
| - Output: $0.40
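|
| To put these per-million-token rates in perspective, here is a
| minimal sketch of what a single call would cost (the request
| sizes below are hypothetical, not from the announcement):
|
|     # cost per 1M tokens: (input, cached input, output), in USD
|     PRICES = {
|         "gpt-4.1":      (2.00, 0.50, 8.00),
|         "gpt-4.1-mini": (0.40, 0.10, 1.60),
|         "gpt-4.1-nano": (0.10, 0.025, 0.40),
|     }
|
|     def request_cost(model, fresh_in, cached_in, out):
|         inp, cached, outp = PRICES[model]
|         total = fresh_in * inp + cached_in * cached + out * outp
|         return total / 1_000_000
|
|     # e.g. 10k input tokens (2.5k of them cached) + 1k output
|     print(request_cost("gpt-4.1", 7_500, 2_500, 1_000))  # ~$0.0243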
| minimaxir wrote:
| The cached input price is notable here: previously with GPT-4o
| it was 1/2 the cost of raw input, now it's 1/4th.
|
| It's still not as notable as Claude's 1/10th the cost of raw
| input, but it shows OpenAI's making improvements in this area.
| persedes wrote:
| Unless that has changed, Anthropic's (and Gemini's) caches are
| opt-in, though, if I recall; OpenAI automatically caches for
| you.
| glenstein wrote:
| Awesome, thank you for posting. As someone who regularly uses
| 4o mini from the API, any guesses or intuitions about the
| performance of Nano?
|
| I'm not as concerned about nomenclature as other people, which
| I think is too often reacting to a headline as opposed to the
| article. But in this case, I'm not sure if I'm supposed to
| understand nano as categorically different from mini in terms
| of what it means as a variation on a core model.
| pzo wrote:
| They shared in the livestream that 4.1-nano is worse than
| 4o-mini - so nano is cheaper, faster, and has a bigger context,
| but is worse in intelligence. 4.1-mini is smarter, but there is
| a price increase.
| twistslider wrote:
| The fact that they're raising the price for the mini models by
| 166% is pretty notable.
|
| gpt-4o-mini for comparison:
|
| - Input: $0.15
|
| - Cached Input $0.075
|
| - Output: $0.60
| conradkay wrote:
| Seems like 4.1 nano ($0.10) is closer to the replacement and
| 4.1 mini is a new in-between price
| druskacik wrote:
| That's what I was thinking. I hoped to see a price drop, but
| this does not change anything for my use cases.
|
| I was using gpt-4o-mini with batch API, which I recently
| replaced with mistral-small-latest batch API, which costs
| $0.10/$0.30 (or $0.05/$0.15 when using the batch API). I may
| change to 4.1-nano, but I'd have to be overwhelmed by its
| performance in comparison to Mistral.
| glenstein wrote:
| I don't think they ever committed themselves to uniform
| pricing for mini models. Of course cheaper is better, but I
| understand pricing to be contingent on factors specific to
| each new model rather than following from a blanket policy.
| minimaxir wrote:
| It's not the point of the announcement, but I do like the use of
| the (abs) subscript to demonstrate the improvement in LLM
| performance since in these types of benchmark descriptions I
| never can tell if the percentage increase is absolute or
| relative.
| croemer wrote:
| Testing against unspecified other "leading" models allows for
| shenanigans:
|
| > Qodo tested GPT-4.1 head-to-head against other leading models
| [...] they found that GPT-4.1 produced the better suggestion in
| 55% of cases
|
| The linked blog post goes 404:
| https://www.qodo.ai/blog/benchmarked-gpt-4-1/
| gs17 wrote:
| The post seems to be up now and compares it slightly
| favorably to Claude 3.7.
| croemer wrote:
| Right, now it's up and comparison against Claude 3.7 is
| better than I feared based on the wording. Though why does
| the OpenAI announcement talk of comparison against multiple
| leading models when the Qodo blog post only tests against
| Claude 3.7...
| runako wrote:
| ChatGPT currently recommends I use o3-mini-high ("great at coding
| and logic") when I start a code conversation with 4o.
|
| I don't understand why the comparison in the announcement talks
| so much about comparing 4.1's coding abilities with 4o's.
| Wouldn't the relevant comparison be to o3-mini-high?
|
| 4.1 costs a lot more than o3-mini-high, so this seems like a
| pertinent thing for them to have addressed here. Maybe I am
| misunderstanding the relationship between the models?
| zamadatix wrote:
| 4.1 is a pinned API variant with the improvements from the
| newer iterations of 4o you're already using in the app, so
| that's why the comparison focuses between those two.
|
| Pricing wise the per token cost of o3-mini is less than 4.1 but
| keep in mind o3-mini is a reasoning model and you will pay for
| those tokens too, not just the final output tokens. Also be
| aware reasoning models can take a long time to return a
| response... which isn't great if you're trying to use an API
| for interactive coding.
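|
| A rough sketch of how that shows up in practice: for the o-series
| models, the usage object in the response breaks the completion
| tokens down further, and the reasoning tokens are billed at the
| output rate even though they never appear in the reply (field
| names below are as I recall them from the OpenAI Python SDK;
| treat them as an assumption if your SDK version differs):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|     resp = client.chat.completions.create(
|         model="o3-mini",
|         messages=[{"role": "user", "content": "Refactor this..."}],
|     )
|     u = resp.usage
|     # reasoning tokens count as completion (output) tokens
|     print(u.prompt_tokens, u.completion_tokens,
|           u.completion_tokens_details.reasoning_tokens)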
| ac29 wrote:
| > I don't understand why the comparison in the announcement
| talks so much about comparing with 4o's coding abilities to
| 4.1. Wouldn't the relevant comparison be to o3-mini-high?
|
| There are tons of comparisons to o3-mini-high in the linked
| article.
| Tiberium wrote:
| Very important note:
|
| >Note that GPT-4.1 will only be available via the API. In
| ChatGPT, many of the improvements in instruction following,
| coding, and intelligence have been gradually incorporated into
| the latest version
|
| If anyone here doesn't know, OpenAI _does_ offer the ChatGPT
| model version in the API as chatgpt-4o-latest, but it's bad
| because they continuously update it, so businesses can't rely
| on it being stable. That's why OpenAI made GPT-4.1.
| croemer wrote:
| So you're saying that "ChatGPT-4o-latest (2025-03-26)" in
| LMarena is 4.1?
| granzymes wrote:
| No, that is saying that some of the improvements that went
| into 4.1 have also gone into ChatGPT, including
| chatgpt-4o-latest (2025-03-26).
| pzo wrote:
| Yeah, I was surprised that in their benchmarks during the
| livestream they didn't compare to ChatGPT-4o (2025-03-26) but
| only to an older one.
| exizt88 wrote:
| > chatgpt-4o-latest, but it's bad because they continuously
| update it
|
| A version explicitly marked as "latest" being continuously
| updated? Crazy.
| sbarre wrote:
| No one's arguing that it's improperly labelled, but if you're
| going to use it via API, you _might_ want consistency over
| bleeding edge.
| IanCal wrote:
| Lots of the other models are checkpoint releases, and latest
| is a pointer to the latest checkpoint. Something being
| continuously updated is quite different and worth knowing
| about.
| rfw300 wrote:
| It can be both properly communicated and still bad for API
| use cases.
| minimaxir wrote:
| OpenAI (and most LLM providers) allow model version pinning for
| exactly this reason, e.g. in the case of GPT-4o you can specify
| gpt-4o-2024-05-13, gpt-4o-2024-08-06, or gpt-4o-2024-11-20.
|
| https://platform.openai.com/docs/models/gpt-4o
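|
| In practice, pinning is just a matter of passing the dated
| snapshot name instead of the floating alias; a minimal sketch
| with the OpenAI Python SDK (snapshot name taken from the docs
| linked above):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     # "gpt-4o" floats to whatever OpenAI currently aliases it to;
|     # the dated snapshot stays fixed until formally deprecated.
|     resp = client.chat.completions.create(
|         model="gpt-4o-2024-08-06",
|         messages=[{"role": "user", "content": "Say hello."}],
|     )
|     print(resp.choices[0].message.content)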
| Tiberium wrote:
| Yes, and they don't make snapshots for chatgpt-4o-latest, but
| they made them for GPT-4.1. That's why 4.1 is only useful via
| the API, since their ChatGPT product already has the better
| model.
| cootsnuck wrote:
| Okay, so is GPT-4.1 literally just the current
| chatgpt-4o-latest or not?
| ilaksh wrote:
| Yeah, in the last week, I had seen a strong benchmark for
| chatgpt-4o-latest and tried it for a client's use case. I ended
| up wasting like 4 days, because after my initial strong test
| results, in the following days, it gave results that were
| inconsistent and poor, and sometimes it just output spaces.
| flakiness wrote:
| Big focus on coding. It feels like a defensive move against
| Claude (and more recently, Gemini Pro), which became very
| popular in that arena. I guess they recently figured out some
| ways to train the model for this "agentic" coding through RL or
| something - and the finding was too new to apply to 4.5 in time.
| modeless wrote:
| Numbers for SWE-bench Verified, Aider Polyglot, cost per million
| output tokens, output tokens per second, and knowledge cutoff
| month/year:
|
|                    SWE   Aider   Cost   Fast   Fresh
|     Claude 3.7     70%   65%     $15    77     8/24
|     Gemini 2.5     64%   69%     $10    200    1/25
|     GPT-4.1        55%   53%     $8     169    6/24
|     DeepSeek R1    49%   57%     $2.2   22     7/24
|     Grok 3 Beta    ?     53%     $15    ?      11/24
|
| I'm not sure this is really an apples-to-apples comparison as it
| may involve different test scaffolding and levels of "thinking".
| Tokens per second numbers are from here:
| https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr...
| and I'm assuming 4.1 is the speed of 4o given the "latency" graph
| in the article putting them at the same latency.
|
| Is it available in Cursor yet?
| meetpateltech wrote:
| Yes, it is available in Cursor[1] and Windsurf[2] as well.
|
| [1] https://twitter.com/cursor_ai/status/1911835651810738406
|
| [2] https://twitter.com/windsurf_ai/status/1911833698825286142
| cellwebb wrote:
| And free on windsurf for a week! Vibe time.
| tomjen3 wrote:
| It's available for free in Windsurf so you can try it out there.
|
| Edit: Now also in Cursor
| jsnell wrote:
| https://aider.chat/docs/leaderboards/ shows 73% rather than 69%
| for Gemini 2.5 Pro?
|
| Looks like they also added the cost of the benchmark run to the
| leaderboard, which is quite cool. Cost per output token is no
| longer representative of the actual cost when the number of
| tokens can vary by an order of magnitude for the same problem
| just based on how many thinking tokens the model is told to
| use.
| modeless wrote:
| There are different scores reported by Google for "diff" and
| "whole" modes, and the others were "diff" so I chose the
| "diff" score. Hard to make a real apples-to-apples
| comparison.
| jsnell wrote:
| The 73% on the current leaderboard is using "diff", not
| "whole". (Well, diff-fenced, but the difference is just the
| location of the filename.)
| modeless wrote:
| Huh, seems like Aider made a special mode specifically
| for Gemini[1] some time after Google's announcement blog
| post with official performance numbers. Still not sure it
| makes sense to quote that new score next to the others.
| In any case Gemini's 69% is the top score even without a
| special mode.
|
| [1] https://aider.chat/docs/more/edit-formats.html#diff-
| fenced:~...
| jsnell wrote:
| The mode wasn't added after the announcement, Aider has
| had it for almost a year:
| https://aider.chat/HISTORY.html#aider-v0320
|
| This benchmark has an authoritative source of results
| (the leaderboard), so it seems obvious that it's the
| number that should be used.
| modeless wrote:
| OK but it was still added specifically to improve Gemini
| and nobody else on the leaderboard uses it. Google
| themselves do not use it when they benchmark their own
| models against others. They use the regular diff mode
| that everyone else uses.
| https://blog.google/technology/google-deepmind/gemini-
| model-...
| tcdent wrote:
| They just pick the best performer out of the built-in modes
| they offer.
|
| Interesting data point about the model's behavior, but even
| more so it's a recommendation of which way to configure the
| model for optimal performance.
|
| I do consider this to be an apples-to-apples benchmark since
| they're evaluating real-world performance.
| anotherpaulg wrote:
| Aider author here.
|
| Based on some DMs with the Gemini team, they weren't aware
| that aider supports a "diff-fenced" edit format. And that it
| is specifically tuned to work well with Gemini models. So
| they didn't think to try it when they ran the aider
| benchmarks internally.
|
| Beyond that, I spend significant energy tuning aider to work
| well with top models. That is in fact the entire reason for
| aider's benchmark suite: to quantitatively measure and
| improve how well aider works with LLMs.
|
| Aider makes various adjustments to how it prompts and
| interacts with most every top model, to provide the very best
| possible AI coding results.
| soheil wrote:
| Yes on both Cursor and Windsurf.
|
| https://twitter.com/cursor_ai/status/1911835651810738406
| msp26 wrote:
| I was hoping for native image gen in the API but better pricing
| is always appreciated.
|
| Gemini was drastically cheaper for image/video analysis, I'll
| have to see how 4.1 mini and nano compare.
| oofbaroomf wrote:
| I'm not really bullish on OpenAI. Why would they only compare
| with their own models? The only explanation could be that they
| aren't as competitive with other labs as they were before.
| greenavocado wrote:
| See figure 1 for up-to-date benchmarks
| https://github.com/KCORES/kcores-llm-arena
|
| (Direct Link) https://raw.githubusercontent.com/KCORES/kcores-
| llm-arena/re...
| poormathskills wrote:
| Go look at their past blog posts. OpenAI only ever benchmarks
| against their own models.
| oofbaroomf wrote:
| Oh, ok. But it's still quite telling of their attitude as an
| organization.
| rvnx wrote:
| It's the same organization that kept repeating that sharing
| weights of GPT would be "too dangerous for the world".
| Eventually DeepSeek thankfully did something like that,
| though they are supposed to be the evil guys.
| kcatskcolbdi wrote:
| I don't mind what they benchmark against as long as, when I use
| the model, it continues to give me better results than their
| competition.
| gizmodo59 wrote:
| Apple compares against its own products most of the time.
| asdev wrote:
| it's worse than 4.5 on nearly every benchmark. just an
| incremental improvement. AI is slowing down
| conradkay wrote:
| It's like 30x cheaper though. Probably just distilled 4.5
| GaggiX wrote:
| It's better on AIME '24, Multilingual MMLU, SWE-bench, Aider's
| polyglot, MMMU, ComplexFuncBench while being much much cheaper
| and smaller.
| asdev wrote:
| and it's worse on just as many benchmarks by a significant
| amount. as a consumer I don't care about cheapness, I want
| the maximum accuracy and performance
| GaggiX wrote:
| As a consumer you care about speed tho, and GPT-4.5 is
| extremely slow, at this point just use a reasoning model if
| you want the best of the best.
| simianwords wrote:
| Sorry what is the source for this?
| Nckpz wrote:
| They don't disclose parameter counts so it's hard to say
| exactly how far apart they are in terms of size, but based on
| the pricing it seems like a pretty wild comparison, with one
| being an attempt at an ultra-massive SOTA model and one being a
| model scaled down for efficiency and probably distilled from
| the big one. The way they're presented as version numbers is
| business nonsense which obscures a lot about what's going on.
| usaar333 wrote:
| Or OpenAI is? After using Gemini 2.5, I did not feel "AI is
| slowing down". It's just this model isn't SOTA.
| HDThoreaun wrote:
| Maybe progress is slowing down, but after using Gemini 2.5
| there is clearly still a lot being made.
| elashri wrote:
| Are there any benchmarks, or has anyone tested the performance
| of these long-max-token models in scenarios where you actually
| use most of the token limit?
|
| I found from my experience with Gemini models that after ~200k
| the quality drops and it basically doesn't keep track of
| things. But I don't have any numbers or a systematic study of
| this behavior.
|
| I think all providers who announce increased max token limits
| should address this, because I don't think it is useful to just
| say that the max allowed tokens are 1M when you basically
| cannot use anything near that in practice.
| gymbeaux wrote:
| I'm not optimistic. It's the Wild West and comparing models for
| one's specific use case is difficult, essentially impossible at
| scale.
| enginoid wrote:
| There are some benchmarks such as Fiction.LiveBench[0] that
| give an indication and the new Graphwalks approach looks super
| interesting.
|
| But I'd love to see one specifically for "meaningful coding."
| Coding has specific properties that are important such as
| variable tracking (following coreference chains) described in
| RULER[1]. This paper also cautions against Single-Needle-In-
| The-Haystack tests which I think the OpenAI one might be. You
| really need at least Multi-NIAH for it to tell you anything
| meaningful, which is what they've done for the Gemini models.
|
| I think something a bit more interpretable like `pass@1 rate
| for coding turns at 128k` would be so much more useful than "we
| have 1M context" (with the acknowledgement that good-enough
| performance is often domain dependent).
|
| [0] https://fiction.live/stories/Fiction-liveBench-
| Mar-25-2025/o...
|
| [1] https://arxiv.org/pdf/2404.06654
| jbentley1 wrote:
| https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...
|
| IMO this is the best long context benchmark. Hopefully they
| will run it for the new models soon. Needle-in-a-haystack is
| useless at this point. Llama 4 had perfect needle-in-a-haystack
| results but horrible real-world performance.
| kmeisthax wrote:
| The problem is that while you can train a model with the
| hyperparameter of "context size" set to 1M, there's very little
| 1M data to train on. Most of your model's ability to follow
| long context comes from the fact that it's trained on lots of
| (stolen) books; in fact I believe OpenAI just outright said _in
| court_ that they can 't do long context without training on
| books.
|
| Novels are usually measured in terms of words; and there's a
| rule of thumb that four tokens make up about three words. So
| that 200k token wall you're hitting is right when most authors
| stop writing. 150k is _already_ considered long for a novel,
| and to train 1M properly, you'd need not only a 750k book, but
| many of them. Humans just don't write or read that much text at
| once.
|
| To get around this, whoever is training these models would need
| to change their training strategy to either:
|
| - Group books in a series together as a single, very long text
| to be trained on
|
| - Train on multiple unrelated books at once in the same context
| window
|
| - Amplify the gradients by the length of the text being trained
| on so that the fewer long texts that do exist have greater
| influence on the model weights as a whole.
|
| I _suspect_ they're doing #2, just to get _some_ gradients
| onto the longer end of the context window, but that also is
| going to diminish long-context reasoning because there's no
| reason for the model to develop a connection between, say,
| token 32 and token 985,234.
| nneonneo wrote:
| I mean, can't they just train on some huge codebases? There's
| lots of 100KLOC codebases out there which would probably get
| close to 1M tokens.
| roflmaostc wrote:
| What about old books? Wikipedia? Law texts? Programming
| languages documentations?
|
| How many tokens is a 100 pages PDF? 10k to 100k?
| arvindh-manian wrote:
| For reference, I think a common approximation is one token
| being 0.75 words.
|
| For a 100 page book, that translates to around 50,000
| tokens. For 1 mil+ tokens, we need to be looking at 2000+
| page books. That's pretty rare, even for documentation.
|
| It doesn't have to be text-based, though. I could see films
| and TV shows becoming increasingly important for long-
| context model training.
| handfuloflight wrote:
| What about the role of synthetic data?
| throwup238 wrote:
| Synthetic data requires a discriminator that can select
| the highest quality results to feed back into training.
| Training a discriminator is easier than a full blown LLM,
| but it still suffers from a lack of high quality training
| data in the case of 1M context windows. How do you train
| a discriminator to select good 2,000 page synthetic books
| if the only ones you have to train it with are Proust and
| concatenated Harry Potter/Game of Thrones/etc.?
| jjmarr wrote:
| Wikipedia does not have many pages that are 750k words.
| According to Special:LongPages[1], the longest page _right
| now_ is a little under 750k bytes.
|
| https://en.wikipedia.org/wiki/List_of_chiropterans
|
| Despite listing all presently known bats, the majority of
| "list of chiropterans" byte count is code that generates
| references to the IUCN Red List, not actual text. Most of
| Wikipedia's longest articles are code.
|
| [1] https://en.wikipedia.org/wiki/Special:LongPages
| crimsoneer wrote:
| Isn't the problem more that the "needle in a haystack" eval
| ("I said word X once; where?") is really not relevant to most
| long-context LLM use cases like code, where you need the
| context from all the stuff simultaneously rather than
| identifying a single, quite separate relevant section?
| wskish wrote:
| codebases of high quality open source projects and their
| major dependencies are probably another good source. also:
| "transformative fair use", not "stolen"
| omneity wrote:
| I'm not sure to what extent this opinion is accurately
| informed. It is well known that nobody trains on 1M-token-long
| content. It wouldn't work anyway, as the dependencies are too
| far apart and you end up with vanishing gradients.
|
| RoPE (Rotary Positional Embeddings; think modulo or periodic
| arithmetic) scaling is key, whereby the model is trained on
| 16k-token-long content and then scaled up to 100k+ [0]. Qwen 1M
| (which has near-perfect recall over the complete window [1])
| and Llama 4 10M pushed the limits of this technique, with Qwen
| reliably training with a much higher RoPE base, and Llama 4
| coming up with iRoPE, which claims scaling to extremely long
| contexts, up to infinity.
|
| [0]: https://arxiv.org/html/2310.05209v2
|
| [1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-
| retriev...
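|
| For intuition, a minimal numpy sketch of the standard RoPE
| frequency schedule (the dimensions and base values here are
| illustrative, not the ones any particular model uses): raising
| the base stretches every wavelength, so far-apart positions
| still map to distinguishable rotations.
|
|     import numpy as np
|
|     def rope_inv_freq(dim, base):
|         # one inverse frequency per pair of embedding dimensions
|         return base ** (-np.arange(0, dim, 2) / dim)
|
|     def rotation_angles(positions, dim, base):
|         # rotation angle for each dim-pair at each position
|         return np.outer(positions, rope_inv_freq(dim, base))
|
|     pos = np.array([0, 16_000, 100_000])
|     short = rotation_angles(pos, dim=128, base=10_000)
|     long_ = rotation_angles(pos, dim=128, base=1_000_000)
|     # with the larger base, the slowest-rotating dims have barely
|     # moved even at position 100k, which is what lets a model
|     # trained on ~16k-token windows extrapolate to longer ones
|     print(short[:, -1], long_[:, -1])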
| christianqchung wrote:
| But Llama 4 Scout does badly on long context benchmarks
| despite claiming 10M. It scores 1 slot above Llama 3.1 8B
| in this one[1].
|
| [1] https://github.com/adobe-research/NoLiMa
| omneity wrote:
| Indeed, but it does not take away the fact that long
| context is not trained through long content but by
| scaling short content instead.
| kmeisthax wrote:
| Is there any evidence that GPT-4.1 is using RoPE to scale
| context?
|
| Also, I don't know about Qwen, but I know Llama 4 has
| severe performance issues, so I wouldn't use that as an
| example.
| omneity wrote:
| I am not sure about public evidence. But the memory
| requirements alone to train on 1M long windows would make
| it a very unrealistic proposition compared to RoPE
| scaling. And as I mentioned RoPE is essential for long
| context anyway. You can't train it in the "normal way".
| Please see the paper I linked previously for more context
| (pun not intended) on RoPE.
|
| Re: Llama 4, please see the sibling comment.
| daemonologist wrote:
| I ran NoLiMa on Quasar Alpha (GPT-4.1's stealth mode):
| https://news.ycombinator.com/item?id=43640166#43640790
|
| Updated results from the authors: https://github.com/adobe-
| research/NoLiMa
|
| It's the best known performer on this benchmark, but still
| falls off quickly at even relatively modest context lengths
| (85% perf at 16K). (Cutting edge reasoning models like Gemini
| 2.5 Pro haven't been evaluated due to their cost and might
| outperform it.)
| soheil wrote:
| Main takeaways:
|
| - Coding accuracy improved dramatically
|
| - Handles 1M-token context reliably
|
| - Much stronger instruction following
| theturtletalks wrote:
| With these being 1M context size, does that all but confirm that
| Quasar Alpha and Optimus Alpha were cloaked OpenAI models on
| OpenRouter?
| atemerev wrote:
| Yes, confirmed by citing Aider benchmarks:
| https://openai.com/index/gpt-4-1/
|
| Which means that these models are _absolutely_ not SOTA, and
| Gemini 2.5 pro is much better, and Sonnet is better, and even
| R1 is better.
|
| Sorry Sam, you are losing the game.
| Tinkeringz wrote:
| Aren't all of these reasoning models?
|
| Won't the reasoning models of OpenAI benchmarked against
| these be a test of whether Sam is losing?
| atemerev wrote:
| There is no OpenAI model better than R1, reasoning or not
| (as confirmed by the same Aider benchmark; non-coding tests
| are less objective, but I think it still holds).
|
| With Gemini (current SOTA) and Sonnet (great potential, but
| tends to overengineer/overdo things) it is debatable, they
| are probably better than R1 (and all OpenAI models by
| extension).
| arvindh-manian wrote:
| I think Quasar is fairly confirmed [0] to be OpenAI.
|
| [0] https://x.com/OpenAI/status/1911782243640754634
| phoe18 wrote:
| Yes, OpenRouter confirmed it here -
| https://x.com/OpenRouterAI/status/1911833662464864452
| jmkni wrote:
| The increased context length is interesting.
|
| It would be incredible to be able to feed an entire codebase into
| a model and say "add this feature" or "we're having a bug where X
| is happening, tell me why", but then you are limited by the
| output token length
|
| As others have pointed out, the more tokens you use, the less
| accuracy you get and the more it gets confused; I've noticed
| this too.
|
| We are a ways away yet from being able to input an entire
| codebase, and have it give you back an updated version of that
| codebase.
| impure wrote:
| I like how Nano matches Gemini 2.0 Flash's price. That will help
| drive down prices which will be good for my app. However I don't
| like how Nano behaves worse than 4o Mini in some benchmarks.
| Maybe it will be good enough, we'll see.
| chaos_emergent wrote:
| Theory here is that 4.1-nano is competing with that tier, 4.1
| with flash-thinking (although likely to do significantly
| worse), and o4-mini or o3-large will compete with 2.5 thinking
| pzo wrote:
| Yeah, and consider that Gemini 2.0 Flash is much better than
| 4o-mini. On top of that, Gemini also has audio input as a
| modality and a realtime API for both audio input and output,
| plus web search grounding and a free tier.
| pcwelder wrote:
| Can someone explain to me why we should take Aider's polyglot
| benchmark seriously?
|
| All the solutions are already available on the internet on which
| various models are trained, albeit in various ratios.
|
| Any variance could likely be due to the mix of the data.
| meroes wrote:
| To join in the faux rigor?
| philipbjorge wrote:
| If you're looking to test an LLM's ability to solve a coding
| task without prior knowledge of the task at hand, I don't think
| their benchmark is super useful.
|
| If you care about understanding relative performance between
| models for solving known problems and producing correct output
| format, it's pretty useful:
|
| - Even for well-known problems, we see a large distribution of
| quality between models (5 to 75% correctness)
|
| - Additionally, we see a large distribution of models' ability
| to produce responses in the formats they were instructed in
|
| At the end of the day, benchmarks are pretty fuzzy, but I
| always welcome a formalized benchmark as a means to understand
| model performance over vibe checking.
| asdev wrote:
| > We will also begin deprecating GPT-4.5 Preview in the API, as
| GPT-4.1 offers improved or similar performance on many key
| capabilities at much lower cost and latency.
|
| why would they deprecate when it's the better model? too
| expensive?
| ComputerGuru wrote:
| > why would they deprecate when it's the better model? too
| expensive?
|
| Too expensive, but not for them - for their customers. The only
| reason they'd deprecate it is if it wasn't seeing usage worth
| keeping it up, and that probably stems from it being insanely
| more expensive and slower than everything else.
| tootyskooty wrote:
| It sits on too many GPUs; they mentioned it during the stream.
|
| I'm guessing the (API) demand isn't there to saturate them
| fully.
| simianwords wrote:
| Where did you find that 4.5 is a better model? Everything from
| the video told me that 4.5 was largely a mistake and 4.1 beats
| 4.5 at everything. There's no point keeping 4.5 at this point.
| rob wrote:
| Bigger numbers are supposed to mean better. 3.5, 4, 4.5.
| Going from 4 to 4.5 to 4.1 seems weird to most people. If
| it's better, it should have been GPT-4.6 or 5.0 or something
| else, not a downgraded number.
| HDThoreaun wrote:
| OpenAI has decided to troll via crappy naming conventions
| as a sort of in joke. Sam Altman tweets about it pretty
| often
| taikahessu wrote:
| > They feature a refreshed knowledge cutoff of June 2024.
|
| As opposed to Gemini 2.5 Pro having cutoff of Jan 2025.
|
| Honestly this feels underwhelming and surprising. Especially if
| you're coding with frameworks with breaking changes, this can
| hurt you.
| forbiddenvoid wrote:
| It's definitely an issue. Even the simplest use case of "create
| React app with Vite and Tailwind" is broken with these models
| right now because they're not up to date.
| asadm wrote:
| Enabling "Search" sometimes fixes it, as they fetch
| the newer methods.
| lukev wrote:
| Time to start moving back to Java & Spring.
|
| 100% backwards compatibility and well represented in 15 years
| worth of training data, hah.
| speedgoose wrote:
| Write once, run nowhere.
| Zambyte wrote:
| By "broken" you mean it doesn't use the latest and greatest
| hot trend, right? Or does it literally not work?
| dbbk wrote:
| Periodically I keep trying these coding models in Copilot
| and I have yet to have an experience where it produced
| working code with a pretty straightforward TypeScript
| codebase. Specifically, it cannot for the life of it
| produce working Drizzle code. It will hallucinate methods
| that don't exist despite throwing bright red type errors.
| Does it even check for TS errors?
| dalmo3 wrote:
| Not sure about Copilot, but the Cursor agent runs both
| eslint and tsc by default and fixes the errors
| automatically. You can tell it to run tests too, and
| whatever other tools. I've had a good experience writing
| drizzle schemas with it.
| taikahessu wrote:
| It has been really frustrating learning Godot 4.4.x (or any
| new technology you are not familiar with) with GPT-4o or, even
| worse, with custom GPTs which use the older GPT-4 Turbo.
|
| As you are new in the field, it kinda doesn't make sense to
| pick an older version. It would be better if there was no
| data than incorrect data. You literally have to include the
| version number on every prompt and even that doesn't
| guarantee a right result! Sometimes I have to play truth or
| dare three times before we finally find the right names and
| instructions. Yes I have the version info on all custom
| information dialogs, but it is not as effective as
| including it in the prompt itself.
|
| Searching the web feels like an on-going "I'm feeling
| lucky" mode. Anyway, I still happen to get some real
| insights from GPT4o, even though Gemini 2.5 Pro has proven
| far superior for larger and more difficult contexts /
| problems.
|
| The best storytelling ideas have come from GPT 4.5. Looking
| forward to testing this new 4.1 as well.
| jonfw wrote:
| hey- curious what your experience has been like learning
| godot w/ LLM tooling.
|
| are you doing 3d? The 3D tutorial ecosystem is very GUI
| heavy and I have had major problems trying to get godot
| to do anything 3D
| yokto wrote:
| Whenever an LLM struggles with a particular library version,
| I use Cursor Rules to auto-include migration information and
| that generally worked well enough in my cases.
| tengbretson wrote:
| A few weeks back I couldn't even get ChatGPT to output
| TypeScript code that correctly used the OpenAI SDK.
| seuros wrote:
| You should give it documentation; it can't guess.
| alangibson wrote:
| Try getting then to output Svelte 5 code...
| division_by_0 wrote:
| Svelte 5 is the antidote to vibe coding.
| int_19h wrote:
| Maybe LLMs will be the forcing function to finally slow down
| the crazy pace of changing (and breaking) things in
| JavaScript land.
| TIPSIO wrote:
| It is annoying. The bigger, cheaper context windows help this a
| little though:
|
| E.g.: If context windows get big and cheap enough (as things
| are trending), hopefully you can just dump the entire docs,
| examples, and more in every request.
| j_maffe wrote:
| OAI are so ahead of the competition, they don't need to compare
| with the competition anymore /s
| neal_ wrote:
| hahahahaha
| forbiddenvoid wrote:
| Lots of improvements here (hopefully), but still no image
| generation updates, which is what I'm most eager for right now.
| taikahessu wrote:
| Or text to speech generation ... but I guess that is coming.
| dharmab wrote:
| Yeah, I tried the 4o models and they severely mispronounced
| common words and read numbers incorrectly (eg reading 16000
| as 1600)
| Tinkeringz wrote:
| They just released new image generation a couple of weeks
| ago; why are you eager for another one so soon?
| nanook wrote:
| Are the image generation improvements available via API?
| Don't think so
| ComputerGuru wrote:
| The benchmarks and charts they have up are frustrating because
| they don't include o3-mini(-high), which they've been pushing as
| the low-latency+low-cost smart model to use for coding challenges
| instead of 4o and 4o-mini. Why won't they include that in the
| charts?
| marsh_mellow wrote:
| From OpenAI's announcement:
|
| > Qodo tested GPT-4.1 head-to-head against Claude Sonnet 3.7 on
| generating high-quality code reviews from GitHub pull requests.
| Across 200 real-world pull requests with the same prompts and
| conditions, they found that GPT-4.1 produced the better
| suggestion in 55% of cases. Notably, they found that GPT-4.1
| excels at both precision (knowing when not to make suggestions)
| and comprehensiveness (providing thorough analysis when
| warranted).
|
| https://www.qodo.ai/blog/benchmarked-gpt-4-1/
| arvindh-manian wrote:
| Interesting link. Worth noting that the pull requests were
| judged by o3-mini. Further, I'm not sure that 55% vs 45% is a
| huge difference.
| marsh_mellow wrote:
| Good point. They said they validated the results by testing
| with other models (including Claude), as well as with manual
| sanity checks.
|
| 55% to 45% definitely isn't a blowout but it is meaningful --
| in terms of ELO it equates to about a 36 point difference. So
| not in a different league but definitely a clear edge
| InkCanon wrote:
| >4.1 Was better in 55% of cases
|
| Um, isn't that just a fancy way of saying it is slightly better
|
| >Score of 6.81 against 6.66
|
| So very slightly better
| kevmo314 wrote:
| A great way to upsell 2% better! I should start doing that.
| neuroelectron wrote:
| Good marketing if you're selling a discount all purpose
| cleaner, not so much for an API.
| marsh_mellow wrote:
| I don't think the absolute score means much -- judge models
| have a tendency to score around 7/10 lol
|
| 55% vs. 45% equates to about a 36 point difference in ELO. in
| chess that would be two players in the same league but one
| with a clear edge
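|
| For the curious: that Elo figure can be reproduced by inverting the
| Elo expected-score formula. A minimal sketch in plain Python, where
| the 0.55 win rate is the only input and nothing benchmark-specific
| is assumed:
|
|     import math
|
|     def elo_gap(win_rate: float) -> float:
|         # Invert E = 1 / (1 + 10 ** (-d / 400)) to get the rating
|         # gap d implied by a head-to-head expected score E.
|         return -400 * math.log10(1 / win_rate - 1)
|
|     print(round(elo_gap(0.55), 1))  # ~34.9, in the same ballpark as above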
| kevmo314 wrote:
| Rarely are two models put head-to-head though. If Claude
| Sonnet 3.7 isn't able to generate a good PR review (for
| whatever reason), a 2% better review isn't all that strong
| of a value proposition.
| swyx wrote:
| the point is oai is saying they have a viable Claude
| Sonnet competitor now
| wiz21c wrote:
| "they found that GPT-4.1 excels at both precision..."
|
| They didn't say it is better than Claude at precision etc.
| Just that it excels.
|
| Unfortunately, AI has still not concluded that manipulation
| by the marketing dept is a plague...
| jsnell wrote:
| That's not a lot of samples for such a small effect, I don't
| think it's statistically significant (p-value of around 10%).
| swyx wrote:
| is there a shorthand/heuristic to calculate pvalue given n
| samples and effect size?
| marsh_mellow wrote:
| p-value of 7.9% -- so very close to statistical significance.
|
| the p-value for GPT-4.1 having a win rate of at least 49% is
| 4.92%, so we can say conclusively that GPT-4.1 is at least
| (essentially) evenly matched with Claude Sonnet 3.7, if not
| better.
|
| Given that Claude Sonnet 3.7 has been generally considered to
| be the best (non-reasoning) model for coding, and given that
| GPT-4.1 is substantially cheaper ($2/million input,
| $8/million output vs. $3/million input, $15/million output),
| I think it's safe to say that this is significant news,
| although not a game changer
| jsnell wrote:
| I make it 8.9% with a binomial test[0]. I rounded that to
| 10%, because any more precision than that was not
| justified.
|
| Specifically, the results from the blog post are
| impossible: with 200 samples, you can't possibly have the
| claimed 54.9/45.1 split of binary outcomes. Either they
| didn't actually make 200 tests but some other number, they
| didn't actually get the results they reported, or they did
| some kind of undocumented data munging like excluding all
| tied results. In any case, the uncertainty about the input
| data is larger than the uncertainty from the rounding.
|
| [0] In R, binom.test(110, 200, 0.5, alternative="greater")
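|
| For anyone without R handy, a rough equivalent in Python (assumes
| scipy is installed; 110/200 is the closest integer split to the
| reported 54.9%):
|
|     from scipy.stats import binomtest
|
|     # One-sided test: how likely is a split of at least 110/200
|     # if the two models were actually evenly matched?
|     result = binomtest(110, n=200, p=0.5, alternative="greater")
|     print(result.pvalue)  # ~0.089, i.e. the ~8.9% above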
| simianwords wrote:
| Could any one guess the reason as to why they didn't ship this in
| the chat UI?
| KoolKat23 wrote:
| The memory thing? More resource intensive?
| nikcub wrote:
| Easy to miss in the announcement that 4.5 is being shut down
|
| > GPT-4.5 Preview will be turned off in three months, on July 14,
| 2025
| OxfordOutlander wrote:
| Juice not worth the squeeze I imagine. 4.5 is chonky, and
| having to reserve GPU space for it must not have been worth it.
| Makes sense to me - I hadn't found anything it was so much
| better at that it was worth the incremental cost over Sonnet
| 3.7 or o3-mini.
| pcwelder wrote:
| Did some quick tests. I believe it's the same model as Quasar. It
| struggles with the agentic loop [1]. You'd have to force it to do
| tool calls.
|
| Tool use ability feels better than gemini-2.5-pro-exp [2],
| which struggles with JSON schema understanding sometimes.
|
| Llama 4 has surprising agentic capabilities, better than both of
| them [3], but isn't as intelligent as the others.
|
| [1]
| https://github.com/rusiaaman/chat.md/blob/main/samples/4.1/t...
|
| [2]
| https://github.com/rusiaaman/chat.md/blob/main/samples/gemin...
|
| [3]
| https://github.com/rusiaaman/chat.md/blob/main/samples/llama...
| ludwik wrote:
| Correct. They've mentioned the name during the live
| announcement - https://www.youtube.com/live/kA-P9ood-
| cE?si=GYosi4FtX1YSAujE...
| simonw wrote:
| Here's a summary of this Hacker News thread created by GPT-4.1
| (the full sized model) when the conversation hit 164 comments:
| https://gist.github.com/simonw/93b2a67a54667ac46a247e7c5a2fe...
|
| I think it did very well - it's clearly good at instruction
| following.
|
| Total token cost: 11,758 input, 2,743 output = 4.546 cents.
|
| Same experiment run with GPT-4.1 mini:
| https://gist.github.com/simonw/325e6e5e63d449cc5394e92b8f2a3...
| (0.8802 cents)
|
| And GPT-4.1 nano:
| https://gist.github.com/simonw/1d19f034edf285a788245b7b08734...
| (0.2018 cents)
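|
| Those costs are easy to sanity-check against the $2 / 1M input and
| $8 / 1M output pricing for the full-size model mentioned elsewhere
| in the thread (quick sketch; the mini and nano runs would use their
| own, lower rates):
|
|     input_tokens, output_tokens = 11_758, 2_743
|     price_in, price_out = 2.00, 8.00  # USD per million tokens, GPT-4.1
|
|     cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
|     print(f"{cost * 100:.3f} cents")  # 4.546 cents, matching the figure above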
| swyx wrote:
| don't miss that OAI also published a prompting guide WITH
| RECEIPTS for GPT 4.1 specifically for those building agents...
| with a new recommendation for:
|
| - telling the model to be persistent (+20%)
|
| - dont self-inject/parse toolcalls (+2%)
|
| - prompted planning (+4%)
|
| - JSON BAD - use XML or arxiv 2406.13121 (GDM format)
|
| - put instructions + user query at TOP -and- BOTTOM - bottom-only
| is VERY BAD (see the sketch after the source link below)
|
| - no evidence that ALL CAPS or Bribes or Tips or threats to
| grandma work
|
| source:
| https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...
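|
| a minimal sketch of the "instructions at top AND bottom" layout from
| the guide, using the Python SDK (the system text, document and
| question are made-up placeholders, not examples from the cookbook):
|
|     from openai import OpenAI
|
|     INSTRUCTIONS = "You are a careful analyst. Answer only from the document."
|     long_document = "..."  # tens of thousands of tokens of context
|     question = "What changed between v1 and v2?"
|
|     client = OpenAI()
|     response = client.chat.completions.create(
|         model="gpt-4.1",
|         messages=[
|             # instructions at the top
|             {"role": "system", "content": INSTRUCTIONS},
|             # long context in the middle
|             {"role": "user", "content": long_document},
|             # instructions repeated at the bottom, next to the query
|             {"role": "user", "content": INSTRUCTIONS + "\n\n" + question},
|         ],
|     )
|     print(response.choices[0].message.content)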
| simonw wrote:
| I'm surprised and a little disappointed by the result
| concerning instructions at the top, because it's incompatible
| with prompt caching: I would much rather cache the part of the
| prompt that includes the long document and then swap out the
| user question at the end.
| swyx wrote:
| yep. we address it in the podcast. presumably this is just a
| recent discovery and can be post-trained away.
| aoeusnth1 wrote:
| If you're skimming a text to answer a specific question,
| you can go a lot faster than if you have to memorize the
| text well enough to answer an unknown question after the
| fact.
| zaptrem wrote:
| Prompt on bottom is also easier for humans to read as I can
| have my actual question and the model's answer on screen at
| the same time instead of scrolling through 70k tokens of
| context between them.
| mmoskal wrote:
| The way I understand it: if the instructions are at the top,
| the KV entries computed for "content" can be influenced by
| the instructions - the model can "focus" on what you're
| asking it to do and perform some computation, while it's
| "reading" the content. Otherwise, you're completely relaying
| on attention to find the information in the content, leaving
| it much less token space to "think".
| swyx wrote:
| references for all the above + added more notes here on pricing
| https://x.com/swyx/status/1911849229188022278
|
| and we'll be publishing our 4.1 pod later today
| https://www.youtube.com/@latentspacepod
| pton_xd wrote:
| As an aside, one of the worst aspects of the rise of LLMs, for
| me, has been the wholesale replacement of engineering with
| trial-and-error hand-waving. Try this, or maybe that, and maybe
| you'll see a +5% improvement. Why? Who knows.
|
| It's just not how I like to work.
| pclmulqdq wrote:
| Software engineering has involved a lot of people doing
| trial-and-error hand-waving for at least a decade. We are now
| codifying the trend.
| zoogeny wrote:
| I think trial-and-error hand-waving isn't all that far from
| experimentation.
|
| As an aside, I was working in the games industry when multi-
| core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the
| exact consoles but there was one generation where the major
| platforms all went multi-core.
|
| No one knew how to best use the multi-core systems for
| gaming. I attended numerous tech talks by teams that had
| tried different approaches and gave similar "maybe do
| this and maybe see x% improvement?". There was a lot of
| experimentation. It took a few years before things settled
| and best practices became even somewhat standardized.
|
| Some people found that era frustrating and didn't like to
| work in that way. Others loved the fact it was a wide open
| field of study where they could discover things.
| jorvi wrote:
| Yes, it was the generation of the X360 and PS3. X360 was
| triple core and the PS3 was 1+7 core (sort of a big.little
| setup).
|
| Although it took many, many more years until games started
| to actually use multi-core properly. With rendering being
| on a 16.67ms / 8.33ms budget and rendering tied to world
| state, it was just really hard to not tie everything into
| each other.
|
| Even today you'll usually only see 2-4 cores actually
| getting significant load.
| brokencode wrote:
| Out of curiosity, what do you work on where you don't have to
| experiment with different solutions to see what works best?
| FridgeSeal wrote:
| Usually when we're doing it in practice there's _somewhat_
| more awareness of the mechanics than just throwing random
| obstructions in and hoping for the best.
| RussianCow wrote:
| LLMs are still very young. We'll get there in time. I
| don't see how it's any different than optimizing for new
| CPU/GPU architectures other than the fact that the latter
| is now a decades-old practice.
| girvo wrote:
| > I don't see how it's any different than optimizing for
| new CPU/GPU architectures
|
| I mean that seems wild to say to me. Those architectures
| have documentation and aren't magic black boxes that we
| chuck inputs at and hope for the best: we do pretty much
| that with LLMs.
|
| If that's how you optimise, I'm genuinely shocked.
| greenchair wrote:
| most people are building straightforward crud apps. no
| experimentation required.
| RussianCow wrote:
| [citation needed]
|
| In my experience, even simple CRUD apps generally have
| some domain-specific intricacies or edge cases that take
| some amount of experimentation to get right.
| brokencode wrote:
| Idk, it feels like this is what you'd expect versus the
| actual reality of building something.
|
| From my experience, even building on popular platforms,
| there are many bugs or poorly documented behaviors in
| core controls or APIs.
|
| And performance issues in particular can be difficult to
| fix without trial and error.
| kitsunemax wrote:
| I feel like this is a common pattern with people who work in
| STEM. As someone who is used to working with formal proofs,
| equations, math, having a startup taught me how to rewire
| myself to work with the unknowns, imperfect solutions, messy
| details. I'm going on a tangent, but just wanted to share.
| behnamoh wrote:
| > - JSON BAD - use XML or arxiv 2406.13121 (GDM format)
|
| And yet, all function calling and MCP is done through JSON...
| CSMastermind wrote:
| Yeah anyone who has worked with these models knows how much
| they struggle with JSON inputs.
| swyx wrote:
| JSON is just MCP's transport layer. you can reformat to xml
| to pass into model
| Havoc wrote:
| >- dont self-inject/parse toolcalls (+2%)
|
| What is meant by this?
| intalentive wrote:
| Use the OpenAI API/SDK for function calling instead of
| rolling your own inside the prompt.
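|
| i.e. declare the tool schema through the API's tools parameter and
| let the model return a structured tool call, rather than pasting the
| schema into the prompt and parsing the reply yourself. A rough sketch
| with the Python SDK (get_weather is a made-up example tool):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     response = client.chat.completions.create(
|         model="gpt-4.1",
|         messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
|         tools=[{
|             "type": "function",
|             "function": {
|                 "name": "get_weather",  # hypothetical tool
|                 "description": "Look up current weather for a city",
|                 "parameters": {
|                     "type": "object",
|                     "properties": {"city": {"type": "string"}},
|                     "required": ["city"],
|                 },
|             },
|         }],
|     )
|     # Structured tool call instead of free text to parse by hand.
|     print(response.choices[0].message.tool_calls)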
| minimaxir wrote:
| > no evidence that ALL CAPS or Bribes or Tips or threats to
| grandma work
|
| Challenge accepted.
|
| That said, the exact quote from the linked notebook is "It's
| generally not necessary to use all-caps or other incentives
| like bribes or tips, but developers can experiment with this
| for extra emphasis if so desired.", but the demo examples
| OpenAI provides do like using ALL CAPS.
| kristianp wrote:
| The size of that SWE-bench Verified prompt shows how much work
| has gone into the prompt to get the highest possible score for
| that model. A third party might go to a model from a different
| provider before going to that extent of fine-tuning of the
| prompt.
| frognumber wrote:
| Marginally on-topic: I'd love if the charts included prior
| models, including GPT 4 and 3.5.
|
| Not all systems upgrade every few months. A major question is
| when we reach step-improvements in performance warranting a re-
| eval, redesign of prompts, etc.
|
| There's a small bleeding edge, and a much larger number of
| followers.
| bartkappenburg wrote:
| By leaving out the scale or prior models they are effectively
| manipulating the perception of improvement. If going from 3 to 4 took
| the score from 10 to 80, and from 4 to 4o it went from 80 to 82,
| leaving out 3 would show a steep line instead of a steep decrease in
| growth.
|
| Lies, damn lies and statistics ;-)
| growt wrote:
| My theory: they need to move off the 4o version number before
| releasing o4-mini next week or so.
| kgeist wrote:
| The 'oN' schema was such a strange choice for branding. They
| had to skip 'o2' because it's already trademarked, and now 'o4'
| can easily be confused with '4o'.
| neal_ wrote:
| The better the benchmarks, the worse the model is. Subjectively,
| for me the more advanced models don't follow instructions, and are
| less capable of implementing features or building stuff. I could
| not tell a difference in blind testing SOTA models from gemini,
| claude, openai, deepseek. There have been no major improvements in
| the LLM space since the original models gained popularity. Each
| release claims to be much better than the last, and every time I
| have been disappointed and think this is worse.
|
| First it was that the models stopped putting in effort and felt lazy:
| tell them to do something and they will tell you to do it yourself.
| Now it's the opposite and the models go ham changing everything
| they see; instead of changing one line, SOTA models would rather
| rewrite the whole project and still not fix the issue.
|
| Two years back I totally thought these models were amazing. I
| always would test out the newest models and would get hyped up
| about it. For every problem I had, I thought if I just prompted it
| differently I could get it to solve it. Often times I have spent
| hours prompting, starting new chats, adding more context. Now I
| realize it's kinda useless and it's better to just accept the
| models where they are, rather than try to make them a one-stop
| shop, or try to stretch their capabilities.
|
| I think this release I won't even test it out; I'm not interested
| anymore. I'll probably just continue using deepseek free and
| gemini free. I canceled my openai subscription like 6 months ago,
| and canceled claude after the 3.7 disappointment.
| T3uZr5Fg wrote:
| While impressive that the assistants can use dynamic tools and
| reason about images, I'm most excited about the improvements to
| factual accuracy and instruction following. The RAG capabilities
| with cross-validation seem particularly useful for serious
| development work, not just toy demos.
| 999900000999 wrote:
| Have they implemented "I don't know" yet?
|
| I probably spend $100 a month on AI coding, and it's great at
| small straightforward tasks.
|
| Drop it into a larger codebase and it'll get confused. Even if
| the same tool built it in the first place due to context limits.
|
| Then again, the way things are rapidly improving I suspect I can
| wait 6 months and they'll have a model that can do what I want.
| cheschire wrote:
| I wonder if documentation would help to create a carefully and
| intentionally tokenized overview of the system. Maximize the
| amount of routine larger scope information provided in minimal
| tokens in order to leave room for more immediate context.
|
| Similar to the function documentation provides to developers
| today, I suppose.
| yokto wrote:
| It does, shockingly well in my experience. Check out this
| blog post outlining such an approach, called Literate
| Development by the author:
| https://news.ycombinator.com/item?id=43524673
| vinhnx wrote:
| * Flagship GPT-4.1: top-tier intelligence, full endpoints &
| premium features
|
| * GPT-4.1-mini: balances performance, speed & cost
|
| * GPT-4.1-nano: prioritizes throughput & low cost with
| streamlined capabilities
|
| All share a 1 million-token context window (vs 128k-200k on
| 4o/o3-mini/o1), excelling in instruction following, tool calls &
| coding.
|
| Benchmarks vs prior models:
|
| * AIME '24: 48.1% vs 13.1% (~3.7x gain)
|
| * MMLU: 90.2% vs 85.7% (+4.5 pp)
|
| * Video-MME: 72.0% vs 65.3% (+6.7 pp)
|
| * SWE-bench Verified: 54.6% vs 33.2% (+21.4 pp)
| comex wrote:
| Sam Altman wrote in February that GPT-4.5 would be "our last non-
| chain-of-thought model" [1], but GPT-4.1 also does not have
| internal chain-of-thought [2].
|
| It seems like OpenAI keeps changing its plans. Deprecating
| GPT-4.5 less than 2 months after introducing it also seems
| unlikely to be the original plan. Changing plans isn't necessarily
| a bad thing, but I wonder why.
|
| Did they not expect this model to turn out as well as it did?
|
| [1] https://x.com/sama/status/1889755723078443244
|
| [2] https://github.com/openai/openai-
| cookbook/blob/6a47d53c967a0...
| wongarsu wrote:
| Maybe that's why they named this model 4.1, despite coming out
| after 4.5 and supposedly outperforming it. They can pretend
| GPT-4.5 is the last non-chain-of-thought model by just giving
| all non-chain-of-thought models version numbers below 4.5
| chrisweekly wrote:
| Ok, I know naming things is hard, but 4.1 comes out after
| 4.5? Just, wat.
| CamperBob2 wrote:
| For a long time, you could fool models with questions like
| "Which is greater, 4.10 or 4.5?" Maybe they're still
| struggling with that at OpenAI.
| ben_w wrote:
| At this point, I'm just assuming most AI models -- not
| just OpenAI's -- name themselves. And that they write
| their own press releases.
| Cheer2171 wrote:
| Why do you expect to believe a single word Sam Altman says?
| sigmoid10 wrote:
| Everyone assumed malice when the board fired him for not
| always being "candid" - but it seems more and more that he's
| just clueless. He's definitely capable when it comes to
| raising money as a business, but I wouldn't count on any tech
| opinion from him.
| observationist wrote:
| Anyone making claims with a horizon beyond two months about
| structure or capabilities will be wrong - it's sama's job to
| show confidence and vision and calm stakeholders, but if you're
| paying attention to the field, the release and research cycles
| are still contracting, with no sense of slowing any time soon.
| I've followed AI research daily since GPT-2, the momentum is
| incredible, and even if the industry sticks with transformers,
| there are years left of low hanging fruit and incremental
| improvements before things start slowing.
|
| There doesn't appear to be anything that these AI models cannot
| do, in principle, given sufficient data and compute. They've
| figured out multimodality and complex integration, self play
| for arbitrary domains, and lots of high-cost longer term
| paradigms that will push capabilities forwards for at least 2
| decades in conjunction with Moore's law.
|
| Things are going to continue getting better, faster, and
| weirder. If someone is making confident predictions beyond
| those claims, it's probably their job.
| sottol wrote:
| Maybe that's true for absolute arm-chair-engineering
| outsiders (like me) but these models are in training for
| months, training data is probably being prepared year(s) in
| advance. These models have a knowledge cut-off in 2024 - so
| they have been in training for a while. There's no way sama
| did not have a good idea that this non-COT model was in the
| pipeline 2 months ago. It was probably finished training then
| and undergoing evals.
|
| Maybe
|
| 1. he's just doing his job and hyping OpenAI's competitive
| advantages (afair most of the competition didn't have decent
| COT models in Feb), or
|
| 2. something changed and they're releasing models now that
| they didn't intend to release 2 months ago (maybe because a
| model they did intend to release is not ready and won't be
| for a while), or
|
| 3. COT is not really as advantageous as it was deemed to be
| 2+ months ago and/or computationally too expensive.
| fragmede wrote:
| With new hardware from Nvidia announced coming out, those
| months turn into weeks.
| sottol wrote:
| I doubt it's going to be weeks, the months were already
| turning into years despite Nvidia's previous advances.
|
| (Not to say that it takes openai years to train a new
| model, just that the timeline between major GPT releases
| seems to double... be it for data gathering, training,
| taking breaks between training generations, ... - either
| way, model training seems to get harder not easier).
|
| GPT Model | Release Date | Months Since Previous Model
|
| GPT-1 | 11.06.2018
|
| GPT-2 | 14.02.2019 | 8.16
|
| GPT-3 | 28.05.2020 | 15.43
|
| GPT-4 | 14.03.2023 | 33.55
|
| [1]https://www.lesswrong.com/posts/BWMKzBunEhMGfpEgo/when
| -will-...
| moojacob wrote:
| > Things are going to continue getting better, faster, and
| weirder.
|
| I love this. Especially the weirder part. This tech can be
| useful in every crevice of society and we still have no idea
| what new creative use cases there are.
|
| Who would've guessed phones and social media would cause mass
| protests because bystanders could record and distribute
| videos of the police?
| staunton wrote:
| > Who would've guessed phones and social media would cause
| mass protests because bystanders could record and
| distribute videos of the police?
|
| That would have been quite far down on my list of "major
| (unexpected) consequences of phones and social media"...
| authorfly wrote:
| > the release and research cycles are still contracting
|
| Not necessarily the progress, though, at least on the benchmarks
| you would look at for the broader picture (MMLU etc).
|
| GPT-3 was an amazing step up from GPT-2, something scientists
| in the field really thought was at least 10-15 years out, done
| in 2. Instruct/RHLF for GPTs was a similarly massive splash,
| making the second half of 2021 equally amazing.
|
| However nothing since has really been that left field or
| unpredictable from then, and it's been almost 3 years since
| RHLF hit the field. We knew good image understanding as
| input, longer context, and improved prompting would improve
| results. The releases are common, but the progress feels like
| it has stalled for me.
|
| What really has changed since Davinci-instruct or ChatGPT to
| you? When making an AI-using product, do you construct it
| differently? Are agents presently more than APIs talking to
| databases with private fields?
| hectormalot wrote:
| In some dimensions I recognize the slow down in how fast
| new capabilities develop, but the speed still feels very
| high:
|
| Image generation suddenly went from gimmick to useful now
| that prompt adherence is so much better (eagerly waiting
| for that to be in the API)
|
| Coding performance continues to improve noticeably (for
| me). Claude 3.7 felt like a big step from 4o/3.5. Gemini
| 2.5 in a similar way. Compared to just 6 months ago I can
| give bigger and more complex pieces of work to it and get
| relatively good output back. (Net acceleration)
|
| Audio-2-audio seems like it will be a big step as well. I
| think this has much more potential than the STT-LLM-TTS
| architecture commonly used today (latency, quality)
| liamwire wrote:
| Excuse the pedantry; for those reading, it's RLHF rather
| than RHLF.
| kadushka wrote:
| I see a huge progress made since the first gpt-4 release.
| The reliability of answers has improved an order of
| magnitude. Two years ago, more than half of my questions
| resulted in incorrect or partially correct answers (most of
| my queries are about complicated software algorithms or phd
| level research brainstorming). A simple "are you sure"
| prompt would force the model to admit it was wrong most of
| the time. Now with o1 this almost never happens and the
| model seems to be smarter or at least more capable than me
| - in general. GPT-4 was a bright high school student. o1 is
| a postdoc.
| adamgordonbell wrote:
| Perhaps it is a distilled 4.5, or based on its lineage, as
| some suggested.
| zitterbewegung wrote:
| I think that people balked at the cost of 4.5 and really wanted
| just a slightly improved 4o. Now it almost seems that
| they will have separate product lines for non-chain-of-thought
| and chain-of-thought models, which actually makes sense
| because some want a cheap model and some don't.
| freehorse wrote:
| > Deprecating GPT-4.5 less than 2 months after introducing it
| also seems unlikely to be the original plan.
|
| Well they actually already hinted at possible deprecation in
| their initial announcement of gpt4.5 [0]. Also, as others said,
| this model was already offered in the api as chatgpt-latest,
| but there was no checkpoint which made it unreliable for actual
| use.
|
| [0] https://openai.com/index/introducing-
| gpt-4-5/#:~:text=we%E2%...
| resource_waste wrote:
| When I saw them say 'no more non-COT models', I was mildly
| panicked.
|
| While their competitors have made fantastic models, at the time
| I perceived ChatGPT4 was the best model for many applications.
| COT was often tricked by my prompts, assuming things to be
| true, when a non-COT model would say something like 'That isn't
| necessarily the case'.
|
| I use both COT and non when I have an important problem.
|
| Seeing them keep a non-COT model around is a good idea.
| gcy wrote:
| 4.10 > 4.5 -- @stevenheidel
|
| @sama: underrated tweet
|
| Source: https://x.com/stevenheidel/status/1911833398588719274
| wongarsu wrote:
| Too bad OpenAI named it 4.1 instead of 4.10. You can either
| claim 4.10 > 4.5 (the dots separate natural numbers) or 4.1 ==
| 4.10 (they are decimal numbers), but you can't have both at
| once
| stevenheidel wrote:
| so true
| furyofantares wrote:
| It's another Daft Punk day. Change a string in your program* and
| it's better, faster, cheaper: pick 3.
|
| *Then fix all your prompts over the next two weeks.
| wongarsu wrote:
| Is the version number a retcon of 4.5? On OpenAI's models page
| the names appear completely reasonable [1]: The o1 and o3
| reasoning models, and non-reasoning there is 3.5, 4, 4o and 4.1
| (let's pretend 4o makes sense). But that is only reasonable as
| long as we pretend 4.5 never happened, which the models page
| apparently does
|
| 1: https://platform.openai.com/docs/models
| esafak wrote:
| More information here:
| https://platform.openai.com/docs/models/gpt-4.1
| https://platform.openai.com/docs/models/gpt-4.1-mini
| https://platform.openai.com/docs/models/gpt-4.1-nano
| LeicaLatte wrote:
| i've recently set claude 3.7 as the default option for customers
| when they start new chats in my app. this was a recent change,
| and i'm feeling good about it. supporting multiple providers can
| be a nightmare for customer service, especially when it comes to
| billing and handling response quality queries. with so many
| choices from just one provider, it simplifies things
| significantly. curious about how openai manages customer service
| internally.
| bbstats wrote:
| ok.
| XCSme wrote:
| I tried 4.1-mini and 4.1-nano. The responses are a lot faster, but
| for my use-case they seem to be a lot worse than 4o-mini (they
| fail to complete the task when 4o-mini could do it). Maybe I have
| to update my prompts...
| XCSme wrote:
| Even after updating my prompts, 4o-mini still seems to do
| better than 4.1-mini or 4.1-nano for a data-processing task.
| BOOSTERHIDROGEN wrote:
| Mind sharing your system prompt?
| XCSme wrote:
| It's quite complex, but the task is to parse some HTML
| content, or to choose from a list of URLs which one is the
| best.
|
| I will check again the prompt, maybe 4o-mini ignores some
| instructions that 4.1 doesn't (instructions which might
| result in the LLM returning zero data).
| pbmango wrote:
| I think an underappreciated reality is that all of the large AI
| labs, and OpenAI in particular, are fighting multiple market
| battles at once. This is coming across in both the number of
| products and the packaging.
|
| 1. To win consumer growth they have continued to benefit from
| hyper-viral moments; lately that was image generation in 4o, which
| was likely technically possible long before it launched.
|
| 2. For enterprise workloads and large API use, they seem to have
| focused less lately, but the pricing of 4.1 is clearly an answer
| to Gemini, which has been winning on ultra high volume and
| consistency.
|
| 3. For full frontier benchmarks they pushed out 4.5 to stay SOTA
| and attract the best researchers.
|
| 4. On top of all that they had to, and did, quickly answer the
| reasoning promise and DeepSeek threat with faster and cheaper o
| models.
|
| They are still winning many of these battles but history
| highlights how hard multi-front warfare is, at least for teams of
| humans.
| spiderfarmer wrote:
| On that note, I want to see benchmarks for which LLMs are best
| at translating between languages. To me, it's an entire product
| category.
| pbmango wrote:
| There are probably many more small battles being fought or
| emerging. I think voice and PDF parsing are growing battles
| too.
| kristianp wrote:
| I agree. 4.1 seems to be a release that addresses shortcomings
| of 4o in coding compared to Claude 3.7 and Gemini 2.0 and 2.5
| pastureofplenty wrote:
| The plagiarism machine got an update! Yay!
| archeantus wrote:
| "GPT-4.1 scores 54.6% on SWE-bench Verified, improving by
| 21.4%abs over GPT-4o and 26.6%abs over GPT-4.5--making it a
| leading model for coding."
|
| 4.1 is 26.6% better at coding than 4.5. Got it. Also...see the em
| dash
| drexlspivey wrote:
| Should have named it 4.10
| pdabbadabba wrote:
| What's wrong with the em-dash? That's just...the
| typographically correct dash AFAIK.
| sharkjacobs wrote:
| > You're eligible for free daily usage on traffic shared with
| OpenAI through April 30, 2025.
|
| > Up to 1 million tokens per day across gpt-4.5-preview,
| gpt-4.1, gpt-4o and o1
|
| > Up to 10 million tokens per day across gpt-4.1-mini,
| gpt-4.1-nano, gpt-4o-mini, o1-mini and o3-mini
|
| > Usage beyond these limits, as well as usage for other models,
| will be billed at standard rates. Some limitations apply.
|
| I just found this option in
| https://platform.openai.com/settings/organization/data-contr...
|
| Is just this something I haven't noticed before? Or is this new?
| XCSme wrote:
| So, that's like $10/day to give all your data/prompts?
| sacrosaunt wrote:
| Not new, launched in December 2024.
| https://community.openai.com/t/free-tokens-on-traffic-shared...
| __mharrison__ wrote:
| I know this is somewhat off topic, but can someone explain the
| naming convention used by OpenAI? Number vs "mini" vs "o" vs
| "turbo" vs "chat"?
| iteratethis wrote:
| Mini refers to the size of the model (fewer parameters).
|
| "o" means "omni", which means it's multimodal.
| kristianp wrote:
| Looks like the Quasar and Optimus stealth models on Openrouter
| were in fact GPT-4.1. This is what I get when I try to access the
| openrouter/optimus-alpha model now: {"error":
| {"message":"Quasar and Optimus were stealth models, and
| revealed on April 14th as early testing versions of GPT 4.1.
| Check it out: https://openrouter.ai/openai/gpt-4.1","code":404}}
| lxgr wrote:
| As a ChatGPT user, I'm weirdly happy that it's not available
| there yet. I already have to make a conscious choice between
|
| - 4o (can search the web, use Canvas, evaluate Python server-
| side, generate images, but has no chain of thought)
|
| - o3-mini (web search, CoT, canvas, but no image generation)
|
| - o1 (CoT, maybe better than o3, but no canvas or web search and
| also no images)
|
| - Deep Research (very powerful, but I have only 10 attempts per
| month, so I end up using roughly zero)
|
| - 4.5 (better in creative writing, and probably warmer sound
| thanks to being vinyl based and using analog tube amplifiers, but
| slower and request limited, and I don't even know which of the
| other features it supports)
|
| - 4o "with scheduled tasks" (why on earth is that a model and not
| a tool that the other models can use!?)
|
| Why do I have to figure all of this out myself?
| fragmede wrote:
| what's hilarious to me is that I asked ChatGPT about the model
| names and approaches and it did a better job than they have.
| resters wrote:
| I use them as follows:
|
| o1-pro: anything important involving accuracy or reasoning.
| Does the best at accomplishing things correctly in one go even
| with lots of context.
|
| deepseek R1: anything where I want high quality non-academic
| prose or poetry. Hands down the best model for these. Also very
| solid for fast and interesting analytical takes. I love
| bouncing ideas around with R1 and Grok-3 bc of their fast
| responses and reasoning. I think R1 is the most creative yet
| also the best at mimicking prose styles and tone. I've
| speculated that Grok-3 is R1 with mods and think it's
| reasonably likely.
|
| 4o: image generation, occasionally something else but never for
| code or analysis. Can't wait till it can generate accurate
| technical diagrams from text.
|
| o3-mini-high and grok-3: code or analysis that I don't want to
| wait for o1-pro to complete.
|
| claude 3.7: occasionally for code if the other models are
| making lots of errors. Sometimes models will anchor to outdated
| information in spite of being informed of newer information.
|
| gemini models: occasionally I test to see if they are
| competitive, so far not really, though I sense they are good at
| certain things. Excited to try 2.5 Deep Research more, as it
| seems promising.
|
| Perplexity: discontinued subscription once the search
| functionality in other models improved.
|
| I'm really looking forward to o3-pro. Let's hope it's available
| soon as there are some things I'm working on that are on hold
| waiting for it.
| motoboi wrote:
| You probably know this but it can already generate accurate
| diagrams. Just ask for the output in a diagram language like
| mermaid or graphviz
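|
| A rough sketch of that approach with the Python SDK (the prompt
| wording and model choice are just examples); the returned Mermaid
| source can then be pasted into any Mermaid renderer:
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     response = client.chat.completions.create(
|         model="gpt-4.1",
|         messages=[{
|             "role": "user",
|             "content": "Draw the login flow described below as a Mermaid "
|                        "sequence diagram. Output only the Mermaid source.\n\n"
|                        "<system description goes here>",
|         }],
|     )
|     print(response.choices[0].message.content)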
| bangaladore wrote:
| My experience is it often produces terrible diagrams.
| Things clearly overlap, lines make no sense. I'm not
| surprised as if you told me to layout a diagram in XML/YAML
| there would be obvious mistakes and layout issues.
|
| I'm not really certain a text output model can ever do well
| here.
| resters wrote:
| FWIW I think a multimodal model could be trained to do
| extremely well with it given sufficient training data. A
| combination of textual description of the system and/or
| diagram, source code (mermaid, SVG, etc.) for the
| diagram, and the resulting image, with training to
| translate between all three.
| bangaladore wrote:
| Agreed. Even simpler: I'm sure a service like this already
| exists (or could easily exist) where the workflow is
| something like:
|
| 1. User provides information
|
| 2. LLM generates structured output for whatever modeling
| language
|
| 3. Same or other multimodal LLM reviews the generated
| graph for styling / positioning issues and ensure its
| matches user request.
|
| 4. LLM generates structured output based on the feedback.
|
| 5. etc...
|
| But you could probably fine-tune a multimodal model to do
| it in one shot, or way more effectively.
| resters wrote:
| I've had mixed and inconsistent results and it hasn't been
| able to iterate effectively when it gets close. Could be
| that I need to refine my approach to prompting. I've tried
| mermaid and SVG mostly, but will also try graphviz based on
| your suggestion.
| shortcord wrote:
| Gemini 2.5 Pro is quite good at code.
|
| Has become my go to for use in Cursor. Claude 3.7 needs to be
| restrained too much.
| throwup238 wrote:
| _> - Deep Research (very powerful, but I have only 10 attempts
| per month, so I end up using roughly zero)_
|
| Same here, which is a real shame. I've switched to DeepResearch
| with Gemini 2.5 Pro over the last few days, where paid users
| have a 20/day limit instead of 10/month, and it's been
| great, especially since now Gemini seems to browse 10x more
| pages than OpenAI Deep Research (on the order of 200-400 pages
| versus 20-40).
|
| The reports are too verbose but having it research random
| development ideas, or how to do something particularly complex
| with a specific library, or different approaches or
| architectures to a problem has been very productive without
| sliding into vibe coding territory.
| cafeinux wrote:
| > 4.5 (better in creative writing, and probably warmer sound
| thanks to being vinyl based and using analog tube amplifiers,
| but slower and request limited, and I don't even know which of
| the other features it supports)
|
| Is that an LLM hallucination?
| htrp wrote:
| anyone want to guess parameter sizes here for
|
| GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano?
|
| I'll start with
|
| 800 bn MoE (probably 120 bn activated), 200 bn MoE (33 bn
| activated), and 7bn parameter for nano
| i_love_retros wrote:
| I feel overwhelmed
| Ninjinka wrote:
| I've been using it in Cursor for the past few hours and prefer it
| to Sonnet 3.7. It's much faster and doesn't seem to make the sort
| of stupid mistakes Sonnet has been making recently.
| omneity wrote:
| I have been trying GPT-4.1 for a few hours by now through Cursor
| on a fairly complicated code base. For reference, my gold
| standard for a coding agent is Claude Sonnet 3.7 despite its
| tendency to diverge and lose focus.
|
| My takeaways:
|
| - This is the first model from OpenAI that feels relatively
| agentic to me (o3-mini sucks at tool use, 4o just sucks). It
| seems to be able to piece together several tools to reach the
| desired goal and follows a roughly coherent plan.
|
| - There is still more work to do here. Despite OpenAI's
| cookbook[0] and some prompt engineering on my side, GPT-4.1 stops
| quickly to ask questions, getting into a quite useless "convo
| mode". Its tool calls fails way too often as well in my opinion.
|
| - It's also able to handle significantly less complexity than
| Claude, resulting in some comical failures. Where Claude would
| create server endpoints, frontend components and routes and
| connect the two, GPT-4.1 creates simplistic UI that calls a mock
| API despite explicit instructions. When prompted to fix it, it
| went haywire and couldn't handle the multiple scopes involved in
| that test app.
|
| - With that said, within all these parameters, it's much less
| unnerving than Claude and it sticks to the request, as long as
| the request is not too complex.
|
| My conclusion: I like it, and totally see where it shines: narrow,
| targeted work, alongside Claude 3.7 for creative work and
| Gemini 2.5 Pro for deep, complex tasks. GPT-4.1 does feel like a
| smaller model compared to these last two, but maybe I just need
| to use it for longer.
|
| 0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide
| ttul wrote:
| I feel the same way about these models as you conclude. Gemini
| 2.5 is where I paste whole projects for major refactoring
| efforts or building big new bits of functionality. Claude 3.7
| is great for most day to day edits. And 4.1 okay for small
| things.
|
| I hope they release a distillation of 4.5 that uses the same
| training approach; that might be a pretty decent model.
| bli940505 wrote:
| Does this mean that the o1 and o3-mini models are also using 4.1
| as the base now?
___________________________________________________________________
(page generated 2025-04-14 23:00 UTC)