[HN Gopher] GPT-4.1 in the API
       ___________________________________________________________________
        
       GPT-4.1 in the API
        
       Author : maheshrijal
       Score  : 403 points
       Date   : 2025-04-14 17:01 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | porphyra wrote:
       | pretty wild versioning that GPT 4.1 is newer and better in many
       | regards than GPT 4.5.
        
         | mhh__ wrote:
         | I think they're doing it deliberately at this point
        
           | hmottestad wrote:
           | Tomorrow they are releasing the open source GPT-1.4 model :P
        
         | asdev wrote:
         | it's worse on nearly every benchmark
        
           | brokensegue wrote:
           | no? it's better on AIME '24, Multilingual MMLU, SWE-bench,
           | Aider's polyglot, MMMU, ComplexFuncBench
           | 
           | and it ties on a lot of benchmarks
        
             | asdev wrote:
             | look at all the graphs in the article
        
               | brokensegue wrote:
               | the data i posted all came from the graphs/charts in the
               | article
        
       | exizt88 wrote:
       | For conversational AI, the most significant part is GPT-4.1 mini
        | being 2x faster than GPT-4o with basically the same reasoning
       | capabilities.
        
       | bakugo wrote:
       | > We will also begin deprecating GPT-4.5 Preview in the API, as
       | GPT-4.1 offers improved or similar performance on many key
       | capabilities at much lower cost and latency. GPT-4.5 Preview will
       | be turned off in three months, on July 14, 2025, to allow time
       | for developers to transition.
       | 
       | Well, that didn't last long.
        
         | WorldPeas wrote:
         | so we're going back... .4 of a gpt? make it make sense openai..
        
       | elias_t wrote:
        | Does anyone have benchmarks comparing it to other models?
        
         | cbg0 wrote:
         | claude 3.7 no thinking (diff) - 60.4%
         | 
         | claude 3.7 32k thinking tokens (diff) - 64.9%
         | 
         | GPT-4.1 (diff) - 52.9% (stat is from the blog post)
         | 
         | https://aider.chat/docs/leaderboards/
        
       | oidar wrote:
       | I need an AI to understand the naming conventions that OpenAI is
       | using.
        
         | fusionadvocate wrote:
         | They envy the USB committee.
        
       | ZeroCool2u wrote:
       | No benchmark comparisons to other models, especially Gemini 2.5
       | Pro, is telling.
        
         | dmd wrote:
         | Gemini 2.5 Pro gets 64% on SWE-bench verified. Sonnet 3.7 gets
         | 70%
         | 
         | They are reporting that GPT-4.1 gets 55%.
        
           | hmottestad wrote:
           | Are those with <<thinking>> or without?
        
             | energy123 wrote:
             | With
        
             | chaos_emergent wrote:
             | based on their release cadence, I suspect that o4-mini will
             | compete on price, performance, and context length with the
             | rest of these models.
        
               | hecticjeff wrote:
               | o4-mini, not to be confused with 4o-mini
        
             | sanxiyn wrote:
             | Sonnet 3.7's 70% is without thinking, see
             | https://www.anthropic.com/news/claude-3-7-sonnet
        
           | egeozcan wrote:
           | Very interesting. For my use cases, Gemini's responses beat
           | Sonnet 3.7's like 80% of the time (gut feeling, didn't
           | collect actual data). It beats Sonnet 100% of the time when
           | the context gets above 120k.
        
             | int_19h wrote:
             | As usual with LLMs. In my experience, all those metrics are
             | useful mainly to tell which models are definitely bad, but
              | don't tell you much about which ones are good, and
             | especially not how the good ones stack against each other
             | in real world use cases.
             | 
             | Andrej Karpathy famously quipped that he only trusts two
             | LLM evals: Chatbot Arena (which has humans blindly compare
             | and score responses), and the r/LocalLLaMA comment section.
        
         | poormathskills wrote:
         | Go look at their past blog posts. OpenAI only ever benchmarks
         | against their own models.
         | 
         | This is pretty common across industries. The leader doesn't
         | compare themselves to the competition.
        
           | dimitrios1 wrote:
           | There is no uniform tactic for this type of marketing. They
            | will compare against whomever they need to in order to suit
            | their marketing goals.
        
           | oofbaroomf wrote:
           | Leader is debatable, especially given the actual
           | comparisons...
        
           | swyx wrote:
           | also sometimes if you get it wrong you catch unnecessary flak
        
           | kweingar wrote:
           | That would make sense if OAI were the leader.
        
           | christianqchung wrote:
           | Okay, it's common across other industries, but not this one.
           | Here is Google, Facebook, and Anthropic comparing their
           | frontier models to others[1][2][3].
           | 
           | [1] https://blog.google/technology/google-deepmind/gemini-
           | model-...
           | 
           | [2] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
           | 
           | [3] https://www.anthropic.com/claude/sonnet
        
             | poormathskills wrote:
             | Right. Those labs aren't leading the industry.
        
           | awestroke wrote:
           | Except they are far from the lead in model performance
        
             | poormathskills wrote:
             | Who has a (publicly released) model that is SOTA is
             | constantly changing. It's more interesting to see who is
             | driving the innovation in the field, and right now that is
             | pretty clearly OpenAI (GPT-3, first multi-modal model,
              | first reasoning model, etc.).
        
       | codingwagie wrote:
       | GPT-4.1 probably is a distilled version of GPT-4.5
       | 
        | I don't understand the constant complaining about naming
        | conventions. The number system differentiates the models based on
        | capability; any other method would not do that. After ten models
        | with random names like "gemini" and "nebula" you would have no
        | idea which is which. It's a low-IQ take. You don't name new
        | versions of software as if they were completely different
        | software.
        | 
        | Also, yesterday, using v0, I replicated a full Next.js UI copying
        | a major SaaS player. No backend integration, but the design and
        | UX were stunning, and better than I could do if I tried. I have
        | 15 years of backend experience at FAANG. Software will get
        | automated, and it already is; people just haven't figured it out
        | yet.
        
         | rvz wrote:
         | > Yesterday, using v0, I replicated a full nextjs UI copying a
         | major saas player. No backend integration, but the design and
         | UX were stunning, and better than I could do if I tried.
         | 
         | Exactly. Those who do frontend or focus on pretty much anything
         | Javascript are, how should I say it? Cooked?
         | 
         | > Software will get automated
         | 
          | The first to go are JavaScript / TypeScript engineers; they
          | have already been automated out of a job. It is all over for
          | them.
        
           | codingwagie wrote:
            | Yeah, it's over for them. Complicated business logic and
            | sprawling systems are what's keeping backend safe for now.
            | But big frontend codebases, where individual files (like
            | React components) are largely decoupled from the rest of the
            | codebase, are why frontend is completely cooked.
        
           | camdenreslink wrote:
           | I have a medium-sized typescript personal project I work on.
           | It probably has 20k LOC of well organized typescript (react
           | frontend, express backend). I also have somewhat
           | comprehensive docs and cursor project rules.
           | 
           | In general I use Cursor in manual mode asking it to make very
           | well scoped small changes (e.g. "write this function that
           | does this in this exact spot"). Yesterday I needed to make a
           | largely mechanical change (change a concept in the front end,
           | make updates to the corresponding endpoints, update the data
           | access methods, update the database schema).
           | 
           | This is something very easy I would expect a junior developer
           | to be able to accomplish. It is simple, largely mechanical,
           | but touches a lot of files. Cursor agent mode puked all over
           | itself using Gemini 2.5. It could summarize what changes
           | would need to be made, but it was totally incapable of making
           | the changes. It would add weird hard coded conditions, define
           | new unrelated files, not follow the conventions of the
           | surrounding code at all.
           | 
           | TLDR; I think LLMs right now are good for greenfield
           | development (create this front end from scratch following
           | common patterns), and small scoped changes to a few files. If
           | you have any kind of medium sized refactor on an existing
           | code base forget about it.
        
             | codingwagie wrote:
             | My personal opinion is leveraging LLMs on a large code base
             | requires skill. How you construct the prompt, and what you
             | keep in context, which model you use, all have a large
              | effect on the output. If you just put it into Cursor and
              | throw your hands up, you probably didn't do it right.
        
               | camdenreslink wrote:
               | I gave it a list of the changes I needed and pointed it
                | to the area of the different files that needed updating. I
               | also have comprehensive cursor project rules. If I needed
               | to hand hold any more than that it would take
               | considerably less time to just make the changes myself.
        
             | Philpax wrote:
             | > Cursor agent mode puked all over itself using Gemini 2.5.
             | It could summarize what changes would need to be made, but
             | it was totally incapable of making the changes.
             | 
             | Gemini 2.5 is currently broken with the Cursor agent; it
             | doesn't seem to be able to issue tool calls correctly. I've
             | been using Gemini to write plans, which Claude then
             | executes, and this seems to work well as a workaround.
             | Still unfortunate that it's like this, though.
        
               | camdenreslink wrote:
               | Interesting, I've found Gemini better than Claude so I
               | defaulted to that. I'll try another refactor in agent
               | mode with Claude.
        
         | jsheard wrote:
         | > using v0, I replicated a full nextjs UI copying a major saas
         | player. No backend integration, but the design and UX were
         | stunning
         | 
         | AI is amazing, now all you need to create a stunning UI is for
         | someone else to make it first so an AI can rip it off. Not
         | beating the "plagiarism machine" allegations here.
        
           | codingwagie wrote:
            | Here's a secret: Most of the highest-funded VC-backed software
           | companies are just copying a competitor with a slight product
           | spin/different pricing model
        
             | umanwizard wrote:
             | Got any examples?
        
               | codingwagie wrote:
               | Rippling
        
             | florakel wrote:
             | Exactly, they like to call it "bringing new energy to an
             | old industry".
        
             | singron wrote:
             | > Jim Barksdale, used to say there's only two ways to make
             | money in business: One is to bundle; the other is unbundle
             | 
             | https://a16z.com/the-future-of-work-cars-and-the-wisdom-
             | in-s...
        
         | Philpax wrote:
         | > The number system differentiates the models based on
         | capability, any other method would not do that.
         | 
         | Please rank GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1-nano,
         | GPT-4.1-mini, GPT-4.1, GPT-4.5, o1-mini, o1, o1 pro, o3-mini,
         | o3-mini-high, o3, and o4-mini in terms of capability without
         | consulting any documentation.
        
           | codingwagie wrote:
           | Very easy with the naming system?
        
             | bobxmax wrote:
             | Really? Is o3-mini-high better than o1-pro?
        
               | vbezhenar wrote:
               | In my experience it's better for value/price, but if you
               | just need to solve a problem, o1 pro is the best tool
               | available.
        
           | umanwizard wrote:
           | Btw, as someone who agrees with your point, what's the actual
           | answer to this?
        
             | henlobenlo wrote:
              | What's the problem? For the layman it doesn't actually
              | matter, and for the experts, it's usually very obvious which
              | model to use.
        
               | umanwizard wrote:
               | That's not true. I'm a layman and 4.5 is obviously better
               | than 4o for me, definitely enough to matter.
        
               | henlobenlo wrote:
               | You are definitely not a layman if you know the
                | difference between 4.5 and 4o. The average user thinks AI
                | = OpenAI = ChatGPT.
        
               | umanwizard wrote:
               | Well, okay, but I'm certainly not an expert who knows the
               | fine differences between all the models available on
               | chat.com. So I'm somewhere between your definition of
               | "layman" and your definition of "expert" (as are, I
               | suspect, most people on this forum).
        
               | DiscourseFan wrote:
                | LLMs fundamentally have the same constraints no matter how
               | much juice you give them or how much you toy with the
               | models.
        
             | minimaxir wrote:
             | It depends on how you define "capability" since that's
             | different for reasoning and nonreasoning models.
        
             | n2d4 wrote:
             | Of these, some are mostly obsolete: GPT-4 and GPT-4 Turbo
             | are worse than GPT-4o in both speed and capabilities. o1 is
             | worse than o3-mini-high in most aspects.
             | 
             | Then, some are not available yet: o3 and o4-mini. GPT-4.1 I
             | haven't played with enough to give you my opinion on.
             | 
             | Among the rest, it depends on what you're looking for:
             | 
             | Multi-modal: GPT-4o > everything else
             | 
             | Reasoning: o1-pro > o3-mini-high > o3-mini
             | 
             | Speed: GPT-4o > o3-mini > o3-mini-high > o1-pro
             | 
             | (My personal favorite is o3-mini-high for most things, as
             | it has a good tradeoff between speed and reasoning.
             | Although I use 4o for simpler queries.)
        
               | Y_Y wrote:
               | So where was o1-pro in the comparisons in OpenAI's
               | article? I just don't trust any of these first party
               | benchmarks any more.
        
               | umanwizard wrote:
               | Is 4.5 not strictly better than 4o?
        
           | zeroxfe wrote:
           | There's no single ordering -- it really depends on what
           | you're trying to do, how long you're willing to wait, and
           | what kinds of modalities you're interested in.
        
           | chaos_emergent wrote:
            | I mean, this is actually straightforward if you've been
            | paying even the remotest attention.
           | 
           | Chronologically:
           | 
           | GPT-4, GPT-4 Turbo, GPT-4o, o1-preview/o1-mini,
           | o1/o3-mini/o3-mini-high/o1-pro, gpt-4.5, gpt-4.1
           | 
           | Model iterations, by training paradigm:
           | 
           | SGD pretraining with RLHF: GPT-4 -> turbo -> 4o
           | 
           | SGD pretraining w/ RL on verifiable tasks to improve
           | reasoning ability: o1-preview/o1-mini -> o1/o3-mini/o3-mini-
           | high (technically the same product with a higher reasoning
           | token budget) -> o3/o4-mini (not yet released)
           | 
           | reasoning model with some sort of Monte Carlo Search
           | algorithm on top of reasoning traces: o1-pro
           | 
           | Some sort of training pipeline that does well with sparser
           | data, but doesn't incorporate reasoning (I'm positing here,
           | training and architecture paradigms are not that clear for
           | this generation): gpt-4.5, gpt-4.1 (likely fine-tuned on 4.5)
           | 
           | By performance: hard to tell! Depends on what your task is,
           | just like with humans. There are plenty of benchmarks.
           | Roughly, for me, the top 3 by task are:
           | 
           | Creative Writing: gpt-4.5 -> gpt-4o
           | 
           | Business Comms: o1-pro -> o1 -> o3-mini
           | 
           | Coding: o1-pro -> o3-mini (high) -> o1 -> o3-mini (low) ->
           | o1-mini-preview
           | 
           | Shooting the shit: gpt-4o -> o1
           | 
            | This isn't to dismiss that their marketing nomenclature is
            | bad, just to point out that it's not that confusing for
            | people who are actively working with these models and have a
            | reasonable memory of the past two years.
        
           | newfocogi wrote:
           | I recognize this is a somewhat rhetorical question and your
           | point is well taken. But something that maps well is car
           | makes and models:
           | 
           | - Is Ford Better than Chevy? (Comparison across providers) It
           | depends on what you value, but I guarantee there's tribes
           | that are sure there's only one answer.
           | 
           | - Is the 6th gen 2025 4Runner better than 5th gen 2024
           | 4Runner? (Comparison of same model across new releases) It
           | depends on what you value. It is a clear iteration on the
           | technology, but there will probably be more plastic parts
           | that will annoy you as well.
           | 
           | - Is the 2025 BMW M3 base model better than the 2022 M3
           | Competition (Comparing across years and trims)? Starts to
           | depend even more on what you value.
           | 
           | Providers need to delineate between releases, and years,
           | models, and trims help do this. There are companies that will
            | try to eschew this and go the Tesla route without model
            | years, but still can't get away from it entirely. To a
            | certain person, every character in "2025 M3 Competition
            | xDrive Sedan" matters immensely; to another person it's just
           | gibberish.
           | 
           | But a pure ranking isn't the point.
        
           | mrandish wrote:
           | Yes, point taken.
           | 
           | However, it's _still_ not as bad as Intel CPU naming in some
            | generations or USB naming (until very recently). I know,
            | that's a _very_ low bar... :-)
        
         | tomrod wrote:
         | Just add SemVer with an extra tag:
         | 
         | 4.0.5.worsethan4point5
        
         | whalesalad wrote:
         | > I don't understand the constant complaining about naming
         | conventions.
         | 
         | Oh man. Unfolding my lawn chair and grabbing a bucket of
         | popcorn for this discussion.
        
         | latexr wrote:
         | > You dont name new versions of software as completely
         | different software
         | 
         | macOS releases would like a word with you.
         | 
         | https://en.wikipedia.org/wiki/MacOS#Timeline_of_releases
         | 
         | Technically they still have numbers, but Apple hides them in
         | marketing copy.
         | 
         | https://www.apple.com/macos/
         | 
         | Though they still have "macOS" in the name. I'm being tongue-
         | in-cheek.
        
         | SubiculumCode wrote:
         | Feel free to lay the naming convention rules out for us man.
        
         | throw1235435 wrote:
         | > Software will get automated, and it already is, people just
         | havent figured it out yet
         | 
          | To be honest I think this is the not-so-secret goal of most AI
          | labs (particularly the American ones) now, for a number of
          | strong reasons. You can see it in this announcement,
          | Anthropic's recent Claude 3.7 announcement, OpenAI's first
          | planned agent (SWE-Agent), etc. They have to justify their
          | worth somehow and they see this as a potential path to do
          | that. It remains to be seen how far they will get - I hope I'm
          | wrong.
         | 
         | The reasons however for picking this path IMO are:
         | 
          | - Their usage statistics show coding as the main use:
          | Anthropic recently released their stats, and coding has become
          | the main usage of these models, with other usages being at
          | best novelties or conveniences, small in relative size.
          | Without this market, IMO the hype would have already fizzled a
          | while ago, leaving at best a novelty given the size of the
          | rest of the user base.
         | 
         | - They "smell blood" to disrupt and fear is very effective to
         | promote their product: This IMO is the biggest one. Disrupting
         | software looks to be an achievable goal, but it also is a goal
         | that has high engagement compared to other use cases. No point
         | solving something awesome if people don't care, or only care
          | for a while (e.g. meme image generation). You can see the
         | developers on this site and elsewhere in fear. Fear is the best
         | marketing tool ever and engagement can last years. It keeps
         | people engaged and wanting to know more; and talking about how
         | "they are cooked" almost to the exclusion of everything else
         | (i.e. focusing on the threat). Nothing motivates you to know a
         | product more than not being able to provide for yourself, your
          | family, etc., to the point that most other tech
         | topics/innovations are being drowned out by AI announcements.
         | 
         | - Many of them are losing money and need a market to disrupt:
         | Currently the existing use cases of a chat bot are not yet
         | impressive enough (or haven't been till very recently) to
          | justify the massive valuations of these companies. It's coding
         | that is allowing them to bootstrap into other domains.
         | 
          | - It is a domain they understand: AI devs know models, and
          | they understand the software process. It may be a complex
          | domain requiring constant study, but they know it back to
          | front. This makes it a good first case for disruption, where
          | the data and the know-how are already with the teams.
         | 
          | TL;DR: They are coming after you because it is a big fruit
          | that is easier for them to pick than other domains. It's also
          | one that people will notice, either out of excitement (CEOs,
          | VCs, management, etc.) or out of fear (tech workers,
          | academics, other intellectual workers).
        
       | rvz wrote:
       | The big change about this announcement is the 1M context window
       | on all models.
       | 
       | But the _price_ is what matters.
        
         | croemer wrote:
         | Nothing compared to Llama 4's 7M. What matters is how well it
         | performs with such long context, not what the technical maximum
         | is.
        
       | polytely wrote:
       | It seems that OpenAI is really differentiating itself in the AI
       | market by developing the most incomprehensible product names in
       | the history of software.
        
         | croes wrote:
         | They learned from the best: Microsoft
        
           | pixl97 wrote:
           | "Hey buddy, want some .Net, oh I mean dotnet"
        
           | nivertech wrote:
           | GPT 4 Workgroups
        
           | amarcheschi wrote:
           | GpTeams Classic
        
           | greenavocado wrote:
           | Microsoft Neural Language Processing Hyperscale Datacenter
           | Enterprise Edition 4.1
           | 
           | A massive transformer-based language model requiring:
           | 
           | - 128 Xeon server-grade CPUs
           | 
           | - 25,000MB RAM minimum (40,000MB recommended)
           | 
           | - 80GB hard disk space for model weights
           | 
           | - Dedicated NVIDIA Quantum Accelerator Cards (minimum 8)
           | 
           | - Enterprise-grade cooling solution
           | 
           | - Dedicated 30-amp power circuit
           | 
           | - Windows NT Advanced Server with Parallel Processing
           | Extensions
           | 
           | ~
           | 
           | Features:
           | 
           | - Natural language understanding and generation
           | 
           | - Context window of 8,192 tokens
           | 
           | - Enterprise security compliance module
           | 
           | - Custom prompt engineering interface
           | 
           | - API gateway for third-party applications
           | 
           | *Includes 24/7 on-call Microsoft support team and requires
           | dedicated server room with raised floor cooling
        
           | jmount wrote:
           | Or Intel.
        
         | jfoster wrote:
         | I wonder how they decide whether the o or the digit needs to
         | come first. (eg. o3 vs 4o)
        
           | oofbaroomf wrote:
           | Reasoning models have the o first, non-reasoners have the
           | digit first.
        
       | yberreby wrote:
       | > Note that GPT-4.1 will only be available via the API. In
       | ChatGPT, many of the improvements in instruction following,
       | coding, and intelligence have been gradually incorporated into
       | the latest version (opens in a new window) of GPT-4o, and we will
       | continue to incorporate more with future releases.
       | 
       | The lack of availability in ChatGPT is disappointing, and they're
       | playing on ambiguity here. They are framing this as if it were
       | unnecessary to release 4.1 on ChatGPT, since 4o is apparently
       | great, while simultaneously showing how much better 4.1 is
       | relative to GPT-4o.
       | 
       | One wager is that the inference cost is significantly higher for
       | 4.1 than for 4o, and that they expect most ChatGPT users not to
       | notice a marginal difference in output quality. API users,
       | however, will notice. Alternatively, 4o might have been
       | aggressively tuned to be conversational while 4.1 is more
       | "neutral"? I wonder.
        
         | themanmaran wrote:
         | I disagree. From the average user perspective, it's quite
         | confusing to see half a dozen models to choose from in the UI.
         | In an ideal world, ChatGPT would just abstract away the
         | decision. So I don't need to be an expert in the relatively
         | minor differences between each model to have a good experience.
         | 
          | Vs in the API, I want very strict versioning of the models I'm
          | using, so I can run my own evals and pick the model that works
          | best.
        
           | florakel wrote:
           | > it's quite confusing to see half a dozen models to choose
           | from in the UI. In an ideal world, ChatGPT would just
           | abstract away the decision
           | 
           | Supposedly that's coming with GPT 5.
        
           | yberreby wrote:
            | I agree on both naming and stability. However, this wasn't my
           | point.
           | 
           | They still have a mess of models in ChatGPT for now, and it
           | doesn't look like this is going to get better immediately
           | (even though for GPT-5, they ostensibly want to unify them).
           | You have to choose among all of them anyway.
           | 
           | I'd like to be able to choose 4.1.
        
         | Tiberium wrote:
         | There's a HUGE difference that you are not mentioning: there
         | are "gpt-4o" and "chatgpt-4o-latest" on the API. The former is
         | the stable version (there are a few snapshot but the newest
         | snapshot has been there for a while), and the latter is the
         | fine-tuned version that they often update on ChatGPT. All those
         | benchmarks were done for the _API_ stable version of GPT-4o,
          | since that's what businesses rely on, not on
         | "chatgpt-4o-latest".
        
           | yberreby wrote:
           | Good point, but how does that relate to, or explain, the
           | decision not to release 4.1 in ChatGPT? If they have a nice
           | post-training pipeline to make 4o "nicer" to talk to, why not
           | use it to fine-tune the base 4.1 into e.g.
           | chatgpt-4.1-latest?
        
             | Tiberium wrote:
             | Because chatgpt-4o-latest already has all of those
             | improvements, the largest point of this release (IMO) is to
             | offer developers a stable snapshot of something that
             | compares to modern 4o latest. Altman said that they'd offer
              | a stable snapshot of chatgpt-4o-latest on the API; perhaps
              | he really did mean GPT 4.1.
        
               | yberreby wrote:
               | > Because chatgpt-4o-latest already has all of those
               | improvements
               | 
               | Does it, though? They said that "many" have already been
               | incorporated. I simply don't buy their vague statements
               | there. These are different models. They may share some
               | training/post-training recipe improvements, but they are
               | still different.
        
       | meetpateltech wrote:
       | GPT-4.1 Pricing (per 1M tokens):
       | 
       | gpt-4.1
       | 
       | - Input: $2.00
       | 
       | - Cached Input: $0.50
       | 
       | - Output: $8.00
       | 
       | gpt-4.1-mini
       | 
       | - Input: $0.40
       | 
       | - Cached Input: $0.10
       | 
       | - Output: $1.60
       | 
       | gpt-4.1-nano
       | 
       | - Input: $0.10
       | 
       | - Cached Input: $0.025
       | 
       | - Output: $0.40
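        | 
        | A rough cost sketch, purely illustrative (prices copied from the
        | list above; the token counts are invented):
        | 
        |     # Hypothetical per-request cost estimate for gpt-4.1
        |     PRICE_PER_M = {"input": 2.00, "cached_input": 0.50, "output": 8.00}
        | 
        |     def request_cost(input_tokens, cached_tokens, output_tokens):
        |         """Cost in USD; prices are per 1M tokens."""
        |         fresh = input_tokens - cached_tokens
        |         return (fresh * PRICE_PER_M["input"]
        |                 + cached_tokens * PRICE_PER_M["cached_input"]
        |                 + output_tokens * PRICE_PER_M["output"]) / 1_000_000
        | 
        |     # e.g. a 50k-token prompt, 40k of it cached, and a 2k-token reply
        |     print(f"${request_cost(50_000, 40_000, 2_000):.4f}")  # ~$0.056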
        
         | minimaxir wrote:
         | The cached input price is notable here: previously with GPT-4o
         | it was 1/2 the cost of raw input, now it's 1/4th.
         | 
         | It's still not as notable as Claude's 1/10th the cost of raw
         | input, but it shows OpenAI's making improvements in this area.
        
           | persedes wrote:
            | Unless that has changed, Anthropic's (and Gemini's) caches
            | are opt-in if I recall, whereas OpenAI automatically caches
            | for you.
        
         | glenstein wrote:
         | Awesome, thank you for posting. As someone who regularly uses
         | 4o mini from the API, any guesses or intuitions about the
         | performance of Nano?
         | 
         | I'm not as concerned about nomenclature as other people, which
         | I think is too often reacting to a headline as opposed to the
         | article. But in this case, I'm not sure if I'm supposed to
          | understand nano as categorically different from mini in terms
         | of what it means as a variation from a core model.
        
           | pzo wrote:
            | They shared in the livestream that 4.1-nano is worse than
            | 4o-mini - so nano is cheaper, faster, and has a bigger
            | context, but is worse in intelligence. 4.1-mini is smarter
            | but comes with a price increase.
        
         | twistslider wrote:
         | The fact that they're raising the price for the mini models by
         | 166% is pretty notable.
         | 
         | gpt-4o-mini for comparison:
         | 
         | - Input: $0.15
         | 
         | - Cached Input $0.075
         | 
         | - Output: $0.60
        
           | conradkay wrote:
           | Seems like 4.1 nano ($0.10) is closer to the replacement and
           | 4.1 mini is a new in-between price
        
           | druskacik wrote:
           | That's what I was thinking. I hoped to see a price drop, but
           | this does not change anything for my use cases.
           | 
           | I was using gpt-4o-mini with batch API, which I recently
           | replaced with mistral-small-latest batch API, which costs
           | $0.10/$0.30 (or $0.05/$0.15 when using the batch API). I may
           | change to 4.1-nano, but I'd have to be overwhelmed by its
            | performance in comparison to Mistral.
        
           | glenstein wrote:
            | I don't think they ever committed themselves to uniform
            | pricing for mini models. Of course cheaper is better, but I
            | understand pricing to be contingent on factors specific to
            | each new model rather than following from a blanket policy.
        
       | minimaxir wrote:
       | It's not the point of the announcement, but I do like the use of
       | the (abs) subscript to demonstrate the improvement in LLM
       | performance since in these types of benchmark descriptions I
       | never can tell if the percentage increase is absolute or
       | relative.
        
       | croemer wrote:
       | Testing against unspecified other "leading" models allows for
        | shenanigans:
       | 
       | > Qodo tested GPT-4.1 head-to-head against other leading models
       | [...] they found that GPT-4.1 produced the better suggestion in
       | 55% of cases
       | 
        | The linked blog post returns a 404:
       | https://www.qodo.ai/blog/benchmarked-gpt-4-1/
        
         | gs17 wrote:
          | The post seems to be up now and compares it slightly favorably
          | to Claude 3.7.
        
           | croemer wrote:
           | Right, now it's up and comparison against Claude 3.7 is
           | better than I feared based on the wording. Though why does
           | the OpenAI announcement talk of comparison against multiple
           | leading models when the Qodo blog post only tests against
           | Claude 3.7...
        
       | runako wrote:
       | ChatGPT currently recommends I use o3-mini-high ("great at coding
       | and logic") when I start a code conversation with 4o.
       | 
       | I don't understand why the comparison in the announcement talks
       | so much about comparing with 4o's coding abilities to 4.1.
       | Wouldn't the relevant comparison be to o3-mini-high?
       | 
       | 4.1 costs a lot more than o3-mini-high, so this seems like a
       | pertinent thing for them to have addressed here. Maybe I am
       | misunderstanding the relationship between the models?
        
         | zamadatix wrote:
         | 4.1 is a pinned API variant with the improvements from the
         | newer iterations of 4o you're already using in the app, so
         | that's why the comparison focuses between those two.
         | 
         | Pricing wise the per token cost of o3-mini is less than 4.1 but
         | keep in mind o3-mini is a reasoning model and you will pay for
         | those tokens too, not just the final output tokens. Also be
         | aware reasoning models can take a long time to return a
         | response... which isn't great if you're trying to use an API
         | for interactive coding.
        
         | ac29 wrote:
         | > I don't understand why the comparison in the announcement
         | talks so much about comparing with 4o's coding abilities to
         | 4.1. Wouldn't the relevant comparison be to o3-mini-high?
         | 
         | There are tons of comparisons to o3-mini-high in the linked
         | article.
        
       | Tiberium wrote:
       | Very important note:
       | 
       | >Note that GPT-4.1 will only be available via the API. In
       | ChatGPT, many of the improvements in instruction following,
       | coding, and intelligence have been gradually incorporated into
       | the latest version
       | 
       | If anyone here doesn't know, OpenAI _does_ offer the ChatGPT
        | model version in the API as chatgpt-4o-latest, but it's bad
        | because they continuously update it, so businesses can't rely on
        | it being stable; that's why OpenAI made GPT 4.1.
        
         | croemer wrote:
         | So you're saying that "ChatGPT-4o-latest (2025-03-26)" in
         | LMarena is 4.1?
        
           | granzymes wrote:
           | No, that is saying that some of the improvements that went
           | into 4.1 have also gone into ChatGPT, including
           | chatgpt-4o-latest (2025-03-26).
        
           | pzo wrote:
            | Yeah, I was surprised that in their benchmarks during the
            | livestream they didn't compare to ChatGPT-4o (2025-03-26) but
            | only to an older one.
        
         | exizt88 wrote:
         | > chatgpt-4o-latest, but it's bad because they continuously
         | update it
         | 
         | Version explicitly marked as "latest" being continuously
          | updated? Crazy.
        
           | sbarre wrote:
           | No one's arguing that it's improperly labelled, but if you're
           | going to use it via API, you _might_ want consistency over
           | bleeding edge.
        
           | IanCal wrote:
           | Lots of the other models are checkpoint releases, and latest
           | is a pointer to the latest checkpoint. Something being
           | continuously updated is quite different and worth knowing
           | about.
        
           | rfw300 wrote:
           | It can be both properly communicated and still bad for API
           | use cases.
        
         | minimaxir wrote:
         | OpenAI (and most LLM providers) allow model version pinning for
         | exactly this reason, e.g. in the case of GPT-4o you can specify
         | gpt-4o-2024-05-13, gpt-4o-2024-08-06, or gpt-4o-2024-11-20.
         | 
         | https://platform.openai.com/docs/models/gpt-4o
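          | 
          | A minimal sketch of pinning a snapshot with the official Python
          | SDK (the snapshot name is one of those listed in the docs
          | above; the prompt is just a placeholder):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()  # reads OPENAI_API_KEY from the environment
          | 
          |     # Pin an exact snapshot instead of the moving "gpt-4o" alias
          |     response = client.chat.completions.create(
          |         model="gpt-4o-2024-08-06",
          |         messages=[{"role": "user", "content": "Say hello."}],
          |     )
          |     print(response.choices[0].message.content)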
        
           | Tiberium wrote:
           | Yes, and they don't make snapshots for chatgpt-4o-latest, but
           | they made them for GPT 4.1, that's why 4.1 is only useful for
           | API, since their ChatGPT product already has the better
           | model.
        
             | cootsnuck wrote:
             | Okay so is GPT 4.1 literally just the current
              | chatgpt-4o-latest or not?
        
         | ilaksh wrote:
         | Yeah, in the last week, I had seen a strong benchmark for
         | chatgpt-4o-latest and tried it for a client's use case. I ended
         | up wasting like 4 days, because after my initial strong test
         | results, in the following days, it gave results that were
         | inconsistent and poor, and sometimes just outputting spaces.
        
       | flakiness wrote:
        | Big focus on coding. It feels like a defensive move against
        | Claude (and more recently, Gemini Pro), which became very popular
        | in that space. I guess they recently figured out some ways to
        | train the model for this "agentic" coding through RL or
        | something - and the finding was too new to apply to 4.5 in time.
        
       | modeless wrote:
       | Numbers for SWE-bench Verified, Aider Polyglot, cost per million
       | output tokens, output tokens per second, and knowledge cutoff
        | month/year:
        | 
        |                  SWE  Aider  Cost  Fast  Fresh
        |     Claude 3.7   70%  65%    $15   77    8/24
        |     Gemini 2.5   64%  69%    $10   200   1/25
        |     GPT-4.1      55%  53%    $8    169   6/24
        |     DeepSeek R1  49%  57%    $2.2  22    7/24
        |     Grok 3 Beta  ?    53%    $15   ?     11/24
       | 
       | I'm not sure this is really an apples-to-apples comparison as it
       | may involve different test scaffolding and levels of "thinking".
       | Tokens per second numbers are from here:
       | https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr...
       | and I'm assuming 4.1 is the speed of 4o given the "latency" graph
       | in the article putting them at the same latency.
       | 
       | Is it available in Cursor yet?
        
         | meetpateltech wrote:
         | Yes, it is available in Cursor[1] and Windsurf[2] as well.
         | 
         | [1] https://twitter.com/cursor_ai/status/1911835651810738406
         | 
         | [2] https://twitter.com/windsurf_ai/status/1911833698825286142
        
           | cellwebb wrote:
           | And free on windsurf for a week! Vibe time.
        
         | tomjen3 wrote:
          | It's available for free in Windsurf, so you can try it out there.
         | 
         | Edit: Now also in Cursor
        
         | jsnell wrote:
         | https://aider.chat/docs/leaderboards/ shows 73% rather than 69%
         | for Gemini 2.5 Pro?
         | 
         | Looks like they also added the cost of the benchmark run to the
         | leaderboard, which is quite cool. Cost per output token is no
         | longer representative of the actual cost when the number of
         | tokens can vary by an order of magnitude for the same problem
         | just based on how many thinking tokens the model is told to
         | use.
        
           | modeless wrote:
           | There are different scores reported by Google for "diff" and
           | "whole" modes, and the others were "diff" so I chose the
           | "diff" score. Hard to make a real apples-to-apples
           | comparison.
        
             | jsnell wrote:
             | The 73% on the current leaderboard is using "diff", not
             | "whole". (Well, diff-fenced, but the difference is just the
             | location of the filename.)
        
               | modeless wrote:
               | Huh, seems like Aider made a special mode specifically
               | for Gemini[1] some time after Google's announcement blog
               | post with official performance numbers. Still not sure it
               | makes sense to quote that new score next to the others.
               | In any case Gemini's 69% is the top score even without a
               | special mode.
               | 
               | [1] https://aider.chat/docs/more/edit-formats.html#diff-
               | fenced:~...
        
               | jsnell wrote:
               | The mode wasn't added after the announcement, Aider has
               | had it for almost a year:
               | https://aider.chat/HISTORY.html#aider-v0320
               | 
               | This benchmark has an authoritative source of results
               | (the leaderboard), so it seems obvious that it's the
               | number that should be used.
        
               | modeless wrote:
               | OK but it was still added specifically to improve Gemini
               | and nobody else on the leaderboard uses it. Google
               | themselves do not use it when they benchmark their own
               | models against others. They use the regular diff mode
               | that everyone else uses.
               | https://blog.google/technology/google-deepmind/gemini-
               | model-...
        
             | tcdent wrote:
             | They just pick the best performer out of the built-in modes
             | they offer.
             | 
              | It's an interesting data point about the model's behavior,
              | but even more so it's a recommendation of which way to
              | configure the model for optimal performance.
              | 
              | I do consider this to be an apples-to-apples benchmark
              | since they're evaluating real-world performance.
        
           | anotherpaulg wrote:
           | Aider author here.
           | 
           | Based on some DMs with the Gemini team, they weren't aware
           | that aider supports a "diff-fenced" edit format. And that it
           | is specifically tuned to work well with Gemini models. So
           | they didn't think to try it when they ran the aider
           | benchmarks internally.
           | 
           | Beyond that, I spend significant energy tuning aider to work
           | well with top models. That is in fact the entire reason for
           | aider's benchmark suite: to quantitatively measure and
           | improve how well aider works with LLMs.
           | 
           | Aider makes various adjustments to how it prompts and
           | interacts with most every top model, to provide the very best
           | possible AI coding results.
        
         | soheil wrote:
         | Yes on both Cursor and Windsurf.
         | 
         | https://twitter.com/cursor_ai/status/1911835651810738406
        
       | msp26 wrote:
       | I was hoping for native image gen in the API but better pricing
       | is always appreciated.
       | 
       | Gemini was drastically cheaper for image/video analysis, I'll
       | have to see how 4.1 mini and nano compare.
        
       | oofbaroomf wrote:
       | I'm not really bullish on OpenAI. Why would they only compare
       | with their own models? The only explanation could be that they
       | aren't as competitive with other labs as they were before.
        
         | greenavocado wrote:
         | See figure 1 for up-to-date benchmarks
         | https://github.com/KCORES/kcores-llm-arena
         | 
         | (Direct Link) https://raw.githubusercontent.com/KCORES/kcores-
         | llm-arena/re...
        
         | poormathskills wrote:
         | Go look at their past blog posts. OpenAI only ever benchmarks
         | against their own models.
        
           | oofbaroomf wrote:
           | Oh, ok. But it's still quite telling of their attitude as an
           | organization.
        
             | rvnx wrote:
             | It's the same organization that kept repeating that sharing
             | weights of GPT would be "too dangerous for the world".
             | Eventually DeepSeek thankfully did something like that,
             | though they are supposed to be the evil guys.
        
         | kcatskcolbdi wrote:
         | I don't mind what they benchmark against as long as, when I use
         | the model, it continues to give me better results than their
         | competition.
        
         | gizmodo59 wrote:
          | Apple compares against its own products most of the time.
        
       | asdev wrote:
       | it's worse than 4.5 on nearly every benchmark. just an
       | incremental improvement. AI is slowing down
        
         | conradkay wrote:
         | It's like 30x cheaper though. Probably just distilled 4.5
        
         | GaggiX wrote:
         | It's better on AIME '24, Multilingual MMLU, SWE-bench, Aider's
         | polyglot, MMMU, ComplexFuncBench while being much much cheaper
         | and smaller.
        
           | asdev wrote:
           | and it's worse on just as many benchmarks by a significant
           | amount. as a consumer I don't care about cheapness, I want
           | the maximum accuracy and performance
        
             | GaggiX wrote:
             | As a consumer you care about speed tho, and GPT-4.5 is
             | extremely slow, at this point just use a reasoning model if
             | you want the best of the best.
        
         | simianwords wrote:
         | Sorry what is the source for this?
        
         | Nckpz wrote:
         | They don't disclose parameter counts so it's hard to say
         | exactly how far apart they are in terms of size, but based on
         | the pricing it seems like a pretty wild comparison, with one
         | being an attempt at an ultra-massive SOTA model and one being a
         | model scaled down for efficiency and probably distilled from
         | the big one. The way they're presented as version numbers is
         | business nonsense which obscures a lot about what's going on.
        
         | usaar333 wrote:
         | Or OpenAI is? After using Gemini 2.5, I did not feel "AI is
         | slowing down". It's just this model isn't SOTA.
        
         | HDThoreaun wrote:
          | Maybe progress is slowing down, but after using Gemini 2.5
          | there is clearly still a lot being made.
        
       | elashri wrote:
        | Are there any benchmarks, or has anyone tested how these
        | long-max-token models perform in scenarios where you actually
        | use more of the token limit?
        | 
        | From my experience with Gemini models, after ~200k the quality
        | drops and it basically doesn't keep track of things. But I don't
        | have any numbers or a systematic study of this behavior.
       | 
       | I think all providers who announce increased max token limit
       | should address that. Because I don't think it is useful to just
       | say that max allowed tokens are 1M when you basically cannot use
       | anything near that in practice.
        
         | gymbeaux wrote:
         | I'm not optimistic. It's the Wild West and comparing models for
         | one's specific use case is difficult, essentially impossible at
         | scale.
        
         | enginoid wrote:
         | There are some benchmarks such as Fiction.LiveBench[0] that
         | give an indication and the new Graphwalks approach looks super
         | interesting.
         | 
         | But I'd love to see one specifically for "meaningful coding."
         | Coding has specific properties that are important such as
         | variable tracking (following coreference chains) described in
         | RULER[1]. This paper also cautions against Single-Needle-In-
         | The-Haystack tests which I think the OpenAI one might be. You
         | really need at least Multi-NIAH for it to tell you anything
         | meaningful, which is what they've done for the Gemini models.
         | 
         | I think something a bit more interpretable like `pass@1 rate
          | for coding turns at 128k` would be so much more useful than "we
         | have 1m context" (with the acknowledgement that good-enough
         | performance is often domain dependant)
         | 
         | [0] https://fiction.live/stories/Fiction-liveBench-
         | Mar-25-2025/o...
         | 
         | [1] https://arxiv.org/pdf/2404.06654
        
         | jbentley1 wrote:
         | https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...
         | 
         | IMO this is the best long context benchmark. Hopefully they
         | will run it for the new models soon. Needle-in-a-haystack is
         | useless at this point. Llama-4 had perfect needle in a haystack
          | results but horrible real-world performance.
        
         | kmeisthax wrote:
         | The problem is that while you can train a model with the
         | hyperparameter of "context size" set to 1M, there's very little
         | 1M data to train on. Most of your model's ability to follow
         | long context comes from the fact that it's trained on lots of
         | (stolen) books; in fact I believe OpenAI just outright said _in
          | court_ that they can't do long context without training on
         | books.
         | 
         | Novels are usually measured in terms of words; and there's a
         | rule of thumb that four tokens make up about three words. So
         | that 200k token wall you're hitting is right when most authors
         | stop writing. 150k is _already_ considered long for a novel,
         | and to train 1M properly, you 'd need not only a 750k book, but
         | many of them. Humans just don't write or read that much text at
         | once.
         | 
         | To get around this, whoever is training these models would need
         | to change their training strategy to either:
         | 
         | - Group books in a series together as a single, very long text
         | to be trained on
         | 
         | - Train on multiple unrelated books at once in the same context
         | window
         | 
         | - Amplify the gradients by the length of the text being trained
         | on so that the fewer long texts that do exist have greater
         | influence on the model weights as a whole.
         | 
          | I _suspect_ they're doing #2, just to get _some_ gradients
          | onto the longer end of the context window, but that also is
          | going to diminish long-context reasoning because there's no
         | reason for the model to develop a connection between, say,
         | token 32 and token 985,234.
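          | 
          | Purely as an illustration of the third option above (nobody
          | outside these labs knows what they actually do), a
          | length-weighted loss could look something like this
          | hypothetical PyTorch sketch:
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     def length_weighted_loss(logits, targets, lengths):
          |         # logits: (batch, seq, vocab), targets: (batch, seq),
          |         # lengths: (batch,) true sequence lengths before padding.
          |         per_token = F.cross_entropy(
          |             logits.transpose(1, 2), targets, reduction="none")
          |         per_example = per_token.mean(dim=1)
          |         # Longer documents get proportionally more weight, so the
          |         # few genuinely long texts pull harder on the gradients.
          |         weights = lengths.float() / lengths.float().mean()
          |         return (weights * per_example).mean()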
        
           | nneonneo wrote:
           | I mean, can't they just train on some huge codebases? There's
           | lots of 100KLOC codebases out there which would probably get
           | close to 1M tokens.
        
           | roflmaostc wrote:
           | What about old books? Wikipedia? Law texts? Programming
           | languages documentations?
           | 
           | How many tokens is a 100 pages PDF? 10k to 100k?
        
             | arvindh-manian wrote:
             | For reference, I think a common approximation is one token
             | being 0.75 words.
             | 
             | For a 100 page book, that translates to around 50,000
             | tokens. For 1 mil+ tokens, we need to be looking at 2000+
             | page books. That's pretty rare, even for documentation.
             | 
             | It doesn't have to be text-based, though. I could see films
             | and TV shows becoming increasingly important for long-
             | context model training.
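              | 
              | Back-of-the-envelope, assuming ~375 words per page (my
              | assumption) and the 0.75 words-per-token figure above:
              | 
              |     words_per_page = 375    # rough assumption
              |     words_per_token = 0.75  # common approximation
              |     pages = 100
              |     print(pages * words_per_page / words_per_token)
              |     # -> 50000.0 tokens for a 100-page book
              |     print(1_000_000 * words_per_token / words_per_page)
              |     # -> 2000.0 pages to fill a 1M-token context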
        
               | handfuloflight wrote:
               | What about the role of synthetic data?
        
               | throwup238 wrote:
               | Synthetic data requires a discriminator that can select
               | the highest quality results to feed back into training.
               | Training a discriminator is easier than a full blown LLM,
               | but it still suffers from a lack of high quality training
               | data in the case of 1M context windows. How do you train
               | a discriminator to select good 2,000 page synthetic books
               | if the only ones you have to train it with are Proust and
               | concatenated Harry Potter/Game of Thrones/etc.
        
             | jjmarr wrote:
             | Wikipedia does not have many pages that are 750k words.
             | According to Special:LongPages[1], the longest page _right
             | now_ is a little under 750k bytes.
             | 
             | https://en.wikipedia.org/wiki/List_of_chiropterans
             | 
             | Despite listing all presently known bats, the majority of
             | "list of chiropterans" byte count is code that generates
             | references to the IUCN Red List, not actual text. Most of
             | Wikipedia's longest articles are code.
             | 
             | [1] https://en.wikipedia.org/wiki/Special:LongPages
        
           | crimsoneer wrote:
           | Isn't the problem more that the "needle in a haystack" eval
           | (I said word X once; where?) is really not relevant to most
           | long-context LLM use cases like code, where you need the
           | context from all the stuff simultaneously rather than
           | identifying a single, quite separate relevant section?
        
           | wskish wrote:
           | codebases of high quality open source projects and their
           | major dependencies are probably another good source. also:
           | "transformative fair use", not "stolen"
        
           | omneity wrote:
           | I'm not sure to what extent this opinion is accurately
           | informed. It is well known that nobody trains on 1M-token-
           | long content. It wouldn't work anyway, as the dependencies
           | are too far apart and you end up with vanishing gradients.
           | 
           | RoPE (Rotary Positional Embeddings, think modulo or periodic
           | arithmetic) scaling is key, whereby the model is trained on
           | 16k-token-long content and then scaled up to 100k+ [0]. Qwen
           | 1M (which has near-perfect recall over the complete window
           | [1]) and Llama 4 10M pushed the limits of this technique,
           | with Qwen training reliably with a much higher RoPE base, and
           | Llama 4 coming up with iRoPE, which claims scaling to
           | extremely long contexts, up to infinity.
           | 
           | [0]: https://arxiv.org/html/2310.05209v2
           | 
           | [1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-
           | retriev...
        
             | christianqchung wrote:
             | But Llama 4 Scout does badly on long context benchmarks
             | despite claiming 10M. It scores 1 slot above Llama 3.1 8B
             | in this one[1].
             | 
             | [1] https://github.com/adobe-research/NoLiMa
        
               | omneity wrote:
               | Indeed, but it does not take away the fact that long
               | context is not trained through long content but by
               | scaling short content instead.
        
             | kmeisthax wrote:
             | Is there any evidence that GPT-4.1 is using RoPE to scale
             | context?
             | 
             | Also, I don't know about Qwen, but I know Llama 4 has
             | severe performance issues, so I wouldn't use that as an
             | example.
        
               | omneity wrote:
               | I am not sure about public evidence. But the memory
               | requirements alone to train on 1M long windows would make
               | it a very unrealistic proposition compared to RoPE
               | scaling. And as I mentioned RoPE is essential for long
               | context anyway. You can't train it in the "normal way".
               | Please see the paper I linked previously for more context
               | (pun not intended) on RoPE.
               | 
               | Re: Llama 4, please see the sibling comment.
        
         | daemonologist wrote:
         | I ran NoLiMa on Quasar Alpha (GPT-4.1's stealth mode):
         | https://news.ycombinator.com/item?id=43640166#43640790
         | 
         | Updated results from the authors: https://github.com/adobe-
         | research/NoLiMa
         | 
         | It's the best known performer on this benchmark, but still
         | falls off quickly at even relatively modest context lengths
         | (85% perf at 16K). (Cutting edge reasoning models like Gemini
         | 2.5 Pro haven't been evaluated due to their cost and might
         | outperform it.)
        
       | soheil wrote:
       | Main takeaways:
       | 
       | - Coding accuracy improved dramatically
       | 
       | - Handles 1M-token context reliably
       | 
       | - Much stronger instruction following
        
       | theturtletalks wrote:
       | With these being 1M context size, does that all but confirm that
       | Quasar Alpha and Optimus Alpha were cloaked OpenAI models on
       | OpenRouter?
        
         | atemerev wrote:
         | Yes, confirmed by citing Aider benchmarks:
         | https://openai.com/index/gpt-4-1/
         | 
         | Which means that these models are _absolutely_ not SOTA, and
         | Gemini 2.5 pro is much better, and Sonnet is better, and even
         | R1 is better.
         | 
         | Sorry Sam, you are losing the game.
        
           | Tinkeringz wrote:
           | Aren't all of these reasoning models?
           | 
           | Won't the reasoning models of openAI benchmarked against
           | these be a test of if Sam is losing?
        
             | atemerev wrote:
             | There is no OpenAI model better than R1, reasoning or not
             | (as confirmed by the same Aider benchmark; non-coding tests
             | are less objective, but I think it still holds).
             | 
             | With Gemini (current SOTA) and Sonnet (great potential, but
             | tends to overengineer/overdo things) it is debatable, they
             | are probably better than R1 (and all OpenAI models by
             | extension).
        
         | arvindh-manian wrote:
         | I think Quasar is fairly confirmed [0] to be OpenAI.
         | 
         | [0] https://x.com/OpenAI/status/1911782243640754634
        
         | phoe18 wrote:
         | Yes, OpenRouter confirmed it here -
         | https://x.com/OpenRouterAI/status/1911833662464864452
        
       | jmkni wrote:
       | The increased context length is interesting.
       | 
       | It would be incredible to be able to feed an entire codebase into
       | a model and say "add this feature" or "we're having a bug where X
       | is happening, tell me why", but then you are limited by the
       | output token length
       | 
       | As others have pointed out, the more tokens you use, the less
       | accuracy you get and the more it gets confused; I've noticed this
       | too.
       | 
       | We are a ways away yet from being able to input an entire
       | codebase, and have it give you back an updated version of that
       | codebase.
        
       | impure wrote:
       | I like how Nano matches Gemini 2.0 Flash's price. That will help
       | drive down prices which will be good for my app. However I don't
       | like how Nano behaves worse than 4o Mini in some benchmarks.
       | Maybe it will be good enough, we'll see.
        
         | chaos_emergent wrote:
         | Theory here is that 4.1-nano is competing with that tier, 4.1
         | with flash-thinking (although likely to do significantly
         | worse), and o4-mini or o3-large will compete with 2.5 thinking
        
         | pzo wrote:
         | Yeah, and considering that Gemini 2.0 Flash is much better than
         | 4o-mini. On top of that, Gemini also has audio input as a
         | modality, a realtime API for both audio input and output, web
         | search grounding, and a free tier.
        
       | pcwelder wrote:
       | Can someone explain to me why we should take Aider's polyglot
       | benchmark seriously?
       | 
       | All the solutions are already available on the internet on which
       | various models are trained, albeit in various ratios.
       | 
       | Any variance could likely be due to the mix of the data.
        
         | meroes wrote:
         | To join in the faux rigor?
        
         | philipbjorge wrote:
         | If you're looking to test an LLMs ability to solve a coding
         | task without prior knowledge of the task at hand, I don't think
         | their benchmark is super useful.
         | 
         | If you care about understanding relative performance between
         | models for solving known problems and producing correct output
         | format, it's pretty useful.
         | 
         | - Even for well-known problems, we see a large distribution of
         | quality between models (5 to 75% correctness)
         | 
         | - Additionally, we see a large distribution of models' ability
         | to produce responses in the formats they were instructed to use
         | 
         | At the end of the day, benchmarks are pretty fuzzy, but I
         | always welcome a formalized benchmark as a means to understand
         | model performance over vibe checking.
        
       | asdev wrote:
       | > We will also begin deprecating GPT-4.5 Preview in the API, as
       | GPT-4.1 offers improved or similar performance on many key
       | capabilities at much lower cost and latency.
       | 
       | why would they deprecate when it's the better model? too
       | expensive?
        
         | ComputerGuru wrote:
         | > why would they deprecate when it's the better model? too
         | expensive?
         | 
         | Too expensive, but not for them - for their customers. The only
         | reason they'd deprecate it is if it wasn't seeing enough usage
         | to be worth keeping up, and that probably stems from it being
         | insanely more expensive and slower than everything else.
        
         | tootyskooty wrote:
         | sits on too many GPUs, they mentioned it during the stream
         | 
         | I'm guessing the (API) demand isn't there to saturate them
         | fully
        
         | simianwords wrote:
         | Where did you find that 4.5 is a better model? Everything from
         | the video told me that 4.5 was largely a mistake and 4.1 beats
         | 4.5 at everything. There's no point keeping 4.5 at this point.
        
           | rob wrote:
           | Bigger numbers are supposed to mean better. 3.5, 4, 4.5.
           | Going from 4 to 4.5 to 4.1 seems weird to most people. If
           | it's better, it should have been GPT-4.6 or 5.0 or something
           | else, not a downgraded number.
        
             | HDThoreaun wrote:
             | OpenAI has decided to troll via crappy naming conventions
             | as a sort of in joke. Sam Altman tweets about it pretty
             | often
        
       | taikahessu wrote:
       | > They feature a refreshed knowledge cutoff of June 2024.
       | 
       | As opposed to Gemini 2.5 Pro having a cutoff of Jan 2025.
       | 
       | Honestly this feels underwhelming and surprising. Especially if
       | you're coding with frameworks with breaking changes, this can
       | hurt you.
        
         | forbiddenvoid wrote:
         | It's definitely an issue. Even the simplest use case of "create
         | React app with Vite and Tailwind" is broken with these models
         | right now because they're not up to date.
        
           | asadm wrote:
           | Enabling "Search" sometimes fixes it, as they fetch the newer
           | methods.
        
           | lukev wrote:
           | Time to start moving back to Java & Spring.
           | 
           | 100% backwards compatibility and well represented in 15 years
           | worth of training data, hah.
        
             | speedgoose wrote:
             | Write once, run nowhere.
        
           | Zambyte wrote:
           | By "broken" you mean it doesn't use the latest and greatest
           | hot trend, right? Or does it literally not work?
        
             | dbbk wrote:
             | Periodically I keep trying these coding models in Copilot
             | and I have yet to have an experience where it produced
             | working code with a pretty straightforward TypeScript
             | codebase. Specifically, it cannot for the life of it
             | produce working Drizzle code. It will hallucinate methods
             | that don't exist despite throwing bright red type errors.
             | Does it even check for TS errors?
        
               | dalmo3 wrote:
               | Not sure about Copilot, but the Cursor agent runs both
               | eslint and tsc by default and fixes the errors
               | automatically. You can tell it to run tests too, and
               | whatever other tools. I've had a good experience writing
               | drizzle schemas with it.
        
             | taikahessu wrote:
              | It has been really frustrating learning Godot 4.4.x (or
              | any new technology you are not familiar with) with GPT-4o,
              | or even worse, with custom GPTs which use the older GPT-4
              | Turbo.
              | 
              | As you are new to the field, it kinda doesn't make sense to
              | pick an older version. It would be better to have no data
              | than incorrect data. You literally have to include the
              | version number in every prompt and even that doesn't
              | guarantee a right result! Sometimes I have to play truth or
              | dare three times before we finally find the right names and
              | instructions. Yes, I have the version info in all custom
              | information dialogs, but it is not as effective as
              | including it in the prompt itself.
             | 
             | Searching the web feels like an on-going "I'm feeling
             | lucky" mode. Anyway, I still happen to get some real
             | insights from GPT4o, even though Gemini 2.5 Pro has proven
             | far superior for larger and more difficult contexts /
             | problems.
             | 
             | The best storytelling ideas have come from GPT 4.5. Looking
             | forward to testing this new 4.1 as well.
        
               | jonfw wrote:
               | hey- curious what your experience has been like learning
               | godot w/ LLM tooling.
               | 
               | are you doing 3d? The 3D tutorial ecosystem is very GUI
               | heavy and I have had major problems trying to get godot
               | to do anything 3D
        
           | yokto wrote:
           | Whenever an LLM struggles with a particular library version,
           | I use Cursor Rules to auto-include migration information and
           | that generally worked well enough in my cases.
        
           | tengbretson wrote:
           | A few weeks back I couldn't even get ChatGPT to output
           | TypeScript code that correctly used the OpenAI SDK.
        
             | seuros wrote:
              | You should give it documentation; it can't guess.
        
           | alangibson wrote:
           | Try getting then to output Svelte 5 code...
        
             | division_by_0 wrote:
             | Svelte 5 is the antidote to vibe coding.
        
           | int_19h wrote:
           | Maybe LLMs will be the forcing function to finally slow down
           | the crazy pace of changing (and breaking) things in
           | JavaScript land.
        
         | TIPSIO wrote:
         | It is annoying. The bigger, cheaper context windows help this a
         | little though:
         | 
         | E.g.: If context windows get big and cheap enough (as things
         | are trending), hopefully you can just dump the entire docs,
         | examples, and more in every request.
        
       | j_maffe wrote:
       | OAI are so ahead of the competition, they don't need to compare
       | with the competition anymore /s
        
         | neal_ wrote:
         | hahahahaha
        
       | forbiddenvoid wrote:
       | Lots of improvements here (hopefully), but still no image
       | generation updates, which is what I'm most eager for right now.
        
         | taikahessu wrote:
         | Or text to speech generation ... but I guess that is coming.
        
           | dharmab wrote:
           | Yeah, I tried the 4o models and they severely mispronounced
           | common words and read numbers incorrectly (eg reading 16000
           | as 1600)
        
         | Tinkeringz wrote:
         | They just released a new image generation model a couple of
         | weeks ago; why are you eager for another one so soon?
        
           | nanook wrote:
           | Are the image generation improvements available via API?
           | Don't think so
        
       | ComputerGuru wrote:
       | The benchmarks and charts they have up are frustrating because
       | they don't include o3-mini(-high) which they've been pushing as
       | the low-latency+low-cost smart model to use for coding challenges
       | instead of 4o and 4o-mini. Why won't they include that in the
       | charts?
        
       | marsh_mellow wrote:
       | From OpenAI's announcement:
       | 
       | > Qodo tested GPT-4.1 head-to-head against Claude Sonnet 3.7 on
       | generating high-quality code reviews from GitHub pull requests.
       | Across 200 real-world pull requests with the same prompts and
       | conditions, they found that GPT-4.1 produced the better
       | suggestion in 55% of cases. Notably, they found that GPT-4.1
       | excels at both precision (knowing when not to make suggestions)
       | and comprehensiveness (providing thorough analysis when
       | warranted).
       | 
       | https://www.qodo.ai/blog/benchmarked-gpt-4-1/
        
         | arvindh-manian wrote:
         | Interesting link. Worth noting that the pull requests were
         | judged by o3-mini. Further, I'm not sure that 55% vs 45% is a
         | huge difference.
        
           | marsh_mellow wrote:
           | Good point. They said they validated the results by testing
           | with other models (including Claude), as well as with manual
           | sanity checks.
           | 
           | 55% to 45% definitely isn't a blowout but it is meaningful --
           | in terms of ELO it equates to about a 36 point difference. So
           | not in a different league but definitely a clear edge
        
         | InkCanon wrote:
         | >4.1 Was better in 55% of cases
         | 
         | Um, isn't that just a fancy way of saying it is slightly better
         | 
         | >Score of 6.81 against 6.66
         | 
         | So very slightly better
        
           | kevmo314 wrote:
           | A great way to upsell 2% better! I should start doing that.
        
             | neuroelectron wrote:
             | Good marketing if you're selling a discount all purpose
             | cleaner, not so much for an API.
        
           | marsh_mellow wrote:
           | I don't think the absolute score means much -- judge models
           | have a tendency to score around 7/10 lol
           | 
           | 55% vs. 45% equates to about a 36-point difference in Elo. In
           | chess, that would be two players in the same league, but one
           | with a clear edge.
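           | 
           | For reference, that figure follows from the logistic Elo
           | model, p = 1 / (1 + 10^(-d/400)):
           | 
           |     import math
           |     p = 0.55                           # observed win rate
           |     d = 400 * math.log10(p / (1 - p))
           |     print(round(d))                    # ~35 Elo points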
        
             | kevmo314 wrote:
             | Rarely are two models put head-to-head though. If Claude
             | Sonnet 3.7 isn't able to generate a good PR review (for
             | whatever reason), a 2% better review isn't all that strong
             | of a value proposition.
        
               | swyx wrote:
               | the point is oai is saying they have a viable Claude
               | Sonnet competitor now
        
           | wiz21c wrote:
           | "they found that GPT-4.1 excels at both precision..."
           | 
           | They didn't say it is better than Claude at precision etc.
           | Just that it excels.
           | 
           | Unfortunately, AI has still not concluded that manipulation
           | by the marketing dept is a plague...
        
         | jsnell wrote:
         | That's not a lot of samples for such a small effect, I don't
         | think it's statistically significant (p-value of around 10%).
        
           | swyx wrote:
           | is there a shorthand/heuristic to calculate pvalue given n
           | samples and effect size?
        
           | marsh_mellow wrote:
           | p-value of 7.9% -- so very close to statistical significance.
           | 
           | the p-value for GPT-4.1 having a win rate of at least 49% is
           | 4.92%, so we can say conclusively that GPT-4.1 is at least
           | (essentially) evenly matched with Claude Sonnet 3.7, if not
           | better.
           | 
           | Given that Claude Sonnet 3.7 has been generally considered to
           | be the best (non-reasoning) model for coding, and given that
           | GPT-4.1 is substantially cheaper ($2/million input,
           | $8/million output vs. $3/million input, $15/million output),
           | I think it's safe to say that this is significant news,
           | although not a game changer
        
             | jsnell wrote:
             | I make it 8.9% with a binomial test[0]. I rounded that to
             | 10%, because any more precision than that was not
             | justified.
             | 
             | Specifically, the results from the blog post are
             | impossible: with 200 samples, you can't possibly have the
             | claimed 54.9/45.1 split of binary outcomes. Either they
             | didn't actually make 200 tests but some other number, they
             | didn't actually get the results they reported, or they did
             | some kind of undocumented data munging like excluding all
             | tied results. In any case, the uncertainty about the input
             | data is larger than the uncertainty from the rounding.
             | 
             | [0] In R, binom.test(110, 200, 0.5, alternative="greater")
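             | 
             | The same test in Python, for anyone without R handy (scipy
             | 1.7+, using the same assumed 110/200 split):
             | 
             |     from scipy.stats import binomtest
             | 
             |     print(binomtest(110, 200, 0.5,
             |                     alternative="greater").pvalue)  # ~0.089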
        
       | simianwords wrote:
       | Could anyone guess the reason why they didn't ship this in the
       | chat UI?
        
         | KoolKat23 wrote:
         | The memory thing? More resource intensive?
        
       | nikcub wrote:
       | Easy to miss in the announcement that 4.5 is being shut down
       | 
       | > GPT-4.5 Preview will be turned off in three months, on July 14,
       | 2025
        
         | OxfordOutlander wrote:
         | Juice not worth the squeeze I imagine. 4.5 is chonky, and
         | having to reserve GPU space for it must not have been worth it.
         | Makes sense to me - I hadn't found anything it was so much
         | better at that it was worth the incremental cost over Sonnet
         | 3.7 or o3-mini.
        
       | pcwelder wrote:
       | Did some quick tests. I believe it's the same model as Quasar. It
       | struggles with the agentic loop [1]. You'd have to force it to do
       | tool calls.
       | 
       | Tool use ability feels better than gemini-2.5-pro-exp [2], which
       | struggles with JSON schema understanding sometimes.
       | 
       | Llama 4 has surprising agentic capabilities, better than both of
       | them [3], but isn't as intelligent as the others.
       | 
       | [1]
       | https://github.com/rusiaaman/chat.md/blob/main/samples/4.1/t...
       | 
       | [2]
       | https://github.com/rusiaaman/chat.md/blob/main/samples/gemin...
       | 
       | [3]
       | https://github.com/rusiaaman/chat.md/blob/main/samples/llama...
        
         | ludwik wrote:
         | Correct. They've mentioned the name during the live
         | announcement - https://www.youtube.com/live/kA-P9ood-
         | cE?si=GYosi4FtX1YSAujE...
        
       | simonw wrote:
       | Here's a summary of this Hacker News thread created by GPT-4.1
       | (the full sized model) when the conversation hit 164 comments:
       | https://gist.github.com/simonw/93b2a67a54667ac46a247e7c5a2fe...
       | 
       | I think it did very well - it's clearly good at instruction
       | following.
       | 
       | Total token cost: 11,758 input, 2,743 output = 4.546 cents.
       | 
       | Same experiment run with GPT-4.1 mini:
       | https://gist.github.com/simonw/325e6e5e63d449cc5394e92b8f2a3...
       | (0.8802 cents)
       | 
       | And GPT-4.1 nano:
       | https://gist.github.com/simonw/1d19f034edf285a788245b7b08734...
       | (0.2018 cents)
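       | 
       | For anyone checking the arithmetic, that figure matches the $2
       | per million input / $8 per million output pricing quoted
       | elsewhere in the thread:
       | 
       |     input_tokens, output_tokens = 11_758, 2_743
       |     cost_usd = input_tokens * 2 / 1e6 + output_tokens * 8 / 1e6
       |     print(f"{cost_usd * 100:.3f} cents")  # 4.546 cents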
        
       | swyx wrote:
       | don't miss that OAI also published a prompting guide WITH
       | RECEIPTS for GPT 4.1 specifically for those building agents...
       | with a new recommendation for:
       | 
       | - telling the model to be persistent (+20%)
       | 
       | - don't self-inject/parse toolcalls (+2%)
       | 
       | - prompted planning (+4%)
       | 
       | - JSON BAD - use XML or arxiv 2406.13121 (GDM format)
       | 
       | - put instructions + user query at TOP -and- BOTTOM - bottom-only
       | is VERY BAD
       | 
       | - no evidence that ALL CAPS or Bribes or Tips or threats to
       | grandma work
       | 
       | source:
       | https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...
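       | 
       | As a rough illustration of the recommended layout (instructions
       | repeated at top and bottom, XML-style delimiters rather than
       | JSON; the file name and strings here are made up):
       | 
       |     instructions = (
       |         "You are reviewing the document below. "
       |         "Keep going until the question is fully answered."
       |     )
       |     document = open("long_context.txt").read()  # hypothetical input
       |     question = "Which modules changed between v1 and v2?"
       | 
       |     prompt = (
       |         f"{instructions}\n\n"
       |         f"<document>\n{document}\n</document>\n\n"
       |         f"{question}\n\n"
       |         f"{instructions}"  # repeated at the bottom as well
       |     )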
        
         | simonw wrote:
         | I'm surprised and a little disappointed by the result
         | concerning instructions at the top, because it's incompatible
         | with prompt caching: I would much rather cache the part of the
         | prompt that includes the long document and then swap out the
         | user question at the end.
        
           | swyx wrote:
           | yep. we address it in the podcast. presumably this is just a
           | recent discovery and can be post-trained away.
        
             | aoeusnth1 wrote:
             | If you're skimming a text to answer a specific question,
             | you can go a lot faster than if you have to memorize the
             | text well enough to answer an unknown question after the
             | fact.
        
           | zaptrem wrote:
           | Prompt on bottom is also easier for humans to read as I can
           | have my actual question and the model's answer on screen at
           | the same time instead of scrolling through 70k tokens of
           | context between them.
        
           | mmoskal wrote:
           | The way I understand it: if the instruction are at the top,
           | the KV entries computed for "content" can be influenced by
           | the instructions - the model can "focus" on what you're
           | asking it to do and perform some computation, while it's
           | "reading" the content. Otherwise, you're completely relaying
           | on attention to find the information in the content, leaving
           | it much less token space to "think".
        
         | swyx wrote:
         | references for all the above + added more notes here on pricing
         | https://x.com/swyx/status/1911849229188022278
         | 
         | and we'll be publishing our 4.1 pod later today
         | https://www.youtube.com/@latentspacepod
        
         | pton_xd wrote:
         | As an aside, one of the worst aspects of the rise of LLMs, for
         | me, has been the wholesale replacement of engineering with
         | trial-and-error hand-waving. Try this, or maybe that, and maybe
         | you'll see a +5% improvement. Why? Who knows.
         | 
         | It's just not how I like to work.
        
           | pclmulqdq wrote:
           | Software engineering has involved a lot of people doing
           | trial-and-error hand-waving for at least a decade. We are now
           | codifying the trend.
        
           | zoogeny wrote:
           | I think trial-and-error hand-waving isn't all that far from
           | experimentation.
           | 
           | As an aside, I was working in the games industry when multi-
           | core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the
           | exact consoles but there was one generation where the major
           | platforms all went multi-core.
           | 
           | No one knew how to best use the multi-core systems for
           | gaming. I attended numerous tech talks by teams that had
            | tried different approaches and were giving similar "maybe do
           | this and maybe see x% improvement?". There was a lot of
           | experimentation. It took a few years before things settled
           | and best practices became even somewhat standardized.
           | 
           | Some people found that era frustrating and didn't like to
           | work in that way. Others loved the fact it was a wide open
           | field of study where they could discover things.
        
             | jorvi wrote:
             | Yes, it was the generation of the X360 and PS3. X360 was
             | triple core and the PS3 was 1+7 core (sort of a big.little
             | setup).
             | 
             | Although it took many, many more years until games started
             | to actually use multi-core properly. With rendering being
             | on a 16.67ms / 8.33ms budget and rendering tied to world
             | state, it was just really hard to not tie everything into
              | each other.
             | 
             | Even today you'll usually only see 2-4 cores actually
             | getting significant load.
        
           | brokencode wrote:
           | Out of curiosity, what do you work on where you don't have to
           | experiment with different solutions to see what works best?
        
             | FridgeSeal wrote:
             | Usually when we're doing it in practice there's _somewhat_
             | more awareness of the mechanics than just throwing random
             | obstructions in and hoping for the best.
        
               | RussianCow wrote:
               | LLMs are still very young. We'll get there in time. I
               | don't see how it's any different than optimizing for new
               | CPU/GPU architectures other than the fact that the latter
               | is now a decades-old practice.
        
               | girvo wrote:
               | > I don't see how it's any different than optimizing for
               | new CPU/GPU architectures
               | 
               | I mean that seems wild to say to me. Those architectures
               | have documentation and aren't magic black boxes that we
               | chuck inputs at and hope for the best: we do pretty much
               | that with LLMs.
               | 
               | If that's how you optimise, I'm genuinely shocked.
        
             | greenchair wrote:
             | most people are building straightforward crud apps. no
             | experimentation required.
        
               | RussianCow wrote:
               | [citation needed]
               | 
               | In my experience, even simple CRUD apps generally have
               | some domain-specific intricacies or edge cases that take
               | some amount of experimentation to get right.
        
               | brokencode wrote:
               | Idk, it feels like this is what you'd expect versus the
               | actual reality of building something.
               | 
               | From my experience, even building on popular platforms,
               | there are many bugs or poorly documented behaviors in
               | core controls or APIs.
               | 
               | And performance issues in particular can be difficult to
               | fix without trial and error.
        
           | kitsunemax wrote:
            | I feel like this is a common pattern with people who work in
           | STEM. As someone who is used to working with formal proofs,
           | equations, math, having a startup taught me how to rewire
           | myself to work with the unknowns, imperfect solutions, messy
           | details. I'm going on a tangent, but just wanted to share.
        
         | behnamoh wrote:
         | > - JSON BAD - use XML or arxiv 2406.13121 (GDM format)
         | 
         | And yet, all function calling and MCP is done through JSON...
        
           | CSMastermind wrote:
           | Yeah anyone who has worked with these models knows how much
           | they struggle with JSON inputs.
        
           | swyx wrote:
           | JSON is just MCP's transport layer. you can reformat to xml
           | to pass into model
        
         | Havoc wrote:
         | >- dont self-inject/parse toolcalls (+2%)
         | 
         | What is meant by this?
        
           | intalentive wrote:
           | Use the OpenAI API/SDK for function calling instead of
           | rolling your own inside the prompt.
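           | 
           | i.e. pass a tools list and let the API return structured tool
           | calls instead of asking the model to emit calls in plain text
           | and parsing them yourself. A minimal sketch with the OpenAI
           | Python SDK (the weather tool is made up):
           | 
           |     from openai import OpenAI
           | 
           |     client = OpenAI()
           |     tools = [{
           |         "type": "function",
           |         "function": {
           |             "name": "get_weather",  # hypothetical tool
           |             "description": "Get current weather for a city",
           |             "parameters": {
           |                 "type": "object",
           |                 "properties": {"city": {"type": "string"}},
           |                 "required": ["city"],
           |             },
           |         },
           |     }]
           | 
           |     resp = client.chat.completions.create(
           |         model="gpt-4.1",
           |         messages=[{"role": "user",
           |                    "content": "Weather in Oslo?"}],
           |         tools=tools,
           |     )
           |     print(resp.choices[0].message.tool_calls)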
        
         | minimaxir wrote:
         | > no evidence that ALL CAPS or Bribes or Tips or threats to
         | grandma work
         | 
         | Challenge accepted.
         | 
         | That said, the exact quote from the linked notebook is "It's
         | generally not necessary to use all-caps or other incentives
         | like bribes or tips, but developers can experiment with this
         | for extra emphasis if so desired.", but the demo examples
         | OpenAI provides do like using ALL CAPS.
        
         | kristianp wrote:
         | The size of that SWE-bench Verified prompt shows how much work
         | has gone into the prompt to get the highest possible score for
         | that model. A third party might go to a model from a different
         | provider before going to that extent of fine-tuning of the
         | prompt.
        
       | frognumber wrote:
       | Marginally on-topic: I'd love if the charts included prior
       | models, including GPT 4 and 3.5.
       | 
       | Not all systems upgrade every few months. A major question is
       | when we reach step-improvements in performance warranting a re-
       | eval, redesign of prompts, etc.
       | 
       | There's a small bleeding edge, and a much larger number of
       | followers.
        
       | bartkappenburg wrote:
       | By leaving out the scale or prior models, they are effectively
       | exaggerating the improvement. If going from 3 to 4 took a score
       | from 10 to 80, and going from 4 to 4o took it from 80 to 82,
       | leaving out 3 shows a steep line instead of a steep slowdown in
       | growth.
       | 
       | Lies, damn lies and statistics ;-)
        
       | growt wrote:
       | My theory: they need to move off the 4o version number before
       | releasing o4-mini next week or so.
        
         | kgeist wrote:
         | The 'oN' schema was such a strange choice for branding. They
         | had to skip 'o2' because it's already trademarked, and now 'o4'
         | can easily be confused with '4o'.
        
       | neal_ wrote:
       | The better the benchmarks, the worse the model is. Subjectively,
       | for me the more advanced models don't follow instructions and are
       | less capable of implementing features or building stuff. I could
       | not tell a difference in blind testing SOTA models from Gemini,
       | Claude, OpenAI, and DeepSeek. There have been no major
       | improvements in the LLM space since the original models gained
       | popularity. Each release claims to be much better than the last,
       | and every time I have been disappointed and thought it was worse.
       | 
       | First the models stopped putting in effort and felt lazy: tell
       | one to do something and it would tell you to do it yourself. Now
       | it's the opposite and the models go ham changing everything they
       | see; instead of changing one line, SOTA models would rather
       | rewrite the whole project and still not fix the issue.
       | 
       | Two years back I totally thought these models were amazing. I
       | always would test out the newest models and would get hyped up
       | about them. For every problem I had, I thought if I just prompted
       | it differently I could get it to solve this. Oftentimes I have
       | spent hours prompting, starting new chats, adding more context.
       | Now I realize it's kinda useless and it's better to just accept
       | the models where they are, rather than try to make them a one-
       | stop shop, or try to stretch capabilities.
       | 
       | I think this release I won't even test it out; I'm not interested
       | anymore. I'll probably just continue using DeepSeek free and
       | Gemini free. I canceled my OpenAI subscription like 6 months ago,
       | and canceled Claude after the 3.7 disappointment.
        
       | T3uZr5Fg wrote:
       | While impressive that the assistants can use dynamic tools and
       | reason about images, I'm most excited about the improvements to
       | factual accuracy and instruction following. The RAG capabilities
       | with cross-validation seem particularly useful for serious
       | development work, not just toy demos.
        
       | 999900000999 wrote:
       | Have they implemented "I don't know" yet?
       | 
       | I probably spend $100 a month on AI coding, and it's great at
       | small straightforward tasks.
       | 
       | Drop it into a larger codebase and it'll get confused due to
       | context limits, even if the same tool built it in the first
       | place.
       | 
       | Then again, the way things are rapidly improving I suspect I can
       | wait 6 months and they'll have a model that can do what I want.
        
         | cheschire wrote:
         | I wonder if documentation would help to create a carefully and
         | intentionally tokenized overview of the system. Maximize the
         | amount of routine larger scope information provided in minimal
         | tokens in order to leave room for more immediate context.
         | 
         | Similar to the function documentation provides to developers
         | today, I suppose.
        
           | yokto wrote:
           | It does, shockingly well in my experience. Check out this
           | blog post outlining such an approach, called Literate
           | Development by the author:
           | https://news.ycombinator.com/item?id=43524673
        
       | vinhnx wrote:
       | * Flagship GPT-4.1: top-tier intelligence, full endpoints &
       | premium features
       | 
       | * GPT-4.1-mini: balances performance, speed & cost
       | 
       | * GPT-4.1-nano: prioritizes throughput & low cost with
       | streamlined capabilities
       | 
       | All share a 1 million-token context window (vs 128-200k on
       | 4o/o1/o3), excelling in instruction following, tool calls &
       | coding.
       | 
       | Benchmarks vs prior models:
       | 
       | * AIME '24: 48.1% vs 13.1% (~3.7x gain)
       | 
       | * MMLU: 90.2% vs 85.7% (+4.5 pp)
       | 
       | * Video-MME: 72.0% vs 65.3% (+6.7 pp)
       | 
       | * SWE-bench Verified: 54.6% vs 33.2% (+21.4 pp)
        
       | comex wrote:
       | Sam Altman wrote in February that GPT-4.5 would be "our last non-
       | chain-of-thought model" [1], but GPT-4.1 also does not have
       | internal chain-of-thought [2].
       | 
       | It seems like OpenAI keeps changing its plans. Deprecating
       | GPT-4.5 less than 2 months after introducing it also seems
       | unlikely to be the original plan. Changing plans isn't
       | necessarily a bad thing, but I wonder why.
       | 
       | Did they not expect this model to turn out as well as it did?
       | 
       | [1] https://x.com/sama/status/1889755723078443244
       | 
       | [2] https://github.com/openai/openai-
       | cookbook/blob/6a47d53c967a0...
        
         | wongarsu wrote:
         | Maybe that's why they named this model 4.1, despite coming out
         | after 4.5 and supposedly outperforming it. They can pretend
         | GPT-4.5 is the last non-chain-of-thought model by just giving
         | all non-chain-of-thought-models version numbers below 4.5
        
           | chrisweekly wrote:
           | Ok, I know naming things is hard, but 4.1 comes out after
           | 4.5? Just, wat.
        
             | CamperBob2 wrote:
             | For a long time, you could fool models with questions like
             | "Which is greater, 4.10 or 4.5?" Maybe they're still
             | struggling with that at OpenAI.
        
               | ben_w wrote:
               | At this point, I'm just assuming most AI models -- not
               | just OpenAI's -- name themselves. And that they write
               | their own press releases.
        
         | Cheer2171 wrote:
         | Why do you expect to believe a single word Sam Altman says?
        
           | sigmoid10 wrote:
           | Everyone assumed malice when the board fired him for not
           | always being "candid" - but it seems more and more that he's
           | just clueless. He's definitely capable when it comes to
           | raising money as a business, but I wouldn't count on any tech
           | opinion from him.
        
         | observationist wrote:
         | Anyone making claims with a horizon beyond two months about
         | structure or capabilities will be wrong - it's sama's job to
         | show confidence and vision and calm stakeholders, but if you're
         | paying attention to the field, the release and research cycles
         | are still contracting, with no sense of slowing any time soon.
         | I've followed AI research daily since GPT-2, the momentum is
         | incredible, and even if the industry sticks with transformers,
         | there are years left of low hanging fruit and incremental
         | improvements before things start slowing.
         | 
         | There doesn't appear to be anything that these AI models cannot
         | do, in principle, given sufficient data and compute. They've
         | figured out multimodality and complex integration, self play
         | for arbitrary domains, and lots of high-cost longer term
         | paradigms that will push capabilities forwards for at least 2
         | decades in conjunction with Moore's law.
         | 
         | Things are going to continue getting better, faster, and
         | weirder. If someone is making confident predictions beyond
         | those claims, it's probably their job.
        
           | sottol wrote:
           | Maybe that's true for absolute arm-chair-engineering
           | outsiders (like me) but these models are in training for
           | months, training data is probably being prepared year(s) in
           | advance. These models have a knowledge cut-off in 2024 - so
           | they have been in training for a while. There's no way sama
           | did not have a good idea that this non-COT was in the
           | pipeline 2 months ago. It was probably finished training then
           | and undergoing evals.
           | 
           | Maybe
           | 
           | 1. he's just doing his job and hyping OpenAI's competitive
           | advantages (afair most of the competition didn't have decent
           | COT models in Feb), or
           | 
           | 2. something changed and they're releasing models now that
           | they didn't intend to release 2 months ago (maybe because a
           | model they did intend to release is not ready and won't be
           | for a while), or
           | 
           | 3. COT is not really as advantageous as it was deemed to be
           | 2+ months ago and/or computationally too expensive.
        
             | fragmede wrote:
             | With new hardware from Nvidia announced coming out, those
             | months turn into weeks.
        
               | sottol wrote:
               | I doubt it's going to be weeks, the months were already
               | turning into years despite Nvidia's previous advances.
               | 
               | (Not to say that it takes openai years to train a new
               | model, just that the timeline between major GPT releases
               | seems to double... be it for data gathering, training,
               | taking breaks between training generations, ... - either
               | way, model training seems to get harder not easier).
               | 
                | GPT Model | Release Date | Months Since Previous Model
               | 
               | GPT-1 | 11.06.2018
               | 
               | GPT-2 | 14.02.2019 | 8.16
               | 
               | GPT-3 | 28.05.2020 | 15.43
               | 
               | GPT-4 | 14.03.2023 | 33.55
               | 
               | [1]https://www.lesswrong.com/posts/BWMKzBunEhMGfpEgo/when
               | -will-...
        
           | moojacob wrote:
           | > Things are going to continue getting better, faster, and
           | weirder.
           | 
           | I love this. Especially the weirder part. This tech can be
           | useful in every crevice of society and we still have no idea
           | what new creative use cases there are.
           | 
           | Who would've guessed phones and social media would cause mass
           | protests because bystanders could record and distribute
           | videos of the police?
        
             | staunton wrote:
             | > Who would've guessed phones and social media would cause
             | mass protests because bystanders could record and
             | distribute videos of the police?
             | 
             | That would have been quite far down on my list of "major
             | (unexpected) consequences of phones and social media"...
        
           | authorfly wrote:
            | > the release and research cycles are still contracting
           | 
            | Not necessarily the progress or the benchmarks you would
            | look at as a broader picture (MMLU etc.).
            | 
            | GPT-3 was an amazing step up from GPT-2, something
            | scientists in the field really thought was at least 10-15
            | years out, done in 2. Instruct/RHLF for GPTs was a similarly
            | massive splash, making the second half of 2021 equally
            | amazing.
           | 
           | However nothing since has really been that left field or
           | unpredictable from then, and it's been almost 3 years since
           | RHLF hit the field. We knew good image understanding as
           | input, longer context, and improved prompting would improve
           | results. The releases are common, but the progress feels like
           | it has stalled for me.
           | 
           | What really has changed since Davinci-instruct or ChatGPT to
           | you? When making an AI-using product, do you construct it
           | differently? Are agents presently more than APIs talking to
           | databases with private fields?
        
             | hectormalot wrote:
             | In some dimensions I recognize the slow down in how fast
             | new capabilities develop, but the speed still feels very
             | high:
             | 
             | Image generation suddenly went from gimmick to useful now
             | that prompt adherence is so much better (eagerly waiting
             | for that to be in the API)
             | 
             | Coding performance continues to improve noticeably (for
             | me). Claude 3.7 felt like a big step from 4o/3.5. Gemini
              | 2.5 in a similar way. Compared to just 6 months ago I can
             | give bigger and more complex pieces of work to it and get
             | relatively good output back. (Net acceleration)
             | 
             | Audio-2-audio seems like it will be a big step as well. I
             | think this has much more potential than the STT-LLM-TTS
             | architecture commonly used today (latency, quality)
        
             | liamwire wrote:
             | Excuse the pedantry; for those reading, it's RLHF rather
             | than RHLF.
        
             | kadushka wrote:
             | I see a huge progress made since the first gpt-4 release.
             | The reliability of answers has improved an order of
             | magnitude. Two years ago, more than half of my questions
             | resulted in incorrect or partially correct answers (most of
             | my queries are about complicated software algorithms or phd
             | level research brainstorming). A simple "are you sure"
             | prompt would force the model to admit it was wrong most of
             | the time. Now with o1 this almost never happens and the
             | model seems to be smarter or at least more capable than me
             | - in general. GPT-4 was a bright high school student. o1 is
             | a postdoc.
        
         | adamgordonbell wrote:
         | Perhaps it is a distilled 4.5, or based on its lineage, as
         | some suggested.
        
         | zitterbewegung wrote:
         | I think that people balked at the cost of 4.5 and really wanted
         | just a slightly more improved 4o. Now it almost seems that they
         | will have separate product lines for non-chain-of-thought and
         | chain-of-thought models, which actually makes sense because
         | some want a cheap model and some don't.
        
         | freehorse wrote:
         | > Deprecating GPT-4.5 less than 2 months after introducing it
         | also seems unlikely to be the original plan.
         | 
         | Well, they actually already hinted at possible deprecation in
         | their initial announcement of GPT-4.5 [0]. Also, as others
         | said, this model was already offered in the API as
         | chatgpt-latest, but there was no checkpoint, which made it
         | unreliable for actual use.
         | 
         | [0] https://openai.com/index/introducing-
         | gpt-4-5/#:~:text=we%E2%...
        
         | resource_waste wrote:
         | When I saw them say 'no more non COT models', I was minorly
         | panicked.
         | 
         | While their competitors have made fantastic models, at the time
         | I perceived ChatGPT4 was the best model for many applications.
         | COT was often tricked by my prompts, assuming things to be
         | true, when a non-COT model would say something like 'That isn't
         | necessarily the case'.
         | 
         | I use both COT and non when I have an important problem.
         | 
         | Seeing them keep a non-COT model around is a good idea.
        
       | gcy wrote:
       | 4.10 > 4.5 -- @stevenheidel
       | 
       | @sama: underrated tweet
       | 
       | Source: https://x.com/stevenheidel/status/1911833398588719274
        
         | wongarsu wrote:
         | Too bad OpenAI named it 4.1 instead of 4.10. You can either
         | claim 4.10 > 4.5 (the dots separate natural numbers) or 4.1 ==
         | 4.10 (they are decimal numbers), but you can't have both at
         | once
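         | 
         | The two readings, in a couple of lines of Python:
         | 
         |     print(tuple(map(int, "4.10".split("."))) >
         |           tuple(map(int, "4.5".split("."))))  # True: (4, 10) > (4, 5)
         |     print(4.1 == 4.10)                        # True: same number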
        
         | stevenheidel wrote:
         | so true
        
       | furyofantares wrote:
       | It's another Daft Punk day. Change a string in your program* and
       | it's better, faster, cheaper: pick 3.
       | 
       | *Then fix all your prompts over the next two weeks.
        
       | wongarsu wrote:
       | Is the version number a retcon of 4.5? On OpenAI's models page
       | the names appear completely reasonable [1]: The o1 and o3
       | reasoning models, and non-reasoning there is 3.5, 4, 4o and 4.1
       | (let's pretend 4o makes sense). But that is only reasonable as
       | long as we pretend 4.5 never happened, which the models page
       | apparently does
       | 
       | 1: https://platform.openai.com/docs/models
        
       | esafak wrote:
       | More information here:
       | https://platform.openai.com/docs/models/gpt-4.1
       | https://platform.openai.com/docs/models/gpt-4.1-mini
       | https://platform.openai.com/docs/models/gpt-4.1-nano
        
       | LeicaLatte wrote:
       | i've recently set claude 3.7 as the default option for customers
       | when they start new chats in my app. this was a recent change,
       | and i'm feeling good about it. supporting multiple providers can
       | be a nightmare for customer service, especially when it comes to
       | billing and handling response quality queries. with so many
       | choices from just one provider, it simplifies things
       | significantly. curious about how openai manages customer service
       | internally.
        
       | bbstats wrote:
       | ok.
        
       | XCSme wrote:
       | I tried 4.1-mini and 4.1-nano. The responses are a lot faster,
       | but for my use case they seem to be a lot worse than 4o-mini
       | (they fail to complete the task when 4o-mini could do it). Maybe
       | I have
       | to update my prompts...
        
         | XCSme wrote:
         | Even after updating my prompts, 4o-mini still seems to do
         | better than 4.1-mini or 4.1-nano for a data-processing task.
        
           | BOOSTERHIDROGEN wrote:
           | Mind sharing your system prompt?
        
             | XCSme wrote:
             | It's quite complex, but the task is to parse some HTML
             | content, or to choose from a list of URLs which one is the
             | best.
             | 
             | I will check again the prompt, maybe 4o-mini ignores some
             | instructions that 4.1 doesn't (instructions which might
             | result in the LLM returning zero data).
        
       | pbmango wrote:
       | I think an underappreciated reality is that all of the large AI
       | labs, and OpenAI in particular, are fighting multiple market
       | battles at once. This is coming across in both the number of
       | products and the packaging.
       | 
       | 1. To win consumer growth they have continued to benefit from
       | hyper-viral moments; lately that was image generation in 4o,
       | which likely was technically possible a long time before it
       | launched.
       | 
       | 2. For enterprise workloads and large API use they seem to have
       | focused less lately, but the pricing of 4.1 is clearly an answer
       | to Gemini, which has been winning on ultra-high volume and
       | consistency.
       | 
       | 3. For full frontier benchmarks they pushed out 4.5 to stay SOTA
       | and attract the best researchers.
       | 
       | 4. On top of all that, they had to, and did, quickly answer the
       | reasoning promise and DeepSeek threat with faster and cheaper o
       | models.
       | 
       | They are still winning many of these battles but history
       | highlights how hard multi front warfare is, at least for teams of
       | humans.
        
         | spiderfarmer wrote:
         | On that note, I want to see benchmarks for which LLMs are best
         | at translating between languages. To me, it's an entire product
         | category.
        
           | pbmango wrote:
           | There are probably many more small battles being fought or
           | emerging. I think voice and PDF parsing are growing battles
           | too.
        
         | kristianp wrote:
         | I agree. 4.1 seems to be a release that addresses shortcomings
         | of 4o in coding compared to Claude 3.7 and Gemini 2.0 and 2.5
        
       | pastureofplenty wrote:
       | The plagiarism machine got an update! Yay!
        
       | archeantus wrote:
       | "GPT-4.1 scores 54.6% on SWE-bench Verified, improving by
       | 21.4%abs over GPT-4o and 26.6%abs over GPT-4.5--making it a
       | leading model for coding."
       | 
       | 4.1 is 26.6% better at coding than 4.5. Got it. Also...see the em
       | dash
        
         | drexlspivey wrote:
         | Should have named it 4.10
        
         | pdabbadabba wrote:
         | What's wrong with the em-dash? That's just...the
         | typographically correct dash AFAIK.
        
       | sharkjacobs wrote:
       | > You're eligible for free daily usage on traffic shared with
       | OpenAI through April 30, 2025.
       | 
       | > Up to 1 million tokens per day across gpt-4.5-preview, gpt-4.1,
       | gpt-4o and o1
       | 
       | > Up to 10 million tokens per day across gpt-4.1-mini,
       | gpt-4.1-nano, gpt-4o-mini, o1-mini and o3-mini
       | 
       | > Usage beyond these limits, as well as usage for other models,
       | will be billed at standard rates. Some limitations apply.
       | 
       | I just found this option in
       | https://platform.openai.com/settings/organization/data-contr...
       | 
       | Is just this something I haven't noticed before? Or is this new?
        
         | XCSme wrote:
         | So, that's like $10/day to give all your data/prompts?
        
         | sacrosaunt wrote:
         | Not new, launched in December 2024.
         | https://community.openai.com/t/free-tokens-on-traffic-shared...
        
       | __mharrison__ wrote:
       | I know this is somewhat off topic, but can someone explain the
       | naming convention used by OpenAI? Number vs "mini" vs "o" vs
       | "turbo" vs "chat"?
        
         | iteratethis wrote:
         | Mini refers to the size of the model (fewer parameters).
         | 
         | "o" means "omni", which means it's multimodal.
        
       | kristianp wrote:
       | Looks like the Quasar and Optimus stealth models on Openrouter
       | were in fact GPT-4.1. This is what I get when I try to access the
       | openrouter/optimus-alpha model now:
       | 
       |     {"error": {"message": "Quasar and Optimus were stealth
       |     models, and revealed on April 14th as early testing
       |     versions of GPT 4.1. Check it out:
       |     https://openrouter.ai/openai/gpt-4.1", "code": 404}}
        
       | lxgr wrote:
       | As a ChatGPT user, I'm weirdly happy that it's not available
       | there yet. I already have to make a conscious choice between
       | 
       | - 4o (can search the web, use Canvas, evaluate Python server-
       | side, generate images, but has no chain of thought)
       | 
       | - o3-mini (web search, CoT, canvas, but no image generation)
       | 
       | - o1 (CoT, maybe better than o3, but no canvas or web search and
       | also no images)
       | 
       | - Deep Research (very powerful, but I have only 10 attempts per
       | month, so I end up using roughly zero)
       | 
       | - 4.5 (better in creative writing, and probably warmer sound
       | thanks to being vinyl based and using analog tube amplifiers, but
       | slower and request limited, and I don't even know which of the
       | other features it supports)
       | 
       | - 4o "with scheduled tasks" (why on earth is that a model and not
       | a tool that the other models can use!?)
       | 
       | Why do I have to figure all of this out myself?
        
         | fragmede wrote:
         | What's hilarious to me is that I asked ChatGPT about the model
         | names and approaches, and it did a better job than they have.
        
         | resters wrote:
         | I use them as follows:
         | 
         | o1-pro: anything important involving accuracy or reasoning.
         | Does the best at accomplishing things correctly in one go even
         | with lots of context.
         | 
         | deepseek R1: anything where I want high quality non-academic
         | prose or poetry. Hands down the best model for these. Also very
         | solid for fast and interesting analytical takes. I love
         | bouncing ideas around with R1 and Grok-3 bc of their fast
         | responses and reasoning. I think R1 is the most creative yet
         | also the best at mimicking prose styles and tone. I've
         | speculated that Grok-3 is R1 with mods and think it's
         | reasonably likely.
         | 
         | 4o: image generation, occasionally something else but never for
         | code or analysis. Can't wait till it can generate accurate
         | technical diagrams from text.
         | 
         | o3-mini-high and grok-3: code or analysis that I don't want to
         | wait for o1-pro to complete.
         | 
         | claude 3.7: occasionally for code if the other models are
         | making lots of errors. Sometimes models will anchor to outdated
         | information in spite of being informed of newer information.
         | 
         | gemini models: occasionally I test to see if they are
         | competitive, so far not really, though I sense they are good at
         | certain things. Excited to try 2.5 Deep Research more, as it
         | seems promising.
         | 
         | Perplexity: discontinued subscription once the search
         | functionality in other models improved.
         | 
         | I'm really looking forward to o3-pro. Let's hope it's available
         | soon as there are some things I'm working on that are on hold
         | waiting for it.
        
           | motoboi wrote:
           | You probably know this but it can already generate accurate
           | diagrams. Just ask for the output in a diagram language like
           | mermaid or graphviz
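           | 
           | For what it's worth, a minimal sketch of that kind of
           | request with the official openai Python client (the prompt
           | text here is just an illustration):
           | 
           |     from openai import OpenAI
           | 
           |     client = OpenAI()  # reads OPENAI_API_KEY from the env
           |     resp = client.chat.completions.create(
           |         model="gpt-4.1",
           |         messages=[
           |             {"role": "system",
           |              "content": "Reply with Mermaid source only."},
           |             {"role": "user",
           |              "content": "Diagram a login flow: client -> "
           |                         "API -> auth service -> DB."},
           |         ],
           |     )
           |     print(resp.choices[0].message.content)
           | 
           | Paste the output into any Mermaid renderer to check the
           | layout.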
        
             | bangaladore wrote:
             | My experience is that it often produces terrible diagrams:
             | things clearly overlap, lines make no sense. I'm not
             | surprised; if you told me to lay out a diagram in XML/YAML,
             | there would be obvious mistakes and layout issues.
             | 
             | I'm not really certain a text output model can ever do well
             | here.
        
               | resters wrote:
               | FWIW I think a multimodal model could be trained to do
               | extremely well with it given sufficient training data. A
               | combination of textual description of the system and/or
               | diagram, source code (mermaid, SVG, etc.) for the
               | diagram, and the resulting image, with training to
               | translate between all three.
        
               | bangaladore wrote:
               | Agreed. Even with a simple setup, I'm sure a service like
               | this already exists (or could easily exist) where the
               | workflow is something like:
               | 
               | 1. User provides information
               | 
               | 2. LLM generates structured output for whatever modeling
               | language
               | 
               | 3. The same or another multimodal LLM reviews the
               | generated graph for styling/positioning issues and
               | ensures it matches the user request.
               | 
               | 4. LLM generates structured output based on the feedback.
               | 
               | 5. etc...
               | 
               | But you could probably fine-tune a multimodal model to do
               | it in one shot, or way more effectively.
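               | 
               | As a rough sketch of that loop (reviewing the Mermaid
               | source itself rather than a rendered image, to keep it
               | self-contained; the prompts, retry count, and the ask
               | helper are just illustrative):
               | 
               |     from openai import OpenAI
               | 
               |     client = OpenAI()
               | 
               |     def ask(prompt):
               |         r = client.chat.completions.create(
               |             model="gpt-4.1",
               |             messages=[{"role": "user",
               |                        "content": prompt}])
               |         return r.choices[0].message.content
               | 
               |     spec = "login flow: browser -> API -> auth -> DB"
               |     diagram = ask("Write Mermaid source for: " + spec)
               | 
               |     for _ in range(3):  # a few review rounds
               |         review = ask("Review this Mermaid for layout "
               |                      "problems. Reply OK if fine, "
               |                      "else list fixes:\n" + diagram)
               |         if review.strip() == "OK":
               |             break
               |         diagram = ask("Apply these fixes:\n" + review +
               |                       "\n\nto this Mermaid:\n" + diagram)
               | 
               |     print(diagram)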
        
             | resters wrote:
             | I've had mixed and inconsistent results and it hasn't been
             | able to iterate effectively when it gets close. Could be
             | that I need to refine my approach to prompting. I've tried
             | mermaid and SVG mostly, but will also try graphviz based on
             | your suggestion.
        
           | shortcord wrote:
           | Gemini 2.5 Pro is quite good at code.
           | 
           | It has become my go-to for use in Cursor. Claude 3.7 needs to
           | be restrained too much.
        
         | throwup238 wrote:
         | _> - Deep Research (very powerful, but I have only 10 attempts
         | per month, so I end up using roughly zero)_
         | 
         | Same here, which is a real shame. I've switched to DeepResearch
         | with Gemini 2.5 Pro over the last few days where paid users
         | have a 20/day limit instead of 10/month, and it's been great,
         | especially since Gemini now seems to browse 10x more
         | pages than OpenAI Deep Research (on the order of 200-400 pages
         | versus 20-40).
         | 
         | The reports are too verbose but having it research random
         | development ideas, or how to do something particularly complex
         | with a specific library, or different approaches or
         | architectures to a problem has been very productive without
         | sliding into vibe coding territory.
        
         | cafeinux wrote:
         | > 4.5 (better in creative writing, and probably warmer sound
         | thanks to being vinyl based and using analog tube amplifiers,
         | but slower and request limited, and I don't even know which of
         | the other features it supports)
         | 
         | Is that an LLM hallucination?
        
       | htrp wrote:
       | anyone want to guess parameter sizes here for
       | 
       | GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano?
       | 
       | I'll start with
       | 
       | 800 bn MoE (probably 120 bn activated), 200 bn MoE (33 bn
       | activated), and 7 bn parameters for nano
        
       | i_love_retros wrote:
       | I feel overwhelmed
        
       | Ninjinka wrote:
       | I've been using it in Cursor for the past few hours and prefer it
       | to Sonnet 3.7. It's much faster and doesn't seem to make the sort
       | of stupid mistakes Sonnet has been making recently.
        
       | omneity wrote:
       | I have been trying GPT-4.1 for a few hours by now through Cursor
       | on a fairly complicated code base. For reference, my gold
       | standard for a coding agent is Claude Sonnet 3.7 despite its
       | tendency to diverge and lose focus.
       | 
       | My takeaways:
       | 
       | - This is the first model from OpenAI that feels relatively
       | agentic to me (o3-mini sucks at tool use, 4o just sucks). It
       | seems to be able to piece together several tools to reach the
       | desired goal and follows a roughly coherent plan.
       | 
       | - There is still more work to do here. Despite OpenAI's
       | cookbook[0] and some prompt engineering on my side, GPT-4.1 stops
       | quickly to ask questions, getting into a quite useless "convo
       | mode". Its tool calls also fail way too often, in my opinion.
       | 
       | - It's also able to handle significantly less complexity than
       | Claude, resulting in some comical failures. Where Claude would
       | create server endpoints, frontend components and routes and
       | connect the two, GPT-4.1 creates simplistic UI that calls a mock
       | API despite explicit instructions. When prompted to fix it, it
       | went haywire and couldn't handle the multiple scopes involved in
       | that test app.
       | 
       | - With that said, within all these parameters, it's much less
       | unnerving than Claude and it sticks to the request, as long as
       | the request is not too complex.
       | 
       | My conclusion: I like it, and I totally see where it shines:
       | narrow, targeted work. It complements Claude 3.7 for creative
       | work and Gemini 2.5 Pro for deep, complex tasks. GPT-4.1 does
       | feel like a smaller model compared to those last two, but maybe
       | I just need to use it for longer.
       | 
       | 0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide
        
         | ttul wrote:
         | I feel the same way about these models as you conclude. Gemini
         | 2.5 is where I paste whole projects for major refactoring
         | efforts or building big new bits of functionality. Claude 3.7
         | is great for most day-to-day edits. And 4.1 is okay for small
         | things.
         | 
         | I hope they release a distillation of 4.5 that uses the same
         | training approach; that might be a pretty decent model.
        
       | bli940505 wrote:
       | Does this mean that the o1 and o3-mini models are also using 4.1
       | as the base now?
        
       ___________________________________________________________________
       (page generated 2025-04-14 23:00 UTC)