[HN Gopher] FrontierCode
___________________________________________________________________
FrontierCode
Author : streamer45
Score : 141 points
Date : 2026-06-08 20:45 UTC (9 hours ago)
(HTM) web link (cognition.ai)
(TXT) w3m dump (cognition.ai)
| swyx wrote:
| :wave: i was on the team! AMA.
|
| some headlines
|
| - 3000 rubrics on code quality. First benchmark to measure:
| "would this code get actually merged?"
|
| - 20+ expert open-source maintainer created tasks on their own
| repos to capture their opinion & taste.
|
| - total 1000+ hours of real life software maintainer work
| captured in dataset. ON TOP of that, 40+ hours of real human work
| to turn that real life work into well validated and structured
| tasks with rubrics (even more work to turn tasks/prompts from
| devin-infra-specific to pluggable coding agent)
|
| - results in 81% lower false positive rate than SWE-Bench Pro
|
| - High quality bar: many QA stages & each task manually reviewed
| by Cognition researchers (examples in post)
|
| Opus 4.8 scores 13% on FrontierCode Diamond.
|
| one of my goals was also to datamine interesting stuff even on
| the easy tasks. for example, if you squint you can see the answer
| to "WTF Happened in late 2025" with coding models:
| https://x.com/swyx/status/2064081945567580323
| great_psy wrote:
| How do you measure quality at scale? Is there another model
| that determines if it adheres to codebase standards?
| swyx wrote:
| see Beyond Unit Tests and Novel Grading Methods in TFA.
|
| i think something like ~60% llm as judge rubrics and the rest
| as described. every rubric validated by maintainer. 3000
| rubrics
| tedsanders wrote:
| Very cool! So glad to see people building and sharing evals
| that are better than SWE bench.
|
| I'm curious - any particular reason you didn't put error bars
| on the graphs? Seems like it could be helpful when there are
| only 50 unique problems in the diamond set.
| swyx wrote:
| *50 unique problems but 20-40 rubrics per problem (something
| I had to keep reminding people internally who were
| unimpressed with the N)
|
| simple answer is our reporting was pass@5. feel like you'd
| need like 50+ runs to have reasonable confidence intervals,
| which somehow i dont see other people do, so i also didnt
| insist on it.
|
| hoping to work with <prominent third party evals shop> to get
| this on their infra and evaluated along with whatever the
| industry standard is.
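
For readers unfamiliar with the term, here is a minimal sketch of the commonly used unbiased pass@k estimator. The thread does not say how pass@5 was actually computed for FrontierCode, so this is an illustration of the general metric only; the function name and the sample counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts of which c passed.

    This is the usual combinatorial estimator: one minus the probability
    that a random size-k subset of the n attempts contains no passing one.
    """
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With only 5 runs, pass@5 is all-or-nothing per task.
print(pass_at_k(n=5, c=1, k=5))
print(pass_at_k(n=5, c=0, k=5))
# With 50 runs (hypothetical), pass@5 becomes a graded estimate.
print(pass_at_k(n=50, c=5, k=5))
```

With n equal to k, the per-task value can only be 0 or 1, which is one reason more runs are needed before run-to-run confidence intervals become informative.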
| tedsanders wrote:
| Makes sense, thanks. I suppose error bars are tricky if
| trying to handle problem-to-problem variance, rubric-to-
| rubric variance, and run-to-run variance all at once.
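
One simple way to get error bars for a 50-problem set is a percentile bootstrap over problems. The sketch below is an assumption-laden illustration, not the benchmark's method: it only captures problem-to-problem variance, it treats each problem's score as already averaged over runs and rubrics, and the scores are synthetic.

```python
import random
import statistics


def bootstrap_ci(per_problem_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling problems with replacement."""
    rng = random.Random(seed)
    n = len(per_problem_scores)
    means = []
    for _ in range(n_resamples):
        sample = [per_problem_scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(per_problem_scores), lo, hi


# Synthetic data: 50 problems, each solved with probability 0.13 (hypothetical).
data_rng = random.Random(1)
scores = [1.0 if data_rng.random() < 0.13 else 0.0 for _ in range(50)]
mean, lo, hi = bootstrap_ci(scores)
print(f"mean={mean:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Handling rubric-to-rubric and run-to-run variance as well would need a hierarchical resampling scheme (resample problems, then runs and rubrics within each problem), which is where the trickiness mentioned above comes in.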
| typs wrote:
| What did you do around cross-harness testing? I don't see
| anything in the blog post about what harnesses were used in
| evaluation. SOTA benchmarks have consistently shown that
| frontier model performance is quite sensitive to what tools are
| exposed (e.g. str_replace vs. apply_patch) as the labs are
| RLing on their own harnesses. Did you do testing of the models
| in a standard setup or in their native harnesses?
| swyx wrote:
| yes well aware :) numbers shown are on "house" harnesses eg
| codex with gpt and claude code with opus.
|
| fwiw we have examples of each model doing better on NON-house
| harnesses too - speaking just for myself i think the "the
| labs are RLing on their own harnesses" narrative is kinda
| overstated if you think through wanting to have any
| meaningful api business (often eg the labs will give guidance
| on what is preferred and the agent labs can easily match tool
| contract to that, which is to say, the "home turf advantage"
| isnt as large as you think it is if you try a little bit)
| chris_st wrote:
| What "non-house" harnesses have you found to work best?
| Bolwin wrote:
| What is the "house" harness for minimax? They haven't
| released any
| glerk wrote:
| This looks really great, more thoughtful than any benchmark
| that I've seen until now!
|
| I'm curious if you're only interested in scoring frontier
| models or you would accept submission from custom harnesses? I
| am working on multi-model harnesses and would love to test them
| against your benchmark. Do you plan on releasing the tasks
| publicly?
| swyx wrote:
| > Do you plan on releasing the tasks publicly?
|
| yep
| fouc wrote:
| I'm a bit disappointed that Opus 4.6 wasn't in this because the
| tokenizer changed quite a bit from 4.7 onward. I was so annoyed
| by 4.7 that I've been forcing 4.6 ever since. I've been annoyed
| by 4.8 a bit too, so I haven't felt the urge to move on.
| singpolyma3 wrote:
| Since no one knows or can agree on what "code quality" is and we
| can't measure it for human output, I'm dubious about measuring it
| for LLMs
| kube-system wrote:
| You don't need universal consensus to measure something. There
| are many good quality measures of code quality.
| einpoklum wrote:
| > Today's coding benchmarks have established that models can
| write correct code.
|
| I wouldn't say that.
|
| > But as AI-generated code becomes the dominant path to
| production
|
| I really hope that's not the case.
| zakisaad wrote:
| How do you define "correct" code?
| newsicanuse wrote:
| The code that gets stuff done instead of beating around the
| bush making unexpected errors
| vanuatu wrote:
| i suspect this is highly dependent on what you're working
| on
|
| from my experience if you give the models a way to self-
| verify correctness they succeed basically 100% of the time
| vessenes wrote:
| This looks great. Well reasoned, tons of work put into eval,
| thanks for building it.
|
| It strikes me as kind of wild that good evals can drive tens to
| hundreds of millions of dollars of compute deployment in the wild
| -- there's something new and collaborative and competitive about
| the eval / frontier model race that's quite interesting.
|
| In this case "shorter actually mergeable patches that open source
| maintainers would accept" feels like a great thing to deliver to
| the world.
|
| I didn't deep dive into good and bad patches, but I wonder if
| swyx or others on the team have predictions on saturation. Both
| when, and how useful will it be? That is, do you guys think this
| test is broad enough as written to get better behavior out of
| models, and if there is saturation on this test, will we see
| generalized better patch / coding behavior?
| swyx wrote:
| thanks - credit to silas, eric, ben, and team for the depth of
| the evals, and the rest of the research team for doing the
| transcript reading parties lol
|
| by nature of being based on open source, frontiercode public
| will saturate very very quickly. frontiercode main will be >80%
| in less than a year. hopefully diamond will last a bit longer.
| we can do annual refreshes, but thats not my strategy for staying
| relevant - what i'm more excited to get funding for is private
| held out version of frontiercode based on repros of real
| enterprise customer problems. in an ideal agent lab
| (https://latent.space/p/agent-labs) you meticulously build up
| this domain understanding and that is essentially why both
| model labs and serious customers come to you.
| Topfi wrote:
| Great effort and a bit closer to my private evals than DeepSWE. I
| greatly appreciate the focus on false negatives and positives,
| along with simply being far more focused on actual, mergeable
| quality output over plain passing. Could see a lot of others
| adopt your list of metrics as a basis, they are very well defined
| and solid coverage of everything one should want out of code
| provided, not just focused on one or two narrow targets. Will
| incorporate a lot of these ideas in my own tests and polish some
| other parts where I somewhat unintentionally already went into a
| roughly similar direction.
| ilaksh wrote:
| Is there anything we can download? Did they test GLM 5.1?
| nullbio wrote:
| This isn't a fair way to chart this:
|
| "Each model is run 5 times at every available reasoning effort.
| For each effort, we average the metric across the 5 trials, then
| report each model's score at its best performing reasoning
| level."
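
The quoted aggregation rule can be sketched as below. This is a minimal illustration under assumptions: the function name, the effort labels, and the per-trial scores are all hypothetical, and the real pipeline may differ in details the quote does not cover.

```python
from statistics import fmean


def best_effort_score(trials_by_effort):
    """Average the trials within each effort level, then report the best level."""
    means = {effort: fmean(trials) for effort, trials in trials_by_effort.items()}
    best = max(means, key=means.get)
    return best, means[best]


# Hypothetical per-trial scores (fraction of tasks passed), five trials per effort.
example = {
    "low": [0.08, 0.09, 0.07, 0.08, 0.09],
    "medium": [0.06, 0.05, 0.07, 0.06, 0.06],
    "high": [0.12, 0.14, 0.13, 0.12, 0.14],
}
effort, score = best_effort_score(example)
print(effort, round(score, 3))
```

Note that the effort labels are whatever each vendor exposes, so the rule selects a maximum over settings that are not necessarily comparable across models.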
|
| For example, Anthropic's "medium" might involve 3x the amount of
| thinking and take 5x as long as OpenAI's idea of "medium". So now
| you've skewed all the results. It assumes that they're linear and
| equivalent ranges.
|
| You should compare apples to apples. Weight them in a way that
| factors in total task completion time as the measure of "effort",
| not the arbitrary effort settings provided by the AI company. I
| don't care what the underlying effort level is, I care which
| model out of multiple, if running for the same amount of time,
| completes my task to a more accurate degree. Total token
| consumption would also be another thing to consider as well, to
| rule out TPS. But generally, if the goal is ultimate
| productivity, the main factor is what does it faster. If cost is
| a concern factor then token count+speed, or token count alone, is
| the main factor.
|
| The second chart paints a clearer picture, though: GPT 5.5
| xhigh gets 44.7% at 21k tokens, and Opus 4.8 max gets 49.9% at
| 75k tokens. So basically, roughly 3.6x the tokens from Opus 4.8
| bought an increase of 5.2 percentage points. If you were to loop
| GPT 5.5 xhigh over the same set of tasks an extra 4x, would it
| surpass the 49.9%? That's the real question here. And I'd wager
| it probably would.
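
The arithmetic behind that comparison, worked out with only the figures quoted in the comment above (the "points per 1k tokens" ratio is a crude illustrative metric introduced here, not something the benchmark reports):

```python
# Figures as quoted in the comment: score (%) and token count.
gpt_score, gpt_tokens = 44.7, 21_000      # GPT 5.5 xhigh
opus_score, opus_tokens = 49.9, 75_000    # Opus 4.8 max

token_ratio = opus_tokens / gpt_tokens    # how many times more tokens Opus used
gain_points = opus_score - gpt_score      # absolute gain in percentage points
relative_gain = gain_points / gpt_score   # gain relative to the GPT score

# Crude efficiency measure: score points per 1k tokens.
points_per_1k = {
    "GPT 5.5 xhigh": gpt_score / (gpt_tokens / 1000),
    "Opus 4.8 max": opus_score / (opus_tokens / 1000),
}

print(f"token ratio: {token_ratio:.2f}x")
print(f"gain: {gain_points:.1f} points ({relative_gain:.1%} relative)")
for name, value in points_per_1k.items():
    print(f"{name}: {value:.2f} points per 1k tokens")
```

This says nothing about whether spending the same extra tokens on GPT 5.5 would close the gap; that remains the commenter's conjecture.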
|
| But the framing of this whole thing makes it sound like Opus has
| some massive lead. In reality though, it just loops harder and
| consumes more tokens. Their effort levels are not equivalent.
|
| Now take this even further, and emulate what Anthropic is likely
| doing behind the scenes. Running the prompt through multiple
| prompts and converging on the end result. Give GPT four generic
| skills that cover different aspects of the benchmark in a general
| way. Run it 4x to get that same token count usage, and use each
| of those different skills for each one. Now what is the result?
| I'd wager it blows Opus out of the water.
|
| The end result is this: Anthropic gives you all of the bloat in a
| single, slow package. GPT gives you the ability to build your own
| equivalent harness. I'd much rather have the freedom and
| flexibility to do it myself.
|
| Once people actually focus on building strong harnesses around
| open-source, we'll have models that are competing at the same
| level as the closed labs. Especially now that we have models like
| Nemotron 3 Ultra. But it involves a lot of clever approaches,
| like using small fast models to help with routing and determining
| what "skills" and prompts to load, using static analysis, local
| tools and vector databases. Using a pipeline of all of the
| specialized, fast, small models to handle the various aspects of
| the specific task in a cooperative tree. The amount of
| underutilized specialized AI models out there is insane, no one
| seems to be building harnesses around them. Things like semantic
| code duplication detection for example. We don't need to be using
| the big model to do everything, the big model should be the
| orchestrator of all of the tools and little models.
|
| This is why the big labs have a lead that no one seems to be able
| to crack, because they're not just building a model and calling
| it a day, they utilize all of these other approaches on top of
| the big model. Now that we have strong open source models, we can
| start building these things too.
| edg5000 wrote:
| Wow, looks like you've found a massive flaw indeed.
|
| I was skeptical about the results because in my experience both
| recent GPT and Opus models are strong. Everything else is B or
| C tier. This is just artisanal vibe testing though. It's very
| hard to eval them properly.
| 2001zhaozhao wrote:
| You know that it's an honest benchmark when their own model
| (SWE-1.6) scores terribly on it.
| Magniquick wrote:
| Opus 4.8 low at 8.2% while medium at 5.9% is definitely an
| interesting result, to say the least.
| twotwotwo wrote:
| I'm liking the effort to make new, no-longer-saturated
| benchmarks. I'll also be a bit suspicious if some model aces it
| -- matching OSS maintainers' taste _more often_ is a plausible
| improvement in quality but if they nail it _every time_ they've
| been memorizing.
|
| Not saying FrontierCode should've done this, but benchmarking
| _the interaction_ would be interesting. That is, if I get a diff
| with a blocking problem but writing a comment gets it fixed, that's
| a _lot_ different from if the model has hit a wall. Better, if
| there's a problem but the model flagged it in a short list of
| questions or worries to me before or after coding, it can get
| sorted without taking much of my time. Stick an LLM in the loop
| instructed to behave like a user or reviewer with some rubric-ish
| info that wasn't in the prompt. Then, look at how much the
| pretend user has to do to get to a quality result with a given
| model, if they can get to one at all.
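
The reviewer-in-the-loop idea above could be prototyped roughly like this. Everything here is hypothetical: `agent` and `reviewer` stand in for LLM calls, and the toy implementations exist only so the loop runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    accepted: bool
    review_rounds: int  # reviewer comments the agent got a chance to act on


def run_interaction(
    agent: Callable[[str, Optional[str]], str],
    reviewer: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> Outcome:
    """Alternate agent and reviewer until the reviewer has no comment or rounds run out."""
    feedback: Optional[str] = None
    for rounds in range(max_rounds + 1):
        patch = agent(task, feedback)
        feedback = reviewer(patch)  # None means "would merge as-is"
        if feedback is None:
            return Outcome(accepted=True, review_rounds=rounds)
    return Outcome(accepted=False, review_rounds=max_rounds)


# Toy stand-ins: the agent adds tests only after being asked to.
def toy_agent(task: str, feedback: Optional[str]) -> str:
    return "patch with tests" if feedback else "patch"


def toy_reviewer(patch: str) -> Optional[str]:
    return None if "tests" in patch else "please add tests"


result = run_interaction(toy_agent, toy_reviewer, "fix bug")
print(result)
```

Scoring a model would then mean comparing `review_rounds` (or reviewer tokens spent) across models at a fixed acceptance bar, rather than only the first-attempt pass rate.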
|
| You could say 'why worry about interaction? the goal is the model
| just gets it perfect' but I think that imagined end state just is
| not a thing: tasks will get bigger but there will still be
| interaction. Handling comments and asking good clarifying
| questions when needed are real capabilities. Human SWEs interact
| plenty and real engineering has a certain density of questions
| about requirements, taste, and other big vague things.
___________________________________________________________________
(page generated 2026-06-09 06:00 UTC)