
       [HN Gopher] FrontierCode
       ___________________________________________________________________
        
       FrontierCode
        
       Author : streamer45
       Score  : 141 points
       Date   : 2026-06-08 20:45 UTC (9 hours ago)
        
 (HTM) web link (cognition.ai)
 (TXT) w3m dump (cognition.ai)
        
       | swyx wrote:
       | :wave: i was on the team! AMA.
       | 
       | some headlines
       | 
       | - 3000 rubrics on code quality. First benchmark to measure:
       | "would this code actually get merged?"
       | 
       | - 20+ expert open-source maintainers created tasks on their own
       | repos to capture their opinion & taste.
       | 
       | - total 1000+ hours of real life software maintainer work
       | captured in dataset. ON TOP of that, 40+ hours of real human work
       | to turn that real life work into well validated and structured
       | tasks with rubrics (even more work to turn tasks/prompts from
       | devin-infra-specific to pluggable coding agent)
       | 
       | - results in 81% lower false positive rate than SWE-Bench Pro
       | 
       | - High quality bar: many QA stages & each task manually reviewed
       | by Cognition researchers (examples in post)
       | 
       | Opus 4.8 scores 13% on FrontierCode Diamond.
       | 
       | one of my goals was also to datamine interesting stuff even on
       | the easy tasks. for example, if you squint you can see the answer
       | to "WTF Happened in late 2025" with coding models:
       | https://x.com/swyx/status/2064081945567580323
        
         | great_psy wrote:
         | How do you measure quality at scale? Is there another model
         | that determines if it adheres to codebase standards?
        
           | swyx wrote:
           | see Beyond Unit Tests and Novel Grading Methods in TFA.
           | 
           | i think something like ~60% llm as judge rubrics and the rest
           | as described. every rubric validated by maintainer. 3000
           | rubrics
        
         | tedsanders wrote:
         | Very cool! So glad to see people building and sharing evals
         | that are better than SWE bench.
         | 
         | I'm curious - any particular reason you didn't put error bars
         | on the graphs? Seems like it could be helpful when there are
         | only 50 unique problems in the diamond set.
        
           | swyx wrote:
           | *50 unique problems but 20-40 rubrics per problem (something
           | I had to keep reminding people internally who were
           | unimpressed with the N)
           | 
           | simple answer is our reporting was pass@5. feel like you'd
           | need like 50+ runs to have reasonable confidence intervals,
           | which somehow i dont see other people do, so i also didnt
           | insist on it.
           | 
           | hoping to work with <prominent third party evals shop> to get
           | this on their infra and evaluated along with whatever the
           | industry standard is.
        
             | tedsanders wrote:
             | Makes sense, thanks. I suppose error bars are tricky if
             | trying to handle problem-to-problem variance, rubric-to-
             | rubric variance, and run-to-run variance all at once.
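
One way to put error bars on a 50-problem set is a percentile bootstrap that resamples whole problems. The sketch below is only an illustration, not the benchmark's method: the per-problem scores are made-up data, `bootstrap_ci` is a hypothetical helper, and resampling problems captures problem-to-problem variance only (rubric-to-rubric and run-to-run variance would need their own resampling levels).

```python
import random
from statistics import mean


def bootstrap_ci(per_problem_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score, resampling whole problems.

    per_problem_scores: one score per problem (e.g. the mean over 5 runs).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(per_problem_scores)
    # Each resample draws n problems with replacement and records the mean.
    means = sorted(
        mean(rng.choices(per_problem_scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(per_problem_scores), lower, upper


# Hypothetical data: 50 problems, each scored in [0, 1]. Not real results.
data_rng = random.Random(1)
scores = [data_rng.choice([0.0, 0.0, 0.0, 0.2, 0.4, 1.0]) for _ in range(50)]

point, lower, upper = bootstrap_ci(scores)
print(f"mean={point:.3f}  95% CI=({lower:.3f}, {upper:.3f})")
```

With only 50 problems the interval tends to be wide, which is the concern raised in the question above.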
        
         | typs wrote:
         | What did you do around cross-harness testing? I don't see
         | anything in the blog post about what harnesses were used in
         | evaluation. SOTA benchmarks have consistently shown that
         | frontier model performance is quite sensitive to what tools are
         | exposed (e.g. str_replace vs. apply_patch) as the labs are
         | RLing on their own harnesses. Did you do testing of the models
         | in a standard setup or in their native harnesses?
        
           | swyx wrote:
           | yes well aware :) numbers shown are on "house" harnesses eg
           | codex with gpt and claude code with opus.
           | 
           | fwiw we have examples of each model doing better on NON-house
           | harnesses too - speaking just for myself i think the "the
           | labs are RLing on their own harnesses" narrative is kinda
           | overstated if you think through wanting to have any
           | meaningful api business (often eg the labs will give guidance
           | on what is preferred and the agent labs can easily match tool
           | contract to that, which is to say, the "home turf advantage"
           | isnt as large as you think it is if you try a little bit)
        
             | chris_st wrote:
             | What "non-house" harnesses have you found to work best?
        
             | Bolwin wrote:
             | What is the "house" harness for minimax? They haven't
             | released any
        
         | glerk wrote:
         | This looks really great, more thoughtful than any benchmark
         | that I've seen until now!
         | 
         | I'm curious if you're only interested in scoring frontier
         | models or you would accept submission from custom harnesses? I
         | am working on multi-model harnesses and would love to test them
         | against your benchmark. Do you plan on releasing the tasks
         | publicly?
        
           | swyx wrote:
           | > Do you plan on releasing the tasks publicly?
           | 
           | yep
        
         | fouc wrote:
         | I'm a bit disappointed that Opus 4.6 wasn't in this because the
         | tokenizer changed quite a bit from 4.7 onward. I was so annoyed
         | by 4.7 that I've been forcing 4.6 ever since. I've been annoyed
         | by 4.8 a bit too, so I haven't felt the urge to move on.
        
       | singpolyma3 wrote:
       | Since no one knows or can agree on what "code quality" is and we
       | can't measure it for human output, I'm dubious about measuring it
       | for LLMs
        
         | kube-system wrote:
         | You don't need universal consensus to measure something. There
         | are many good quality measures of code quality.
        
       | einpoklum wrote:
       | > Today's coding benchmarks have established that models can
       | write correct code.
       | 
       | I wouldn't say that.
       | 
       | > But as AI-generated code becomes the dominant path to
       | production
       | 
       | I really hope that's not the case.
        
         | zakisaad wrote:
         | How do you define "correct" code?
        
           | newsicanuse wrote:
           | The code that gets stuff done instead of beating around the
           | bush making unexpected errors
        
             | vanuatu wrote:
             | i suspect this is highly dependent on what you're working
             | on
             | 
             | from my experience if you give the models a way to self-
             | verify correctness they succeed basically 100% of the time
        
       | vessenes wrote:
       | This looks great. Well reasoned, tons of work put into eval,
       | thanks for building it.
       | 
       | It strikes me as kind of wild that good evals can drive tens to
       | hundreds of millions of dollars of compute deployment in the wild
       | -- there's something new and collaborative and competitive about
       | the eval / frontier model race that's quite interesting.
       | 
       | In this case "shorter actually mergeable patches that open source
       | maintainers would accept" feels like a great thing to deliver to
       | the world.
       | 
       | I didn't deep dive into good and bad patches, but I wonder if
       | swyx or others on the team have predictions on saturation. Both
       | when, and how useful will it be? That is, do you guys think this
       | test is broad enough as written to get better behavior out of
       | models, and if there is saturation on this test, will we see
       | generalized better patch / coding behavior?
        
         | swyx wrote:
         | thanks - credit to silas, eric, ben, and team for the depth of
         | the evals, and the rest of the research team for doing the
         | transcript reading parties lol
         | 
         | by nature of being based on open source, frontiercode public
         | will saturate very very quickly. frontiercode main will be >80%
         | in less than a year. hopefully diamond will last a bit longer.
         | we can do annual refreshes, but that's not my strategy for staying
         | relevant - what i'm more excited to get funding for is private
         | held out version of frontiercode based on repros of real
         | enterprise customer problems. in an ideal agent lab
         | (https://latent.space/p/agent-labs) you meticulously build up
         | this domain understanding and that is essentially why both
         | model labs and serious customers come to you.
        
       | Topfi wrote:
       | Great effort and a bit closer to my private evals than DeepSWE. I
       | greatly appreciate the focus on false negatives and positives,
       | along with simply being far more focused on actual, mergeable
       | quality output over plain passing. Could see a lot of others
       | adopt your list of metrics as a basis, they are very well defined
       | and solid coverage of everything one should want out of code
       | provided, not just focused on one or two narrow targets. Will
       | incorporate a lot of these ideas in my own tests and polish some
       | other parts where I somewhat unintentionally already went into a
       | roughly similar direction.
        
       | ilaksh wrote:
       | Is there anything we can download? Did they test GLM 5.1?
        
       | nullbio wrote:
       | This isn't a fair way to chart this:
       | 
       | "Each model is run 5 times at every available reasoning effort.
       | For each effort, we average the metric across the 5 trials, then
       | report each model's score at its best performing reasoning
       | level."
       | 
       | For example, Anthropic's "medium" might involve 3x the amount of
       | thinking and take 5x as long as OpenAI's idea of "medium". So now
       | you've skewed all the results. It assumes that they're linear and
       | equivalent ranges.
       | 
       | You should compare apples to apples. Weight them in a way that
       | factors in total task completion time as the measure of "effort",
       | not the arbitrary effort settings provided by the AI company. I
       | don't care what the underlying effort level is, I care which
       | model out of multiple, if running for the same amount of time,
       | completes my task to a more accurate degree. Total token
       | consumption would also be another thing to consider as well, to
       | rule out TPS. But generally, if the goal is ultimate
       | productivity, the main factor is what does it faster. If cost is
       | a concern then token count+speed, or token count alone, is
       | the main factor.
       | 
       | The second chart paints a clearer picture though: GPT 5.5 xhigh
       | gets 44.7% at 21k tokens, and Opus 4.8 max gets 49.9% at 75k
       | tokens. So basically, roughly 4x the tokens (about 3.6x) from
       | Opus 4.8 resulted in an increase of 5.2 percentage points. If you
       | were to loop GPT 5.5 xhigh over the same set of tasks, an extra
       | 4x, would it surpass the 49.9%? That's the real question here.
       | And I'd wager it probably would.
       | 
       | But the framing of this whole thing makes it sound like Opus has
       | some massive lead. In reality though, it just loops harder and
       | consumes more tokens. Their effort levels are not equivalent.
       | 
       | Now take this even further, and emulate what Anthropic is likely
       | doing behind the scenes. Running the prompt through multiple
       | prompts and converging on the end result. Give GPT four generic
       | skills that cover different aspects of the benchmark in a general
       | way. Run it 4x to get that same token count usage, and use each
       | of those different skills for each one. Now what is the result?
       | I'd wager it blows Opus out of the water.
       | 
       | The end result is this: Anthropic gives you all of the bloat in a
       | single, slow package. GPT gives you the ability to build your own
       | equivalent harness. I'd much rather have the freedom and
       | flexibility to do it myself.
       | 
       | Once people actually focus on building strong harnesses around
       | open-source, we'll have models that are competing at the same
       | level as the closed labs. Especially now that we have models like
       | Nemotron 3 Ultra. But it involves a lot of clever approaches,
       | like using small fast models to help with routing and determining
       | what "skills" and prompts to load, using static analysis, local
       | tools and vector databases. Using a pipeline of all of the
       | specialized, fast, small models to handle the various aspects of
       | the specific task in a cooperative tree. The amount of
       | underutilized specialized AI models out there is insane, no one
       | seems to be building harnesses around them. Things like semantic
       | code duplication detection for example. We don't need to be using
       | the big model to do everything, the big model should be the
       | orchestrator of all of the tools and little models.
       | 
       | This is why the big labs have a lead that no one seems to be able
       | to crack, because they're not just building a model and calling
       | it a day, they utilize all of these other approaches on top of
       | the big model. Now that we have strong open source models, we can
       | start building these things too.
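
The reporting rule quoted at the top of this comment (average 5 trials per effort, then report the best effort) and the token comparison that follows can be sketched as below. The trial scores are invented for illustration and `best_effort_score` is a hypothetical helper; only the 44.7% / 21k and 49.9% / 75k figures come from the comment itself.

```python
from statistics import mean


def best_effort_score(trials_by_effort):
    """Average each effort's trials, then return (effort, mean) for the best one."""
    averaged = {effort: mean(trials) for effort, trials in trials_by_effort.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


# Hypothetical trial scores (percent) for one model. Not real benchmark data.
trials = {
    "low": [8.0, 9.0, 7.0, 8.0, 9.0],
    "medium": [6.0, 5.0, 6.0, 7.0, 6.0],
    "high": [12.0, 14.0, 13.0, 12.0, 14.0],
}
print(best_effort_score(trials))

# Figures quoted in the comment: score (%) and token count.
gpt = {"score": 44.7, "tokens": 21_000}
opus = {"score": 49.9, "tokens": 75_000}

token_ratio = opus["tokens"] / gpt["tokens"]   # how many times more tokens
point_gain = opus["score"] - gpt["score"]      # gain in percentage points
print(f"token ratio: {token_ratio:.2f}x, gain: {point_gain:.1f} points")
```

Picking the best effort per model means each reported score sits at a different token budget, which is why the ratio and the gain are worth reading side by side.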
        
         | edg5000 wrote:
         | Wow, looks like you've found a massive flaw indeed.
         | 
         | I was skeptical about the results because in my experience both
         | recent GPT and Opus models are strong. Everything else is B or
         | C tier. This is just artisanal vibe testing though. It's very
         | hard to eval them properly.
        
       | 2001zhaozhao wrote:
       | You know that it's an honest benchmark when their own model
       | (SWE-1.6) scores terribly on it.
        
       | Magniquick wrote:
       | Opus 4.8 low at 8.2% while medium at 5.9% is definitely an
       | interesting result, to say the least.
        
       | twotwotwo wrote:
       | I'm liking the effort to make new, no-longer-saturated
       | benchmarks. I'll also be a bit suspicious if some model aces it
       | -- matching OSS maintainers' taste _more often_ is a plausible
       | improvement in quality but if they nail it _every time_ they've
       | been memorizing.
       | 
       | Not saying FrontierCode should've done this, but benchmarking
       | _the interaction_ would be interesting. That is, if I get a diff
       | with a blocking problem but writing a comment gets it fixed, that's
       | a _lot_ different from if the model has hit a wall. Better, if
       | there's a problem but the model flagged it in a short list of
       | questions or worries to me before or after coding, it can get
       | sorted without taking much of my time. Stick an LLM in the loop
       | instructed to behave like a user or reviewer with some rubric-ish
       | info that wasn't in the prompt. Then, look at how much the
       | pretend user has to do to get to a quality result with a given
       | model, if they can get to one at all.
       | 
       | You could say 'why worry about interaction? the goal is the model
       | just gets it perfect' but I think that imagined end state just is
       | not a thing: tasks will get bigger but there will still be
       | interaction. Handling comments and asking good clarifying
       | questions when needed are real capabilities. Human SWEs interact
       | plenty and real engineering has a certain density of questions
       | about requirements, taste, and other big vague things.
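
A minimal sketch of the reviewer-in-the-loop idea above, assuming the agent and the simulated reviewer are plain callables. Everything here is hypothetical: `rounds_to_accept`, the toy agent, and the toy reviewer stand in for a coding model and an LLM reviewer holding rubric-ish information that was not in the prompt.

```python
from typing import Callable, Optional


def rounds_to_accept(
    agent: Callable[[str, list[str]], str],
    reviewer: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 5,
) -> Optional[int]:
    """Count review comments needed before the patch is accepted.

    Returns None if the agent is still rejected after max_rounds comments.
    """
    feedback: list[str] = []
    for _ in range(max_rounds + 1):
        patch = agent(task, feedback)
        comment = reviewer(patch)  # None means the reviewer accepts the patch
        if comment is None:
            return len(feedback)
        feedback.append(comment)
    return None


# Toy stand-ins: the agent only adds tests once a reviewer asks for them.
def toy_agent(task: str, feedback: list[str]) -> str:
    patch = "fix bug"
    if any("tests" in comment for comment in feedback):
        patch += " + tests"
    return patch


def toy_reviewer(patch: str) -> Optional[str]:
    return None if "tests" in patch else "please add tests"


def stubborn_agent(task: str, feedback: list[str]) -> str:
    return "fix bug"  # ignores all feedback


print(rounds_to_accept(toy_agent, toy_reviewer, "fix the parser"))       # 1
print(rounds_to_accept(stubborn_agent, toy_reviewer, "fix the parser"))  # None
```

The number of comments (or None for "hit a wall") is one possible measure of how much the pretend user has to do with a given model.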
        
       ___________________________________________________________________
       (page generated 2026-06-09 06:00 UTC)