[HN Gopher] Some critical issues with the SWE-bench dataset
       ___________________________________________________________________
        
       Some critical issues with the SWE-bench dataset
        
       Author : joshwa
       Score  : 287 points
       Date   : 2025-02-21 17:59 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | acc_297 wrote:
       | > 32.67% of the successful patches involve cheating as the
       | solutions were directly provided in the issue report or the
       | comments.
       | 
       | Is this what Hofstadter means by a strange-loop?
        
         | andrepd wrote:
         | Turns out "AI deep research reasoning agent" was just "we can
         | print the training set"
        
           | thegeomaster wrote:
           | ...by piping it through the world's most inefficient echo
           | function.
        
             | xrd wrote:
              | You should immediately publish a paper on arXiv with your
             | revolutionary IEF brand, an improvement on transformers and
             | mamba architectures. Then, like Ilya, take $1B in funding
             | the following week.
        
             | sva_ wrote:
             | That reminds me of someone calling the Bitcoin blockchain
             | the most expensive linked list in the world.
        
               | wongarsu wrote:
               | The difference is that Bitcoin is designed to be "just"
               | an append-only* timestamped linked list, with some rules
                | about what a new node must look like in order to be
                | successfully appended. Making the creation of a canonical
                | linked list possible between hostile actors is the whole
                | innovation. The currency stuff is "just" a cool practical
                | application tacked onto the linked list.
                | 
                | LLMs, by contrast, are not designed to just repeat what's
                | already in the instructions, no matter which stance on
                | LLM design you subscribe to.
               | 
               | * exceptions apply
        
       | modeless wrote:
       | > When we filtered out these problematic issues, the resolution
       | rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.
       | 
       | This matches my intuition about the coding performance of these
       | models a lot better. I don't think any current coding benchmark
       | accurately measures coding performance.
        
         | alfalfasprout wrote:
          | Yep, anecdotally that's basically spot on. It's also one of
          | the reasons that I still find Copilot vastly more useful than
          | highly autonomous AI tooling (cursor, roocode, avante, etc.)
        
         | OsrsNeedsf2P wrote:
          | Anecdotal, but I was always shocked to see Claude 3.5 perform
          | so poorly in the benchmarks when it generates 80% of my code
          | in Cursor (and in the cases where it fails, no other model
          | succeeds).
        
           | TheDong wrote:
           | Different people seem to get wildly different results here,
           | and I'm not sure what percentage is down to the type of
           | software being built vs the usage patterns.
           | 
           | In my case, I would guess less than 10% of the code I get out
           | of AIs is useful.
           | 
           | What sort of code are you getting those results with? Is it
           | yet-another-react-frontend-button? Is it ebpf programs? Is it
           | a parser in rust?
           | 
            | For the latter two, I've found AI to have pretty low success
            | rates, and for the former I haven't had the desire to try.
        
             | aprilthird2021 wrote:
             | My gut tells me the AIs will be best for small web projects
             | that are greenfield. The kind a 1-3 person team could
             | maintain.
             | 
             | And my gut tells me they are the worst for the kinds of
             | long-established software conglomerates many professionals
             | work at, which have tons of internal services, integrated
             | acquisitions, etc. etc.
             | 
             | Ultimately the AI is good at what the average developer
             | online is good at, probably full-stack web dev of projects
             | from scratch.
        
               | dingnuts wrote:
               | but that kind of code is so easy to write, and code is
               | already way more terse than natural language! it's
               | literally more typing to explain to an LLM how to write
               | some greenfield web CRUD than it is to just type out the
               | code, and if there's a lot of boilerplate it's faster to
               | generate the repetitive parts with keyboard macros!
               | 
                | where's the value everyone on this site and on LinkedIn
                | (but NO ONE in my real or professional life) seems to
                | get?
               | 
               | I feel like I'm being gaslit when people say Cursor
               | writes 80% of their code, and honestly, it's the
               | conclusion that makes the most sense to me -- the people
               | making these posts must be well-invested in the startups
               | that stand to profit if AI is actually as good as they
               | say. You know, shills.
        
               | pdntspa wrote:
               | I work on web crawlers and data mining at scale and well
               | over 50% of my code output is written by AI. I use mostly
               | o1 (copying and pasting isolated snippets) or Jetbrains'
               | AI service.
               | 
               | I also have access to a full-service "junior developer"
               | AI that can take in an entire git repo at once, and its
               | code outputs are significantly less useful -- maybe 10%.
               | 
                | I think a lot of people's success rate with AI boils down
                | to their choices in language/toolkit (AI does much better
                | the more common it is) and how they prompt it.
                | 
                | Note that you still need an experienced set of eyes
                | supervising; the thought of an LLM committing to a git
                | repo without a human in the loop scares me.
        
               | botanical76 wrote:
               | Have you tried the AI intellisense models like Copilot?
               | 
               | I don't understand the notion that it is faster to
               | generate repetitive code with keyboard macros. I use Vim-
               | mode exclusively, and while I'm not a Vim master, I don't
               | think there's any set of macros that will do what Copilot
               | can do.
               | 
                | It's not that Copilot is smart. It's that 60% of what I
                | do doesn't require much intelligence to anticipate. It's
                | the 40% that matters; the remainder can be trivially
                | guessed, and that is exactly what Copilot does.
               | 
               | Maybe this will help: you need to imagine with an AI
               | intellisense that with each keystroke, you are collapsing
               | the possibility space down to a smaller, finite number of
               | outcomes. You write exactly what code you need for the
               | dumb AI to predict the rest of it.
               | 
                | There are a LOT of reasons why AI intellisense is not all
                | there yet; it can be distracting; it can try to generate
                | too much at once; none of the tools have LSP integrated,
                | so they will provide bullshit suggestions of library
                | methods that don't exist. This is all true, and yet it is
                | still highly valuable in some domains, for some people.
               | 
               | That said, if you write x86 assembly for a living, you
               | are probably out of luck.
               | 
               | (I write Kotlin, Java for Android apps and services, C++
               | that is tightly integrated with the SoC. Python and Bash
               | for command-line tools that invoke REST APIs. Copilot is
               | useful for these domains.)
        
               | doug_durham wrote:
                | Your intuition runs counter to most folks' experience. I
               | work on complex machine learning training loops and loss
               | functions. LLMs work great on that.
        
               | pitpatagain wrote:
               | I work in machine learning research: training loops and
               | loss functions are incredibly repetitive and pattern
               | filled, highly represented in the code the LLMs are
               | trained on, and typically short. They are exactly my
               | intuition of simple code that LLMs would work well on.
        
             | Philip-J-Fry wrote:
             | I'm pretty confident in my ability to write any code in my
              | main language. But AI is still very useful in just filling
              | out boilerplate, or noticing a pattern and filling out the
              | rest of some repetitive code. Or say I need to write a
              | wrapper around a common command-line utility. It's pretty
              | good at generating the code for that.
             | 
             | What I mostly enjoy using it for is just writing bash
             | scripts for me. I hate writing bash but Claude is excellent
             | at writing the scripts I need.
             | 
             | AI isn't writing software features or anything close to
             | that for me at the moment. But what it is great at is just
             | being a really excellent intellisense. Knowing what you're
             | likely to want to do in the next ~5 lines and just filling
             | it out in one button press. Things like intellisense and
             | automatic refactoring tools were big productivity
             | improvements when they became ubiquitous. AI will be the
             | same for most people, an intellisense on steroids.
             | 
             | Also, writing tests. Writing tests can be quite mundane and
             | boring. But I can just type out what I want tested, give it
             | some files as context and it can be pretty good at
             | generating some tests.
             | 
             | Does AI get it right every time? No way. But, as a
             | developer, I'd rather spend 10 minutes trying to coax an AI
             | into generating me 90% useable code for some boring task
             | than spend 20 minutes typing it out myself. Often, I
             | probably could write the code faster than I could prompt an
             | AI, but being lazy and telling something else to do the
             | work feels pretty good and relaxing.
        
               | ghaff wrote:
                | >AI is still very useful in just filling out boilerplate
               | 
               | That's what I tend to find with English writing as well.
               | It's not great. But sometimes you just need decent
               | generic prose for an introduction or an explanation of
               | something. If you know enough to adjust as needed, it can
               | save time for something that readers are probably just
               | skimming anyway. As I've written previously, about a year
               | ago I was working on cleaning up a bunch of reference
                | architectures, and I used Google's Bard in that case to
                | give me a rough draft of background intros for some of
                | them, which I modified as needed. Nothing miraculous, but
                | it saved me a bit of time.
        
             | alfalfasprout wrote:
             | Almost every time someone says "but most of my code
             | nowadays is LLM generated" it's usually one of three
             | things:
             | 
              | 1. Very greenfield work where the LLM doesn't _really_ have
              | a lot of constraints to deal with and can fully control the
              | setup + doesn't have to ingest a lot of existing context
              | 
              | 2. Very small projects that largely follow established
              | patterns (CRUD, frontends, etc.)
              | 
              | 3. Well established implementation work (the kind of
              | feature that's a simple JIRA ticket).
              | 
              | In my experience they're painfully bad at:
              | 
              | - Novel/niche work where there aren't really answers online
              | to what you're trying to do
              | 
              | - Complex refactoring
              | 
              | - Architecting within existing constraints (other systems,
              | etc.)
        
               | cyanydeez wrote:
               | I assume it's the commoditized work that made India a
               | success at outsourced activities.
        
             | ben_w wrote:
             | > For the latter two, I've found AI to have pretty low
             | rates, and for the former I haven't had the desire to try.
             | 
             | Similar. I've got a joke language project on the back
             | burner, doing it properly requires going back over my 23
             | year old university notes on yacc etc., so I tried AI...
             | the AI just makes a mess of it*.
             | 
             | For anything front end, even the original ChatGPT-3.5 model
             | is basically magic (i.e. sufficiently advanced technology).
             | 
             | * I think the last time I touched it was just before o1 was
             | announced; as o3 is now in the free tier of ChatGPT, I
             | should try again...
        
             | duped wrote:
             | I think it's frontend javascript versus everything else.
             | 
             | There's a few languages/tools I use often but am not an
             | expert in and have been using Claude 3.5 to help me work
             | with existing code. On paper this is a perfect use case. In
             | practice it's like working with an intern that has google
             | in front of them and enough jargon to convince me what
             | they're saying isn't bullshit. Eventually, I'll be able to
             | coax the answers I need out of it.
             | 
              | I'll say, though, that the fact AI can't say "I don't know"
              | (or, closely related, "that is not possible in the context
              | you've given me"), combined with the inability to reason,
              | is what gives you results that look OK but are subtly
              | trash.
        
               | cyanydeez wrote:
               | the reason AI is a giant f'n bubble is because doing an
               | exhaustive review of results means _doing the actual work
               | of millions of manhours_.
               | 
               | Instead, people squint their eyes at scrolling matrix
               | text and convince themselves it must be true.
        
             | flir wrote:
             | I strongly suspect it's a function of the amount of text it
             | was trained on.
             | 
             | The number of questions about each language/framework on
             | Stack Overflow is probably a reasonable proxy for a
             | chatbot's competence in that language.
             | 
             | (Drawing that graph might actually be a useful exercise. If
             | it outperforms expectations on really obscure languages,
             | maybe something interesting is happening).
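              | 
              | A rough sketch of pulling that graph's input data (my
              | guess at the public Stack Exchange API; the endpoint and
              | fields are assumptions, not something from this thread):
              | 
              |     import requests
              | 
              |     TAGS = ["javascript", "python", "rust", "ebpf"]
              |     base = "https://api.stackexchange.com/2.3"
              |     url = f"{base}/tags/{';'.join(TAGS)}/info"
              |     resp = requests.get(
              |         url, params={"site": "stackoverflow"}, timeout=30)
              |     # Question counts per tag, a crude proxy for how much
              |     # training text exists for each language.
              |     for item in resp.json().get("items", []):
              |         print(item["name"], item["count"])
              | 
              | Plotting those counts against per-language benchmark
              | scores would show whether chatbot competence tracks
              | training-data volume.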
        
           | serjester wrote:
           | I write a lot of Python and personally I find Claude
           | significantly worse than OpenAI's reasoning models. I really
           | feel like this varies a ton language to language.
        
           | dgunay wrote:
           | I've been using LLMs for tab autocomplete for a while and
           | just recently started trying out agentic coding AI (Copilot
           | Edits and Cline). I think the disappointing shortfall of
           | agentic AIs (at least for me) comes from the feedback loop
           | being so much looser than the autocomplete style. With
           | autocomplete, I don't have to actively think about what
           | context to feed it, and I can gently correct it if it goes in
           | the wrong direction on a line-by-line basis. With AI agents,
           | they have a lot more leeway to generate a ton of code and
           | reason themselves off the rails before you're able to step in
           | and correct them. Now granted, I am also not very good yet at
           | managing context and crafting prompts, but it feels a lot
           | harder to get good at than simply dropping an AI
           | autocompleter into an existing programming workflow. It's a
           | new paradigm.
        
           | rco8786 wrote:
           | That's been my experience too, but I would guess the problem
           | of "here is a ton of context, produce a small amount of code"
           | is significantly better suited for LLMs than "here is a
           | problem, produce a ton of code".
        
           | modeless wrote:
           | When I use Cursor I ask for specific small tasks that I know
           | it should be able to handle. Larger, open-ended tasks fail
           | almost universally for me.
        
         | siva7 wrote:
         | o3-mini and gpt-4o are so piss poor in agent coding compared to
         | claude that you don't even need a benchmark
        
           | jbellis wrote:
           | o3-mini-medium is slower than claude but comparable in
           | quality. o3-mini-high is even slower, but better.
        
           | danielbln wrote:
           | Claude really is a step above the rest when it comes to
           | agentic coding.
        
         | theturtletalks wrote:
         | I personally use Aider's Polyglot Benchmark [0] which is a bit
         | low-key and not gamed just yet. It matches my experience too
         | where Claude Sonnet 3.5 is the best and still beats the new
         | reasoning models like o3-mini, DeepSeek, etc.
         | 
         | 0. https://aider.chat/docs/leaderboards/
        
           | azinman2 wrote:
           | Would love if they put latency in this too.
        
           | nyrikki wrote:
            | Quite the corpus of Exercism tasks that were almost
            | certainly trained on, which could lead this to just doing
            | what we know LLMs/LRMs are good at... approximate retrieval.
           | 
           | https://github.com/search?q=Exercism&type=repositories
        
           | yunwal wrote:
           | Are Exercism coding exercises really low key? I thought it
           | was like the standard free platform for learning a new
           | language now
        
             | theturtletalks wrote:
             | Low-key as in many people don't check this leaderboard as
             | much as the other high profile ones.
        
           | KaoruAoiShiho wrote:
            | Sonnet is literally lower on the Aider benchmark you just
            | linked. It's only at the top with DeepSeek as architect;
            | otherwise it's lower than many others.
        
             | refulgentis wrote:
             | Let's steelman a bit: once you multiply out the edit
             | accuracy versus completion accuracy, Sonnet, on its own, is
              | within 5% of the very top entry not using Sonnet.
        
             | theturtletalks wrote:
             | Yes, but I use Cursor Composer Agent mode with Sonnet which
             | is like Aider's architect mode where 1 LLM is instructing
             | another one. Not to mention the new reasoning models can't
             | use tool calling (except o3-mini which is not multi-modal).
        
               | KaoruAoiShiho wrote:
               | Me too, cursor+sonnet is also my go to, I just didn't
               | really understand what you were getting at by pointing
               | out this benchmark. I guess it is significant that Sonnet
                | is the actual line-by-line coder here. It is the best at
                | that, and it's better than DeepSeek + any other
                | combination, and better than any other reasoner + Sonnet.
        
               | theturtletalks wrote:
               | Yes I've followed this benchmark for a while and before
               | Deepseek + Sonnet Architect took the top spot, Sonnet was
               | there alone followed by o1 and Gemini EXP. This is one of
                | the few benchmarks where Sonnet is actually on top, as my
                | experience shows; other popular ones have o3-mini and
                | DeepSeek R1 on top, which fall short in my opinion.
        
         | delusional wrote:
         | > where the resolution rates of the models drop significantly,
         | which are 0.73%, 0.55%, and 3.83%, respectively.
         | 
          | Matches my experience pretty well too. It'll usually output
          | something that a novice would assume is correct but an expert
          | can clearly identify as "know-it-all teenager forum post"
          | level stuff.
        
         | 0x20cowboy wrote:
         | It matches my experience as well.
         | 
         | I find the models very useful to chat about library
         | documentation or high level algorithm concepts, but I find the
         | code it generates to be... I don't know how else to say it...
         | really bad and often out of context.
         | 
         | I know developers who blindly follow the hype and use them to
         | generate production code. That scares the poop emoji out of me,
          | and the code reads like an asset-flipped 3D game.
        
         | avs733 wrote:
          | It is worth reflecting on this point, as much as HN seems to
          | hate the social sciences. The difficulty of measuring
          | intelligence is a challenge that several fields have struggled
          | with for decades. It is inherently hard because defining
          | intelligence and building intelligence are very closely
          | coupled. This both makes it hard to build unbiased measures
          | and makes building measures that don't affect the phenomenon
          | basically NP-hard - see the Flynn effect[0].
          | 
          | It also goes to how a lot of people misunderstand the
          | replication crisis. 'Hard science' really should replicate -
          | we should be able to filter out sources of error and variance
          | because the phenomenon (generally) isn't affected by our
          | attempts to measure it. Making social science replicate often
          | requires so much control that it is deabstracted from reality,
          | meaning the effort at replication reduces the value and
          | usefulness of the knowledge. Generalizable claims are hard
          | because the sources of variance are so much larger and more
          | complex. Speaking as someone who went through the transition
          | from engineering to social sciences, that is the concept that
          | made it hard. I started my time in social sciences with a cool
          | idea of a whole career based on just doing replication
          | studies, because science. That was... useful and stupid at the
          | same time.
         | 
         | [0] https://en.wikipedia.org/wiki/Flynn_effect
        
       | semi-extrinsic wrote:
       | So what we need is something like a versioned crowdsourced coding
       | LLM eval dataset.
       | 
        | Every quarter, you have a couple thousand volunteers each
        | provide 2 GitHub issues from the past 3 months which are
        | nontrivial to resolve and for which strong test cases exist.
        | Each volunteer then cross-checks 2 issues from other volunteers.
        | The volunteers get a 1-month free subscription to some AI
        | service in return.
       | 
       | This dataset is then published as SWE-UberBench-2025-02 or
       | something. People can then only evaluate their coding LLM on
       | datasets published after their training period.
        
         | SR2Z wrote:
         | Right, so that AI companies can freely throw this significantly
         | more valuable training data into a model and then turn around
         | and advocate for clamping down on the freedom of models.
        
         | delusional wrote:
         | And why would these "couple of thousand volunteers" help with
         | this?
        
           | rsynnott wrote:
           | And how would you ensure that all of them were really
           | volunteers and not colluding with the vendors? Like, tech
           | companies cheating on benchmarks is an old, old story
           | (personal favourite: in the dark ages, before 3D
           | acceleration, some graphics card drivers, on detecting a 2D
           | acceleration benchmark, would _simply draw the wrong thing_),
           | and I wouldn't trust at least three of the major players as
           | far as I could throw them.
        
             | delusional wrote:
              | I'm pretty sure my BIOS still contains an option to
             | "improve performance of 3dmark 8" or something similar.
        
         | nitwit005 wrote:
         | If you know some way to get people to volunteer millions of
         | dollars of free labor, there are better uses of their time than
         | evaluating LLMs.
        
       | otterley wrote:
       | I am shocked-- _shocked_ --when a vendor cheats in order to
       | increase their benchmark scores.
       | 
       | I always tell my customers to ignore benchmarks and compare
       | outcomes with their own workloads. Benchmarks are almost
       | completely useless in the real world.
        
         | Snuggly73 wrote:
         | I only trust benchmarks that I've faked myself :)
        
         | commandlinefan wrote:
         | Although I believe there's a lot of this going on, in this case
         | it just appears to be incompetence rather than malice.
        
         | adamc wrote:
         | I don't know why you are getting downrated. That is sane
         | advice.
        
       | optimalsolver wrote:
       | You need benchmarks with the following three properties:
       | 
       | 1) No known solutions, so there's no "ground truth" dataset to
       | train on
       | 
       | 2) Presumably hard to solve
       | 
       | 3) But easy to verify a solution if one is provided.
       | 
       | This, of course, is easier done on the STEM side of things, but
       | how do you automatically test creativity, or philosophical
       | aptitude?
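        | 
        | A toy Python illustration of points 2 and 3 (mine, using the
        | classic subset-sum shape, not anything from the benchmark):
        | finding an answer is hard in general, but checking a claimed
        | one is a couple of lines.
        | 
        |     from collections import Counter
        | 
        |     def verify(nums, target, candidate):
        |         # Checking is cheap: the candidate must be drawn from
        |         # nums (with multiplicity) and sum to the target.
        |         return (not (Counter(candidate) - Counter(nums))
        |                 and sum(candidate) == target)
        | 
        |     print(verify([3, 34, 4, 12, 5, 2], 9, [4, 5]))  # True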
        
         | hsuduebc2 wrote:
          | I guess it's purely subjective. Maybe some internal commission
          | when it comes to the quality of creative work?
        
       | brap wrote:
       | My own impression with SoTA models is that they're very useful
        | for coding, yet they suck ass at solving unique problems (which
        | every sufficiently large codebase has).
        
       | ukFxqnLa2sBSBf6 wrote:
       | There's a few things I'm not understanding here.
       | 
       | 1. Did the benchmark authors not review the issues and make sure
       | the solution was not present in the issue?
       | 
       | 2. Are the issues locked after they're included in the dataset?
       | You'd think they would be immutable for reproducibility.
       | 
       | 3. For the agents writing patches, is test running part of their
        | inner-loop validation? If they write a patch that makes the test
        | pass, then the job's done. Or is that validation step kept
        | secret from the agent? I don't see how, unless the tests aren't
        | part of the repo.
        
         | jbellis wrote:
          | Especially with SWE-bench Verified, I thought that was the
          | whole point of that dataset.
        
           | flakiness wrote:
            | This was also my first thought, but reading [1] again, what
            | they actually did was label things like:
            | 
            | > Whether we consider the issue description to be
            | underspecified and hence unfair to be testing on.
            | 
            | > Whether the FAIL_TO_PASS unit tests filter out valid
            | solutions
            | 
            | and a bit more. This is pointed out in the linked paper too.
            | 
            | The moral of the story, to me: don't believe the paid human
            | annotators. You can (hopefully) still believe the PhD
            | students doing these unpaid jobs as their research ;-)
           | 
           | [1] https://openai.com/index/introducing-swe-bench-verified/
        
         | sebzim4500 wrote:
         | >1. Did the benchmark authors not review the issues and make
         | sure the solution was not present in the issue?
         | 
          | I looked at a bunch of issues in the dataset when SWE-bench
          | Verified first came out, while I was trying to build
          | scaffolding to solve it, and I don't remember a single time
          | where the solution existed verbatim in the issue. I'm not
          | saying it never happens, but it would have to be rare.
         | 
         | > 2. Are the issues locked after they're included in the
         | dataset?
         | 
          | No one changes the issues in the dataset, but of course the
          | original issue on GitHub will have been resolved long ago. The
          | models don't have access to this in their context, but if they
          | were trained on GitHub there's a very real risk that they've
          | seen the solution.
         | 
         | > 3. For the agents writing patches, is test running part of
         | their inner loop validation? If they write a patch that makes
         | the test pass, then the jobs done. Or is that validation step
         | kept secret from the agent? I don't see how unless the tests
         | aren't part of the repo.
         | 
          | The tests aren't provided to the model; they are run after the
          | model has proposed its final answer.
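          | 
          | Roughly, that post-hoc check looks something like the sketch
          | below (my own approximation, not the official SWE-bench
          | harness): the agent never sees the FAIL_TO_PASS tests; its
          | final patch is applied to a clean checkout and the tests are
          | run afterwards.
          | 
          |     import subprocess
          | 
          |     def resolved(repo, patch, fail_to_pass):
          |         # Apply the model's proposed patch.
          |         subprocess.run(["git", "apply", patch],
          |                        cwd=repo, check=True)
          |         # Run only the tests the gold patch is known to fix;
          |         # the instance counts as resolved iff they now pass.
          |         r = subprocess.run(["python", "-m", "pytest",
          |                             *fail_to_pass], cwd=repo)
          |         return r.returncode == 0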
        
       | MattDaEskimo wrote:
       | There's a serious issue with benchmarks.
       | 
        | Instead of resolving it, some leaders are further complicating
        | what the benchmarks mean.
        | 
        | Such as OpenAI grading their benchmarks based on "how much money
        | they made" or "how easily a model was convinced to hand over
        | fake money".
        
       | huac wrote:
       | > 32.67% of the successful patches involve cheating as the
       | solutions were directly provided in the issue report or the
       | comments.
       | 
       | Looking at the benchmark, https://www.swebench.com/, about half
       | of scored submissions score under 1/3 correct? So they're either
       | not cheating, or not cheating effectively?
        
         | nraynaud wrote:
         | yeah, in the abstract they demoted the score from 12% to 3%, so
         | sadly retirement is not yet here :(
        
         | sebzim4500 wrote:
          | LLMs do not reliably reproduce their training data. This is
          | quite easy to demonstrate: every LLM has been trained on all
          | of Wikipedia (at minimum), and yet if you ask it a niche fact
          | mentioned once on Wikipedia, it is highly likely to get it
          | wrong.
        
           | huac wrote:
           | that comment refers to the test time inference, i.e. what the
           | model is prompted with, not to what it is trained on. this
           | is, of course, also a tricky problem (esp over long context,
           | needle in a haystack), but it should be much easier than
           | memorization.
           | 
            | anyways, another interpretation is that the model also needs
            | to decide whether the code in the issue is a reliable fix or
            | not
        
           | feznyng wrote:
           | This is why I'm a bit skeptical of the o3 results. If it's
            | spending a bunch of time reasoning, aren't the chances
            | higher that at some point in its output stream it simply
            | regurgitates a solution it saw in its training data? It
            | still needs to be clever enough to identify it as the
            | correct answer, but it's not as impressive as an original
            | solution.
        
             | sebzim4500 wrote:
             | I would guess that reasoning models would generalize better
              | (i.e. have a smaller discrepancy between stuff in the
             | training set and stuff out of it) but it would be very
             | interesting to check.
        
           | fooker wrote:
            | Larger LLMs do pretty well with this.
           | 
           | Smaller ones don't.
        
             | sebzim4500 wrote:
              | Large ones do better than small ones, but still worse than
              | I would have expected before I tested them. E.g. `o1`
              | doesn't know things which are repeated several times on
              | Wikipedia.
        
               | fooker wrote:
               | o1 is not too large, and the emphasis is on reasoning
               | rather than memorization.
               | 
               | Try the largest llama models, and phrase your prompt like
               | a sentence to be completed instead of you asking a
               | question.
        
       | bearjaws wrote:
       | I would argue almost every popular benchmark quoted by the big
       | LLM companies is tainted.
       | 
        | OAI, xAI, Anthropic, and Google all score incredibly well; then
        | you go to try and write code and it's just _okay_.
        | 
        | They claim it can do PhD-level reasoning, but here I am, not
        | trusting it on basic computational thinking.
        
         | jandrese wrote:
         | Yeah, that's true in many fields with these AI agents. They
         | demo well, but when you put them to actual work they fall right
         | on their face. Even worse, the harder the task you set for them
         | the more they lie to you. It's like hiring a junior dev from
         | one of those highly regimented societies where it's more
         | important to save face than to get the job done.
        
           | aprilthird2021 wrote:
           | Your last sentence feels kind of spot on. The lack of
           | transparency around confidence in the answer makes it hard to
           | use (and I know it would not be simple to add such a thing)
        
             | hackernewds wrote:
             | sounds like a skill issue to be honest. you could probably
             | tell the assistant to just ask you questions when
             | information is missing instead
        
               | ryoshu wrote:
               | Programming is easy. Asking the right question is hard.
               | 
               | People don't know what questions to ask.
        
           | Xelynega wrote:
           | It's almost as if they're not trying to market to the people
           | actually using the products, but trying to convince investors
           | of features that don't exist
        
             | ilrwbwrkhv wrote:
              | It's the good old Elon Musk playbook, spread out across
              | the industry.
        
         | hackernewds wrote:
         | The popular benchmarks are the ones that have already leaked.
         | think about it
        
         | vonneumannstan wrote:
         | >They claim it can do PHD level reasoning, but here I am not
         | trusting it on basic computational thinking.
         | 
         | Not sure that's really the claim. I think they claim that
          | performance on benchmarks like GPQA indicates PhD-level
          | knowledge of different fields.
        
           | AyyEye wrote:
            | That is the message; it's just usually not stated in such a
            | succinct and deniable way.
        
         | washadjeffmad wrote:
          | To be totally fair, using PhD as a barometer of anything,
          | without specifying in what, is like claiming that LLMs have
          | encyclopedic knowledge while meaning a children's
          | encyclopedia.
        
       | htrp wrote:
       | Paper from October 2024
        
       | shayanh wrote:
       | I found that this paper was submitted to ICLR, but got rejected:
       | https://openreview.net/forum?id=pwIGnH2LHJ
       | 
       | To me the analysis of SWE-Bench is a solid contribution and
        | informative. My guess is that to meet the conference's
        | submission bar they had to come up with their own benchmark
        | (SWE-Bench+), which wasn't thorough enough, and the paper got
        | rejected mainly because of that.
        
         | vonneumannstan wrote:
          | Acceptance or rejection at big ML conferences doesn't seem to
          | carry much signal either way anymore. They're completely
          | saturated by grift and poor quality, so each paper should be
          | evaluated independently of its conference status, imo.
        
       | OldGreenYodaGPT wrote:
       | > solutions were directly provided in the issue report or the
       | comments
       | 
        | This is fine; many of my real tickets already explain the
        | solution. A good ticket often offers a solution or points out
        | where to start looking.
        
         | softwaredoug wrote:
         | Yep that's fine for an issue, but a problem if you're trying to
         | eval whether AIs can solve coding problems.
        
       | ionwake wrote:
        | I was wondering how long this would take to surface. You can
        | tell a surprising amount just by carefully watching how the
        | trainers answer interview questions, which is kinda meta really.
        
       | comex wrote:
       | Some of the examples in the paper seem to be wrong.
       | 
       | For django-31056, they claim the AI-generated patch is
       | "incomplete" because it's "missing critical parts of this logic,
       | such as the try-except block and the check for a running event
       | loop.". But if you look at the diff, that's clearly wrong. The
       | try-except block and running check were _already there_ before
       | the patch. The human patch just indented them, making them appear
        | as both - and +, while the AI patch didn't. To me, the AI patch
        | seems correct. It's slightly less efficient than the human patch
        | when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly _more_
        | efficient when it isn't (which is the common case!). The human
       | patch does feel more natural, but the AI patch is fine. I'd grade
       | it a tie between human and AI.
       | 
       | For django-32517, they claim that the human and AI patches
       | "produce entirely different outputs", but actually they do
       | exactly the same thing. The human version has
       | `reversed(self.dict)`, while the AI version has
        | `reversed(self.dict.keys())`. `reversed` gives you a reverse
        | iterator over the object, and iterating over a dictionary in
        | Python just gives you the keys, so it doesn't matter whether
        | you call `.keys()`
       | first. The human patch is more idiomatic, but it's also more
       | confusing, as shown by the fact that it confused the authors of
       | this paper. I'd grade it another tie.
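        | 
        | A quick sketch of that equivalence (mine, not from the paper),
        | assuming Python 3.8+, where dicts and their key views are both
        | reversible:
        | 
        |     d = {"a": 1, "b": 2, "c": 3}
        |     # Iterating a dict yields its keys, so both calls walk the
        |     # keys in reverse insertion order.
        |     print(list(reversed(d)))         # ['c', 'b', 'a']
        |     print(list(reversed(d.keys())))  # ['c', 'b', 'a']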
       | 
       | Edit: I tried to sign up for OpenReview so I could leave a
       | comment about this, but the system wouldn't let me register
       | without completing a form that assumes you have an academic
       | position. Perhaps I should email the authors.
        
         | fourpostmaun2 wrote:
         | The entire premise of this paper is false. They claim that the
         | "hints_text" is used and leaks the answer in Section 2.1.1;
         | however, the authors of SWE-Bench themselves state that this is
         | not used anywhere (Issue #133 on the official SWE-Bench
         | GitHub).
         | 
         | According to the paper:
         | 
         | > 1. Solution leak: represents instances where the solution to
         | the issue is clearly outlined in the issue description or
         | comments on GitHub. Since both the issue descriptions and
         | comments (referred to as hints_text in the SWE-Bench study) are
         | provided as input to the models, these LLM models can extract
         | the solutions directly from this information instead of
         | generating it independently.
         | 
         | And yet, the SWE-Bench authors themselves explicitly state:
         | 
         | > In short, for participating on the SWE-bench leaderboard,
         | using hints_text in any manner is not allowed. Although we
         | don't explicitly say this in the original paper, we also do not
         | make any mention of using the hints_text anywhere.
         | 
          | So, it's a made-up issue that would only occur if you deviated
          | from the paper's implementation and explicitly added a field
         | called "hints" that isn't used anywhere.
        
         | petters wrote:
          | I think you should. Looks like there is more work to do.
        
       | perrygeo wrote:
       | The solution moving forward has to be private benchmark suites. I
       | could see teams investing in their own set of programming
       | challenges and periodically re-evaluating them - similar to how
       | we would construct sets of live interview questions for
       | candidates and _qualitatively_ assess their ability.
       | 
       | It's so vital that it's not leaked and that it's fit-for-purpose
       | and manually assessed. These general purpose, public benchmarks
       | based on questionable metrics are effectively worthless to assess
       | real programming skill.
       | 
        | Case in point, as others have mentioned here: Claude scores
        | modestly on these benchmarks but performs vastly better than the
        | alternatives in practice. I don't trust Claude fully, but I
        | trust it far more than OpenAI's models; it's not even close. The
        | IRL performance advantage is not reflected in any of these
        | benchmarks.
        
       | 1024core wrote:
       | To quote Goodhart's Law: When a measure becomes a target, it
       | ceases to be a good measure.
       | 
       | Or, as in the case of LLMs and benchmarks: When a benchmark
       | becomes a target, it ceases to be a good benchmark.
        
       | dang wrote:
       | Submitted title was "SWE-Bench tainted by answer leakage; real
       | pass rates significantly lower". Normally we'd replace that with
        | the article title, in keeping with the site guideline ("_Please
        | use the original title, unless it is misleading or linkbait;
        | don't editorialize._"), but in this case the article title is so
       | generic that this is arguably misleading as well, so I took a
       | representative phrase from the abstract instead. That's
       | preferable, because it's better to use the authors' own
       | representation of their article.
       | 
       | If anyone can find a better title (i.e. more accurate and
       | neutral, preferably using language from the article itself) we
       | can change it again.
       | 
       | https://news.ycombinator.com/newsguidelines.html
        
       ___________________________________________________________________
       (page generated 2025-02-21 23:00 UTC)