[HN Gopher] Some critical issues with the SWE-bench dataset
___________________________________________________________________
Some critical issues with the SWE-bench dataset
Author : joshwa
Score : 287 points
Date : 2025-02-21 17:59 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| acc_297 wrote:
| > 32.67% of the successful patches involve cheating as the
| solutions were directly provided in the issue report or the
| comments.
|
| Is this what Hofstadter means by a strange-loop?
| andrepd wrote:
| Turns out "AI deep research reasoning agent" was just "we can
| print the training set"
| thegeomaster wrote:
| ...by piping it through the world's most inefficient echo
| function.
| xrd wrote:
| You should immediately publish a paper on arXiv with your
| revolutionary IEF brand, an improvement on transformers and
| Mamba architectures. Then, like Ilya, take $1B in funding
| the following week.
| sva_ wrote:
| That reminds me of someone calling the Bitcoin blockchain
| the most expensive linked list in the world.
| wongarsu wrote:
| The difference is that Bitcoin is designed to be "just"
| an append-only* timestamped linked list, with some rules
| on what a new node can look like in order to be
| successfully appended. Making the creation of a canonical
| linked list possible between hostile actors is the whole
| innovation. The currency stuff is "just" a cool practical
| application tacked onto the linked list.
|
| LLMs, by contrast, are not designed to just repeat what's
| already in the instructions, no matter which stance on
| LLM design you subscribe to.
|
| * exceptions apply
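|
| (As a toy illustration of that append-only, hash-linked idea,
| here is a minimal Python sketch; it is not Bitcoin's actual
| data structures or consensus rules, just the chaining concept,
| with a made-up validity check standing in for the rules on
| what a new node can look like:)
|
|     import hashlib, json, time
|
|     def block_hash(block):
|         # Hash the block's contents, including the previous
|         # block's hash, so earlier history can't be rewritten
|         # without changing every later hash.
|         data = json.dumps(block, sort_keys=True).encode()
|         return hashlib.sha256(data).hexdigest()
|
|     def append(chain, payload):
|         # Made-up validity rule; in Bitcoin this role is played
|         # by consensus validation of the candidate block.
|         if not isinstance(payload, str) or not payload:
|             raise ValueError("invalid block payload")
|         prev = block_hash(chain[-1]) if chain else "0" * 64
|         chain.append({"prev": prev, "ts": time.time(),
|                       "data": payload})
|         return chain
|
|     chain = append([], "genesis")
|     chain = append(chain, "tx: alice -> bob, 1 coin")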
| modeless wrote:
| > When we filtered out these problematic issues, the resolution
| rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.
|
| This matches my intuition about the coding performance of these
| models a lot better. I don't think any current coding benchmark
| accurately measures coding performance.
| alfalfasprout wrote:
| Yep anecdotally that's basically spot-on. It's also one of the
| reasons that I still find copilot vastly more useful than
| highly autonomous AI tooling (cursor, roocode, avante, etc.)
| OsrsNeedsf2P wrote:
| Anecdotal but I was always shocked to see Claude 3.5 perform so
| poorly in the benchmarks, when it generates 80% of my code in
| Cursor (and in the cases where it fails, no other model succeeds)
| TheDong wrote:
| Different people seem to get wildly different results here,
| and I'm not sure what percentage is down to the type of
| software being built vs the usage patterns.
|
| In my case, I would guess less than 10% of the code I get out
| of AIs is useful.
|
| What sort of code are you getting those results with? Is it
| yet-another-react-frontend-button? Is it ebpf programs? Is it
| a parser in rust?
|
| For the latter two, I've found AI to have pretty low success
| rates, and for the former I haven't had the desire to try.
| aprilthird2021 wrote:
| My gut tells me the AIs will be best for small web projects
| that are greenfield. The kind a 1-3 person team could
| maintain.
|
| And my gut tells me they are the worst for the kinds of
| long-established software conglomerates many professionals
| work at, which have tons of internal services, integrated
| acquisitions, etc. etc.
|
| Ultimately the AI is good at what the average developer
| online is good at, probably full-stack web dev of projects
| from scratch.
| dingnuts wrote:
| but that kind of code is so easy to write, and code is
| already way more terse than natural language! it's
| literally more typing to explain to an LLM how to write
| some greenfield web CRUD than it is to just type out the
| code, and if there's a lot of boilerplate it's faster to
| generate the repetitive parts with keyboard macros!
|
| where's the value everyone on this site and on LinkedIn
| (but NONE in my real or professional life) seems to get?
|
| I feel like I'm being gaslit when people say Cursor
| writes 80% of their code, and honestly, it's the
| conclusion that makes the most sense to me -- the people
| making these posts must be well-invested in the startups
| that stand to profit if AI is actually as good as they
| say. You know, shills.
| pdntspa wrote:
| I work on web crawlers and data mining at scale and well
| over 50% of my code output is written by AI. I use mostly
| o1 (copying and pasting isolated snippets) or Jetbrains'
| AI service.
|
| I also have access to a full-service "junior developer"
| AI that can take in an entire git repo at once, and its
| code outputs are significantly less useful -- maybe 10%.
|
| I think a lot of people's success rate with AI boils down
| to their choices in language/toolkit (AI does much better
| the more common it is) and how they prompt it.
|
| Note that you still need an experienced set of eyes
| supervising, the thought of an LLM committing to a git
| repo without a human in the loop scares me.
| botanical76 wrote:
| Have you tried the AI intellisense models like Copilot?
|
| I don't understand the notion that it is faster to
| generate repetitive code with keyboard macros. I use Vim-
| mode exclusively, and while I'm not a Vim master, I don't
| think there's any set of macros that will do what Copilot
| can do.
|
| It's not that Copilot is smart. It's that 60% of what I
| do doesn't require much intelligence to anticipate. It's
| the 40% that matters; the remainder can be trivially
| guessed, and that is exactly what Copilot does.
|
| Maybe this will help: you need to imagine with an AI
| intellisense that with each keystroke, you are collapsing
| the possibility space down to a smaller, finite number of
| outcomes. You write exactly what code you need for the
| dumb AI to predict the rest of it.
|
| There are a LOT of reasons why AI intellisense is not all
| there yet; it can be distracting; it can try to generate
| too much at once; none of the tools have LSP integrated,
| so it will provide bullshit suggestions of library
| methods that don't exist. This is all true, and yet it is
| still highly valuable in some domains, for some people.
|
| That said, if you write x86 assembly for a living, you
| are probably out of luck.
|
| (I write Kotlin, Java for Android apps and services, C++
| that is tightly integrated with the SoC. Python and Bash
| for command-line tools that invoke REST APIs. Copilot is
| useful for these domains.)
| doug_durham wrote:
| Your intuition runs counter to most folks' experience. I
| work on complex machine learning training loops and loss
| functions. LLMs work great on that.
| pitpatagain wrote:
| I work in machine learning research: training loops and
| loss functions are incredibly repetitive and pattern
| filled, highly represented in the code the LLMs are
| trained on, and typically short. They are exactly my
| intuition of simple code that LLMs would work well on.
| Philip-J-Fry wrote:
| I'm pretty confident in my ability to write any code in my
| main language. But AI is still very useful in just filling
| out boilerplate, or noticing a pattern and filling out the
| rest of some repetitive code. Or say, I need to write a
| wrapper around a common command-line utility. It's pretty
| good at generating the code for that.
|
| What I mostly enjoy using it for is just writing bash
| scripts for me. I hate writing bash but Claude is excellent
| at writing the scripts I need.
|
| AI isn't writing software features or anything close to
| that for me at the moment. But what it is great at is just
| being a really excellent intellisense. Knowing what you're
| likely to want to do in the next ~5 lines and just filling
| it out in one button press. Things like intellisense and
| automatic refactoring tools were big productivity
| improvements when they became ubiquitous. AI will be the
| same for most people, an intellisense on steroids.
|
| Also, writing tests. Writing tests can be quite mundane and
| boring. But I can just type out what I want tested, give it
| some files as context and it can be pretty good at
| generating some tests.
|
| Does AI get it right every time? No way. But, as a
| developer, I'd rather spend 10 minutes trying to coax an AI
| into generating me 90% useable code for some boring task
| than spend 20 minutes typing it out myself. Often, I
| probably could write the code faster than I could prompt an
| AI, but being lazy and telling something else to do the
| work feels pretty good and relaxing.
| ghaff wrote:
| >AI is still very useful in just filling out boilerplate
|
| That's what I tend to find with English writing as well.
| It's not great. But sometimes you just need decent
| generic prose for an introduction or an explanation of
| something. If you know enough to adjust as needed, it can
| save time for something that readers are probably just
| skimming anyway. As I've written previously, about a year
| ago I was working on cleaning up a bunch of reference
| architectures and I used Google's Bard in that case to
| give me a rough draft of background intros for some of
| them which I modified as needed. Nothing miraculous but
| saved me a bit of time.
| alfalfasprout wrote:
| Almost every time someone says "but most of my code
| nowadays is LLM generated" it's usually one of three
| things:
|
| 1. Very greenfield work where the LLM doesn't _really_ have
| a lot of constraints to deal with and can fully control the
| setup + doesn't have to ingest a lot of existing context
|
| 2. Very small projects that largely follow established
| patterns (CRUD, frontends, etc.)
|
| 3. Well-established implementation work (the kind of feature
| that's a simple JIRA ticket).
|
| In my experience they're painfully bad at:
|
| - Novel/niche work where there aren't really answers online
| to what you're trying to do
|
| - Complex refactoring
|
| - Architecting within existing constraints (other systems,
| etc.)
| cyanydeez wrote:
| I assume it's the commoditized work that made India a
| success at outsourced activities.
| ben_w wrote:
| > For the latter two, I've found AI to have pretty low
| rates, and for the former I haven't had the desire to try.
|
| Similar. I've got a joke language project on the back
| burner; doing it properly requires going back over my
| 23-year-old university notes on yacc etc., so I tried AI...
| the AI just makes a mess of it*.
|
| For anything front end, even the original ChatGPT-3.5 model
| is basically magic (i.e. sufficiently advanced technology).
|
| * I think the last time I touched it was just before o1 was
| announced; as o3 is now in the free tier of ChatGPT, I
| should try again...
| duped wrote:
| I think it's frontend javascript versus everything else.
|
| There's a few languages/tools I use often but am not an
| expert in and have been using Claude 3.5 to help me work
| with existing code. On paper this is a perfect use case. In
| practice it's like working with an intern that has google
| in front of them and enough jargon to convince me what
| they're saying isn't bullshit. Eventually, I'll be able to
| coax the answers I need out of it.
|
| I'll say, though, that the fact AI can't say "I don't know"
| (or, closely related, "that is not possible in the context
| you've given me"), combined with its inability to reason, is
| what gives you results that look OK but are subtly trash.
| cyanydeez wrote:
| The reason AI is a giant f'n bubble is that doing an
| exhaustive review of results means _doing the actual work
| of millions of man-hours_.
|
| Instead, people squint their eyes at scrolling matrix
| text and convince themselves it must be true.
| flir wrote:
| I strongly suspect it's a function of the amount of text it
| was trained on.
|
| The number of questions about each language/framework on
| Stack Overflow is probably a reasonable proxy for a
| chatbot's competence in that language.
|
| (Drawing that graph might actually be a useful exercise. If
| it outperforms expectations on really obscure languages,
| maybe something interesting is happening).
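|
| (A rough sketch of that exercise, assuming the public Stack
| Exchange API at api.stackexchange.com and its "total" filter;
| the endpoint and parameter names here are an assumption on my
| part, not something from the thread:)
|
|     import requests
|
|     def so_question_count(tag):
|         # Ask Stack Overflow how many questions carry this tag;
|         # the "total" filter returns just a count.
|         resp = requests.get(
|             "https://api.stackexchange.com/2.3/questions",
|             params={"tagged": tag, "site": "stackoverflow",
|                     "filter": "total"})
|         resp.raise_for_status()
|         return resp.json()["total"]
|
|     for tag in ["javascript", "python", "rust", "ocaml"]:
|         print(tag, so_question_count(tag))
|
| Plotting those counts against a per-language benchmark score
| would be the graph in question.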
| serjester wrote:
| I write a lot of Python and personally I find Claude
| significantly worse than OpenAI's reasoning models. I really
| feel like this varies a ton language to language.
| dgunay wrote:
| I've been using LLMs for tab autocomplete for a while and
| just recently started trying out agentic coding AI (Copilot
| Edits and Cline). I think the disappointing shortfall of
| agentic AIs (at least for me) comes from the feedback loop
| being so much looser than the autocomplete style. With
| autocomplete, I don't have to actively think about what
| context to feed it, and I can gently correct it if it goes in
| the wrong direction on a line-by-line basis. With AI agents,
| they have a lot more leeway to generate a ton of code and
| reason themselves off the rails before you're able to step in
| and correct them. Now granted, I am also not very good yet at
| managing context and crafting prompts, but it feels a lot
| harder to get good at than simply dropping an AI
| autocompleter into an existing programming workflow. It's a
| new paradigm.
| rco8786 wrote:
| That's been my experience too, but I would guess the problem
| of "here is a ton of context, produce a small amount of code"
| is significantly better suited for LLMs than "here is a
| problem, produce a ton of code".
| modeless wrote:
| When I use Cursor I ask for specific small tasks that I know
| it should be able to handle. Larger, open-ended tasks fail
| almost universally for me.
| siva7 wrote:
| o3-mini and gpt-4o are so piss poor in agent coding compared to
| claude that you don't even need a benchmark
| jbellis wrote:
| o3-mini-medium is slower than claude but comparable in
| quality. o3-mini-high is even slower, but better.
| danielbln wrote:
| Claude really is a step above the rest when it comes to
| agentic coding.
| theturtletalks wrote:
| I personally use Aider's Polyglot Benchmark [0] which is a bit
| low-key and not gamed just yet. It matches my experience too
| where Claude Sonnet 3.5 is the best and still beats the new
| reasoning models like o3-mini, DeepSeek, etc.
|
| 0. https://aider.chat/docs/leaderboards/
| azinman2 wrote:
| Would love if they put latency in this too.
| nyrikki wrote:
| Quite the corpus of Exercism tasks that were almost
| certainly trained on, which could lead to this measuring what
| we know LLMs/LRMs are good at... approximate retrieval.
|
| https://github.com/search?q=Exercism&type=repositories
| yunwal wrote:
| Are Exercism coding exercises really low key? I thought it
| was like the standard free platform for learning a new
| language now
| theturtletalks wrote:
| Low-key as in many people don't check this leaderboard as
| much as the other high profile ones.
| KaoruAoiShiho wrote:
| Sonnet is literally lower on the aider benchmark you just
| linked. It's only at the top with DeepSeek as architect;
| otherwise it's lower than many others.
| refulgentis wrote:
| Let's steelman a bit: once you multiply out the edit
| accuracy versus completion accuracy, Sonnet, on its own, is
| within 5% of the very top one not using Sonnet.
| theturtletalks wrote:
| Yes, but I use Cursor Composer Agent mode with Sonnet which
| is like Aider's architect mode where 1 LLM is instructing
| another one. Not to mention the new reasoning models can't
| use tool calling (except o3-mini which is not multi-modal).
| KaoruAoiShiho wrote:
| Me too, Cursor + Sonnet is also my go-to; I just didn't
| really understand what you were getting at by pointing
| out this benchmark. I guess it is significant that Sonnet
| is the actual line by line coder here. It is the best at
| that, and it's better than DeepSeek+any other combination
| and better than any other reasoner+Sonnet.
| theturtletalks wrote:
| Yes I've followed this benchmark for a while and before
| Deepseek + Sonnet Architect took the top spot, Sonnet was
| there alone followed by o1 and Gemini EXP. This is one of
| the few benchmarks where Sonnet is actually on top like
| my experience shows; other popular ones have o3-mini and
| DeepSeek r1 which fall short in my opinion.
| delusional wrote:
| > where the resolution rates of the models drop significantly,
| which are 0.73%, 0.55%, and 3.83%, respectively.
|
| Matches my experience pretty well too. It'll usually output
| something that a novice would assume is correct but an expert
| can clearly identify as "know it all teenager forum post" level
| stuff.
| 0x20cowboy wrote:
| It matches my experience as well.
|
| I find the models very useful to chat about library
| documentation or high level algorithm concepts, but I find the
| code it generates to be... I don't know how else to say it...
| really bad and often out of context.
|
| I know developers who blindly follow the hype and use them to
| generate production code. That scares the poop emoji out of me,
| and the code reads like an asset flipped 3D game.
| avs733 wrote:
| It is worth reflecting, as much as HN seems to hate the social
| sciences, on this point. But the difficulty of measuring
| intelligence is a challenge that several fields have struggled
| with for decades. It is inherently hard because defining
| intelligence and building intelligence are very closely
| coupled. This both makes it hard to build unbiased measures
| and makes building measures that don't affect the phenomenon
| itself basically NP-hard (see the Flynn effect[0]).
|
| It also goes to how a lot of people misunderstand the
| replication crisis. 'Hard science' really should replicate - we
| should be able to filter out sources of error and variance
| because the phenomena (generally) aren't affected by our
| attempts to measure them. Making social science replicate often
| requires so much control that it is divorced from reality,
| meaning the effort at replication reduces the value and
| usefulness of the knowledge. Generalizable claims are hard
| because the sources of variance are so much larger and more
| complex. Speaking as someone who went through a transition from
| engineering to social sciences, this is the concept that made
| it hard. I started my time in social sciences with a cool idea
| of a whole career based on just doing replication studies,
| because science. That was... useful and stupid at the same
| time.
|
| [0] https://en.wikipedia.org/wiki/Flynn_effect
| semi-extrinsic wrote:
| So what we need is something like a versioned crowdsourced coding
| LLM eval dataset.
|
| Every quarter, you have a couple thousand volunteers provide 2
| GitHub issues from the past 3 months, which are nontrivial to
| resolve, and where strong test cases exist. Each volunteer
| then cross-checks 2 issues from other volunteers. The volunteers
| get 1 month free subscription to some AI service in return.
|
| This dataset is then published as SWE-UberBench-2025-02 or
| something. People can then only evaluate their coding LLM on
| datasets published after their training period.
| SR2Z wrote:
| Right, so that AI companies can freely throw this significantly
| more valuable training data into a model and then turn around
| and advocate for clamping down on the freedom of models.
| delusional wrote:
| And why would these "couple of thousand volunteers" help with
| this?
| rsynnott wrote:
| And how would you ensure that all of them were really
| volunteers and not colluding with the vendors? Like, tech
| companies cheating on benchmarks is an old, old story
| (personal favourite: in the dark ages, before 3D
| acceleration, some graphics card drivers, on detecting a 2D
| acceleration benchmark, would _simply draw the wrong thing_),
| and I wouldn't trust at least three of the major players as
| far as I could throw them.
| delusional wrote:
| I'm pretty sure my bios still contains an option to
| "improve performance of 3dmark 8" or something similar.
| nitwit005 wrote:
| If you know some way to get people to volunteer millions of
| dollars of free labor, there are better uses of their time than
| evaluating LLMs.
| otterley wrote:
| I am shocked-- _shocked_ --when a vendor cheats in order to
| increase their benchmark scores.
|
| I always tell my customers to ignore benchmarks and compare
| outcomes with their own workloads. Benchmarks are almost
| completely useless in the real world.
| Snuggly73 wrote:
| I only trust benchmarks that I've faked myself :)
| commandlinefan wrote:
| Although I believe there's a lot of this going on, in this case
| it just appears to be incompetence rather than malice.
| adamc wrote:
| I don't know why you are getting downrated. That is sane
| advice.
| optimalsolver wrote:
| You need benchmarks with the following three properties:
|
| 1) No known solutions, so there's no "ground truth" dataset to
| train on
|
| 2) Presumably hard to solve
|
| 3) But easy to verify a solution if one is provided.
|
| This, of course, is easier done on the STEM side of things, but
| how do you automatically test creativity, or philosophical
| aptitude?
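|
| (Property 3 is the familiar "easy to verify, hard to find"
| asymmetry. A toy sketch in Python, using integer factoring as
| a stand-in problem of my own choosing:)
|
|     def verify_factorization(n, factors):
|         # Checking a proposed answer is cheap: multiply the
|         # factors back together and reject trivial ones.
|         product = 1
|         for f in factors:
|             if f <= 1 or f >= n:
|                 return False
|             product *= f
|         return product == n
|
|     # Finding the factors of a large semiprime is hard;
|     # checking a claimed answer is a few lines.
|     print(verify_factorization(15, [3, 5]))   # True
|     print(verify_factorization(15, [3, 4]))   # False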
| hsuduebc2 wrote:
| I guess it's purely subjective. Maybe some internal commission
| when it comes to the quality of creative work?
| brap wrote:
| My own impression with SoTA models is that they're very useful
| for coding, yet they suck ass for solving unique problems (which
| is the case for every sufficiently large codebase).
| ukFxqnLa2sBSBf6 wrote:
| There's a few things I'm not understanding here.
|
| 1. Did the benchmark authors not review the issues and make sure
| the solution was not present in the issue?
|
| 2. Are the issues locked after they're included in the dataset?
| You'd think they would be immutable for reproducibility.
|
| 3. For the agents writing patches, is test running part of their
| inner loop validation? If they write a patch that makes the test
| pass, then the job's done. Or is that validation step kept secret
| from the agent? I don't see how unless the tests aren't part of
| the repo.
| jbellis wrote:
| Especially with swe-verified, I thought that was the whole
| point of that dataset
| flakiness wrote:
| This was also my first thought, but reading [1] again, what
| they did was label things like:
|
| > Whether we consider the issue description to be
| underspecified and hence unfair to be testing on.
|
| > Whether the FAIL_TO_PASS unit tests filter out valid
| solutions
|
| and a bit more. This is pointed out in the linked paper too.
|
| The moral of the story to me is: don't believe the paid
| human annotators. You can (hopefully) still believe the PhD
| students doing these unpaid jobs as their research ;-)
|
| [1] https://openai.com/index/introducing-swe-bench-verified/
| sebzim4500 wrote:
| >1. Did the benchmark authors not review the issues and make
| sure the solution was not present in the issue?
|
| I looked at a bunch of issues in the dataset when SWE-verified
| first came out, while I was trying to build scaffolding to
| solve it, and I don't remember a single time where the solution
| existed
| verbatim in the issue. I'm not saying it never happens, but it
| would have to be rare.
|
| > 2. Are the issues locked after they're included in the
| dataset?
|
| No one changes the issues in the dataset but of course the
| original issue on github will have been resolved long ago. The
| models don't have access to this in their context, but if they
| were trained on github there's a very real risk that they've
| seen the solution.
|
| > 3. For the agents writing patches, is test running part of
| their inner loop validation? If they write a patch that makes
| the test pass, then the jobs done. Or is that validation step
| kept secret from the agent? I don't see how unless the tests
| aren't part of the repo.
|
| The tests aren't provided to the model, they are run after the
| model has proposed its final answer.
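|
| (Conceptually the harness side looks something like the sketch
| below. This is not the actual SWE-bench evaluation code, just
| the flow as described above, with made-up function and
| parameter names: apply the model's final patch to a clean
| checkout, then run the held-out FAIL_TO_PASS tests.)
|
|     import subprocess
|
|     def evaluate(repo_dir, model_patch, fail_to_pass_tests):
|         # Apply the model's proposed patch; the model produced
|         # it without ever seeing the held-out tests.
|         subprocess.run(["git", "apply", "-"], input=model_patch,
|                        text=True, cwd=repo_dir, check=True)
|         # Only now are the FAIL_TO_PASS tests executed.
|         result = subprocess.run(["pytest", *fail_to_pass_tests],
|                                 cwd=repo_dir)
|         # The instance counts as resolved if they all pass.
|         return result.returncode == 0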
| MattDaEskimo wrote:
| There's a serious issue with benchmarks.
|
| Instead of resolving it, some leaders are further complicating
| what benchmarks mean, such as OpenAI grading theirs on "how
| much money they made" or "how easily a model was convinced to
| hand over fake money".
| huac wrote:
| > 32.67% of the successful patches involve cheating as the
| solutions were directly provided in the issue report or the
| comments.
|
| Looking at the benchmark, https://www.swebench.com/, about half
| of scored submissions score under 1/3 correct? So they're either
| not cheating, or not cheating effectively?
| nraynaud wrote:
| yeah, in the abstract they dropped the score from 12% to 3%, so
| sadly retirement is not yet here :(
| sebzim4500 wrote:
| LLMs do not reliably reproduce their training data. This is
| quite easy to demonstrate: every LLM has been trained on all of
| Wikipedia (at minimum), and yet if you ask it a niche fact
| mentioned once on Wikipedia, it is highly likely to get it
| wrong.
| huac wrote:
| that comment refers to the test time inference, i.e. what the
| model is prompted with, not to what it is trained on. this
| is, of course, also a tricky problem (esp over long context,
| needle in a haystack), but it should be much easier than
| memorization.
|
| anyways, another interpretation is that the model needs to
| also decide whether the code in the issue is a reliable fix
| or not.
| feznyng wrote:
| This is why I'm a bit skeptical of the o3 results. If it's
| spending a bunch of time reasoning, aren't the chances higher
| that it simply regurgitates, somewhere in its output stream, a
| solution it saw in its training data? It still needs to
| be clever enough to identify it as the correct answer but
| it's not as impressive as an original solution.
| sebzim4500 wrote:
| I would guess that reasoning models would generalize better
| (i.e. have a smaller discrepancy between stuff in the
| training set and stuff out of it) but it would be very
| interesting to check.
| fooker wrote:
| Larger llms do pretty well with this.
|
| Smaller ones don't.
| sebzim4500 wrote:
| Large ones do better than small ones but still do worse
| than I would have expected before I tested them. E.g. `o1`
| doesn't know things which are repeated several times on
| wikipedia.
| fooker wrote:
| o1 is not too large, and the emphasis is on reasoning
| rather than memorization.
|
| Try the largest Llama models, and phrase your prompt like
| a sentence to be completed instead of asking a question.
| bearjaws wrote:
| I would argue almost every popular benchmark quoted by the big
| LLM companies is tainted.
|
| OAI, xAI, Anthropic, and Google all score incredibly well,
| then you go to try and write code and it's just _okay_.
|
| They claim it can do PhD-level reasoning, but here I am not
| trusting it on basic computational thinking.
| jandrese wrote:
| Yeah, that's true in many fields with these AI agents. They
| demo well, but when you put them to actual work they fall right
| on their face. Even worse, the harder the task you set for them
| the more they lie to you. It's like hiring a junior dev from
| one of those highly regimented societies where it's more
| important to save face than to get the job done.
| aprilthird2021 wrote:
| Your last sentence feels kind of spot on. The lack of
| transparency around confidence in the answer makes it hard to
| use (and I know it would not be simple to add such a thing)
| hackernewds wrote:
| sounds like a skill issue to be honest. you could probably
| tell the assistant to just ask you questions when
| information is missing instead
| ryoshu wrote:
| Programming is easy. Asking the right question is hard.
|
| People don't know what questions to ask.
| Xelynega wrote:
| It's almost as if they're not trying to market to the people
| actually using the products, but trying to convince investors
| of features that don't exist
| ilrwbwrkhv wrote:
| It's the good old Elon Musk playbook spread out across the
| industry.
| hackernewds wrote:
| The popular benchmarks are the ones that have already leaked.
| think about it
| vonneumannstan wrote:
| >They claim it can do PhD-level reasoning, but here I am not
| trusting it on basic computational thinking.
|
| Not sure that's really the claim. I think they claim that
| performance on benchmarks like GPQA indicate PhD level
| knowledge of different fields.
| AyyEye wrote:
| That is the message; it's just usually not stated in such a
| succinct and deniable way.
| washadjeffmad wrote:
| To be totally fair, using PhD as a barometer of anything
| without specifying what is like claiming that LLMs have
| encyclopedic knowledge while meaning a children's encyclopedia.
| htrp wrote:
| Paper from October 2024
| shayanh wrote:
| I found that this paper was submitted to ICLR, but got rejected:
| https://openreview.net/forum?id=pwIGnH2LHJ
|
| To me the analysis of SWE-Bench is a solid contribution and
| informative. My guess is that to meet the conference's
| submission bar they had to come up with their own benchmark
| (SWE-Bench+), which
| wasn't thorough enough and the paper got rejected mainly because
| of that.
| vonneumannstan wrote:
| Acceptance or rejection at big ML Conferences doesn't seem to
| carry much signal either way anymore. They're completely
| saturated by grift and poor quality, so each paper should be
| evaluated independently of its conference status, imo.
| OldGreenYodaGPT wrote:
| > solutions were directly provided in the issue report or the
| comments
|
| This is fine; many of my real tickets already explain the
| solution. A good ticket often offers a solution, or at least a
| place to start looking.
| softwaredoug wrote:
| Yep that's fine for an issue, but a problem if you're trying to
| eval whether AIs can solve coding problems.
| ionwake wrote:
| I was wondering how long this would take to surface, you can tell
| a surprising amount just by carefully watching how the trainers
| answer interview questions, which is kinda meta really.
| comex wrote:
| Some of the examples in the paper seem to be wrong.
|
| For django-31056, they claim the AI-generated patch is
| "incomplete" because it's "missing critical parts of this logic,
| such as the try-except block and the check for a running event
| loop.". But if you look at the diff, that's clearly wrong. The
| try-except block and running check were _already there_ before
| the patch. The human patch just indented them, making them appear
| as both - and +, while the AI patch didn't. To me, the AI patch
| seems correct. It's slightly less efficient than the human patch
| when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly _more_
| efficient when it isn't (which is the common case!). The human
| patch does feel more natural, but the AI patch is fine. I'd grade
| it a tie between human and AI.
|
| For django-32517, they claim that the human and AI patches
| "produce entirely different outputs", but actually they do
| exactly the same thing. The human version has
| `reversed(self.dict)`, while the AI version has
| `reversed(self.dict.keys())`. `reversed` treats the object as an
| iterator, and iterating over a dictionary in Python just gives
| you the keys, so it doesn't matter whether you call `.keys()`
| first. The human patch is more idiomatic, but it's also more
| confusing, as shown by the fact that it confused the authors of
| this paper. I'd grade it another tie.
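|
| (A quick check of that claim in plain Python; both spellings
| iterate the dict's keys in reverse insertion order:)
|
|     d = {"a": 1, "b": 2, "c": 3}
|
|     # reversed() on a dict (Python 3.8+) yields its keys in
|     # reverse insertion order; .keys() is a view of those same
|     # keys, so the two expressions are interchangeable here.
|     assert list(reversed(d)) == ["c", "b", "a"]
|     assert list(reversed(d.keys())) == ["c", "b", "a"]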
|
| Edit: I tried to sign up for OpenReview so I could leave a
| comment about this, but the system wouldn't let me register
| without completing a form that assumes you have an academic
| position. Perhaps I should email the authors.
| fourpostmaun2 wrote:
| The entire premise of this paper is false. They claim that the
| "hints_text" is used and leaks the answer in Section 2.1.1;
| however, the authors of SWE-Bench themselves state that this is
| not used anywhere (Issue #133 on the official SWE-Bench
| GitHub).
|
| According to the paper:
|
| > 1. Solution leak: represents instances where the solution to
| the issue is clearly outlined in the issue description or
| comments on GitHub. Since both the issue descriptions and
| comments (referred to as hints_text in the SWE-Bench study) are
| provided as input to the models, these LLM models can extract
| the solutions directly from this information instead of
| generating it independently.
|
| And yet, the SWE-Bench authors themselves explicitly state:
|
| > In short, for participating on the SWE-bench leaderboard,
| using hints_text in any manner is not allowed. Although we
| don't explicitly say this in the original paper, we also do not
| make any mention of using the hints_text anywhere.
|
| So, it's a made-up issue that would only occur if you deviated
| from the paper implementation and explicitly added a field
| called "hints" that isn't used anywhere.
| petters wrote:
| I think you should. Looks like there is more work to do.
| perrygeo wrote:
| The solution moving forward has to be private benchmark suites. I
| could see teams investing in their own set of programming
| challenges and periodically re-evaluating them - similar to how
| we would construct sets of live interview questions for
| candidates and _qualitatively_ assess their ability.
|
| It's so vital that it's not leaked and that it's fit-for-purpose
| and manually assessed. These general purpose, public benchmarks
| based on questionable metrics are effectively worthless to assess
| real programming skill.
|
| Case in point, as others have mentioned here, Claude scores
| modestly on these benchmarks but vastly better than the
| alternatives in practice. I don't trust Claude fully but far more
| than OpenAI models; it's not even close. The IRL performance
| advantage is not reflected in any of these benchmarks.
| 1024core wrote:
| To quote Goodhart's Law: When a measure becomes a target, it
| ceases to be a good measure.
|
| Or, as in the case of LLMs and benchmarks: When a benchmark
| becomes a target, it ceases to be a good benchmark.
| dang wrote:
| Submitted title was "SWE-Bench tainted by answer leakage; real
| pass rates significantly lower". Normally we'd replace that with
| the article title, in keeping with the site guideline (" _Please
| use the original title, unless it is misleading or linkbait; don
| 't editorialize._"), but in this case the article title is so
| generic that this is arguably misleading as well, so I took a
| representative phrase from the abstract instead. That's
| preferable, because it's better to use the authors' own
| representation of their article.
|
| If anyone can find a better title (i.e. more accurate and
| neutral, preferably using language from the article itself) we
| can change it again.
|
| https://news.ycombinator.com/newsguidelines.html
___________________________________________________________________
(page generated 2025-02-21 23:00 UTC)