[HN Gopher] AutoCodeRover: Autonomous Program Improvement
___________________________________________________________________
AutoCodeRover: Autonomous Program Improvement
Author : mechtaev
Score : 87 points
Date : 2024-04-09 10:56 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mechtaev wrote:
| https://arxiv.org/abs/2404.05427
| draugadrotten wrote:
| did someone here replicate this on their own code?
| wsdookadr wrote:
| At the time of writing, their repo is 12h old. The training
| time isn't stated in the paper. I'm thinking maybe one of these
| robots can replicate this and tell us how it went.
| dboreham wrote:
| See post above. It is expected to be runnable by anyone from
| the git repo contents.
| juujian wrote:
| And the other 78% of the time it just creates a bunch of noise
| that someone has to sift through?
| egeozcan wrote:
| In my experience, that's better than the percentage of usually
| well-intentioned but nevertheless unusable PRs that popular
| repositories get.
| ectopasm83 wrote:
| The point is that the success rate keeps improving, paper
| after paper.
|
| > The baseline results of Magis (10%), Devin (14%) are
| evaluated in another subset of SWE-bench, which we cannot
| directly compare with, so we take the results from their
| technical reports as a reference.
|
| Wondering how it compares with these models.
| invalidusernam3 wrote:
| Why not use AutoCodeRover, Magis, and Devin together for 46%
|
| /s
| arp242 wrote:
| Here's a list of all the successful and unsuccessful patches:
| https://gist.github.com/arp242/0dc5dab0f7cd10e663cfc26866651...
|
| Ideally, it should also include the problem statement, but
| that's not in their JSON file and I can't be arsed to continue
| working on it - it's just a quick script I cooked up.
|
| I find it very hard to judge the quality of most of these
| patches because I'm not familiar with these projects.
|
| However, looking at the SWE-bench dataset I don't think it's
| representative of real-world issues, so "22% of real-world
| GitHub issues" is not really accurate regardless.
| wsdookadr wrote:
| What makes you say it's not representative?
| arp242 wrote:
| Look at the data. Does that seem like the average bug
| report to you?
| falcor84 wrote:
| It would help if you were to provide a specific example
| or two
| arp242 wrote:
| You can't demonstrate whether a dataset is representative
| or not by "an example or two". You need to look at all
| the data.
|
| And all of this is fine. It's just a benchmark suite and
| doesn't _need_ to be fully representative. The dataset itself
| doesn't even claim to be that as far as I can find. All I'm
| saying is that the title wasn't really accurate.
| skywhopper wrote:
| SWE-bench Lite is a subset of extremely simple issues from
| a cherry-picked subset (SWE-bench) of a handful of large
| (presumably well-run) Python-only projects.
|
| Here are some rules they used to trim down the SWE-bench Lite
| problems (roughly sketched in code after the list):
|
| * We remove instances with images, external hyperlinks,
| references to specific commit shas and references to other
| pull requests or issues.
|
| * We remove instances that have fewer than 40 words in the
| problem statement.
|
| * We remove instances that edit more than 1 file.
|
| * We remove instances where the gold patch has more than 3
| edit hunks (see patch).
|
| See https://www.swebench.com/lite.html
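|
| For concreteness, here is a rough sketch of what those filters
| amount to. The field names (problem_statement, patch) follow
| the published SWE-bench dataset layout; the regexes are
| illustrative, not the authors' exact rules:
|
|     import re
|
|     def count_files(patch: str) -> int:
|         # Each changed file in a unified diff starts with a
|         # "diff --git" header.
|         return len(re.findall(r"^diff --git", patch,
|                               flags=re.MULTILINE))
|
|     def count_hunks(patch: str) -> int:
|         # Each edit hunk starts with an "@@" header.
|         return len(re.findall(r"^@@", patch,
|                               flags=re.MULTILINE))
|
|     def keep(instance: dict) -> bool:
|         text = instance["problem_statement"]
|         # Images, hyperlinks, commit shas, issue/PR refs.
|         if re.search(r"https?://|\.(png|jpe?g|gif)"
|                      r"|#\d+|\b[0-9a-f]{7,40}\b", text):
|             return False
|         if len(text.split()) < 40:
|             return False
|         if count_files(instance["patch"]) > 1:
|             return False
|         if count_hunks(instance["patch"]) > 3:
|             return False
|         return True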
| kevindamm wrote:
| That's... rather limiting.
| yuntong wrote:
| The problem statement of each issue is included in each
| result folder as `problem_statement.txt` (such as:
| https://github.com/nus-apr/auto-code-
| rover/blob/main/results...).
|
| The developer patch for each issue is similarly included as
| `developer_patch.diff`.
| yourapostasy wrote:
| In short, no.
|
| The ArXiv paper mentions the human developer must supply a unit
| test that issues a pass-fail signal (such a test could
| conceivably be written with the assistance of an AI agent, if
| not autonomously, but their experiment relies on human-written
| unit tests). So the 78% of failures are clearly identified, at
| the cost of implementing TDD for the issue. The side-effects
| story is punted on, but I'd still take this over the nothing we
| have today.
|
| Of course, over a relatively short amount of time using this,
| I'd expect the 22% (or whatever the real rate is) success rate
| to drop asymptotically towards zero as the low-hanging fruit of
| the approach is mined out and it becomes something like another
| linter in our CI/CD pipelines.
|
| The impact of this tooling upon staff skills development will
| be interesting to say the least.
| yuntong wrote:
| AutoCodeRover does not require or assume a unit test to
| generate patches. The results discussed in Section 6.1 of the
| ArXiv paper are generated without any unit test. The unit
| tests are used by SWE-bench when evaluating the correctness
| of AutoCodeRover-generated patches.
|
| That being said, when some unit tests are available (either
| written by developers or with assistance from other tools),
| AutoCodeRover can make use of them to perform some analysis
| like Spectrum-based Fault Localization (SBFL). This kind of
| analysis output can help the agent in pinpointing locations
| to fix. (Please see Section 6.2 for the analysis on SBFL.)
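|
| For intuition, here is a minimal SBFL sketch using the Ochiai
| suspiciousness formula, a common choice in the fault-
| localization literature; AutoCodeRover's exact scoring may
| differ:
|
|     from math import sqrt
|
|     def ochiai(failed_cov, passed_cov, total_failed):
|         # failed_cov:   failing tests covering this line
|         # passed_cov:   passing tests covering this line
|         # total_failed: total number of failing tests
|         denom = sqrt(total_failed * (failed_cov + passed_cov))
|         return failed_cov / denom if denom else 0.0
|
|     # coverage[line] = (covered-by-failing, covered-by-passing)
|     coverage = {"foo.py:10": (2, 0),
|                 "foo.py:11": (2, 8),
|                 "foo.py:42": (0, 5)}
|     ranked = sorted(coverage,
|                     key=lambda l: -ochiai(*coverage[l], 2))
|     # Most suspicious lines first: candidate fix locations.
|     print(ranked)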
| dboreham wrote:
| > AutoCodeRover does not require or assume a unit test to
| generate patches.
|
| You have this backwards: it's traditional (at least in the
| past 15 years or so) to have a test to go along with every
| code change. The idea is that the test proves a) the bug
| existed prior to the fix and b) the bug is not there after
| the fix is applied. Commenters here are noting that ACR
| generates fixes but not tests.
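|
| For illustration, that convention looks roughly like this in
| pytest; the function, its behavior, and the issue number are
| all made up:
|
|     # Hypothetical regression test for a made-up issue #1234:
|     # it fails on the unfixed code (proving the bug existed)
|     # and passes once the patch is applied.
|
|     def parse_header(raw: str) -> dict:
|         # Post-fix behavior: empty input yields {} instead of
|         # raising IndexError as the buggy version did.
|         if not raw:
|             return {}
|         key, value = raw.split(":", 1)
|         return {key.strip(): value.strip()}
|
|     def test_issue_1234_empty_header():
|         assert parse_header("") == {}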
| yuntong wrote:
| The previous comment was describing the experiment
| settings. AutoCodeRover currently generates patches.
| Auto-generating high-quality tests can be a parallel
| effort and another direction to explore. These efforts
| can eventually be used together.
| dboreham wrote:
| The point is that a patch without a test is not generally
| a useful thing. How do we know the AI-generated patches
| are valid?
| nfm wrote:
| I agree in principle, but if it also generated a test,
| how would you know that was valid?
|
| The value I get from copilot is the ability to code
| _faster_, not the ability _to_ code.
| abhikrc wrote:
| The short answer is that unit tests are not needed in
| AutoCodeRover. The technique proceeds by a sophisticated code
| search starting from the Github issue. tests are not needed.
| The code search helps in setting the context for LLM agents -
| which can help in the patch construction.
|
| If tests are available, they can give additional help in
| setting code context. But tests are not needed, and most
| GitHub issues are solved without tests.
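|
| As a simplified illustration of that code search: the paper
| describes search APIs the agent can call, along the lines of
| search_class or search_method. The sketch below paraphrases
| the idea and is not the actual implementation:
|
|     import ast
|     import pathlib
|
|     def search_class(repo: str, name: str) -> list[str]:
|         # Return locations of class definitions matching
|         # `name`, to be stuffed into the LLM's context.
|         hits = []
|         for path in pathlib.Path(repo).rglob("*.py"):
|             try:
|                 tree = ast.parse(path.read_text(errors="ignore"))
|             except (SyntaxError, ValueError):
|                 continue
|             for node in ast.walk(tree):
|                 if (isinstance(node, ast.ClassDef)
|                         and node.name == name):
|                     hits.append(f"{path}:{node.lineno} "
|                                 f"class {node.name}")
|         return hits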
|
| All experimental numbers appear in the arxiv paper. Please
| let us know if you have more questions.
| dboreham wrote:
| > tests are not needed
|
| Strong words!
| wsdookadr wrote:
| There are actual SWE jobs where humans sift through this kind
| of noise. Someone told me they worked such a job recently. It's
| a good tool to add pressure and raise expectations. Maybe this
| is the future...
| skywhopper wrote:
| They only know the 22% number because unit tests to check for
| a fix are included in the benchmark. In other words, in a
| real world situation, the human would still need to double
| check. The patches this tool generates do not include
| appropriate tests or explanations and would never pass code
| review by a qualified human.
| DalasNoin wrote:
| Just about a week ago OpenDevin got about 13% on this
| benchmark. Just give it a few more weeks.
|
| edit: apparently it's not the exact same benchmark but a
| similar one
| moolcool wrote:
| If it continues at this pace, then it'll solve 108% of GitHub
| issues in just 3 months
| Morelesshell wrote:
| If a ticket is open and AutoCodeRover just says "was unable to
| find something", it's still better to have 22% fixed
| automatically.
| skywhopper wrote:
| But it doesn't say that. It submits a patch that doesn't
| solve the problem instead.
| maleldil wrote:
| LLMs are unable to say that they don't know something. They
| just generate nonsense.
| skywhopper wrote:
| Yes, and to be clear, the benchmark used here is merely the 300
| simplest problems in the larger benchmark suite, which itself
| is only a tiny subset of issues from a dozen large (and
| presumably well-curated) Python projects.
|
| Not to mention that making the code fix is only a tiny part of
| resolving an issue. There should also be explanations and added
| test cases. In other words, I doubt the 22% of "fixes" would
| pass review by the project owner if a human submitted them.
| mtlynch wrote:
| > _As an example, AutoCodeRover successfully fixed issue #32347
| of Django._
|
| This bug was fixed three years ago in a one-line change.[0]
| Presumably the fix was already in the training data.
|
| [0] https://github.com/django/django/pull/13933
| dboreham wrote:
| I wondered that too, but the fix it produces is not the same.
|
| Another thing that seemed odd is the English style used in the
| responses (watch the video full screen and you can read it).
| lewhoo wrote:
| My understanding is that all of the issues in SWE-bench have
| at least a corresponding pull request.
| skywhopper wrote:
| The important detail is when the problem was solved. If it
| was three years ago, then it was likely captured in the
| training data for the model.
| withinboredom wrote:
| I would be interested to see how it performs on end-user
| software, where bug reports are nebulous at best and ridiculous
| at worst. Furthermore, most of those issues tend to be upstream
| bugs and very rarely have anything to do with the actual
| software.
| abhikrc wrote:
| The entire setup is available for inspection; please see
|
| https://github.com/nus-apr/auto-code-rover
|
| if you need example bugs we can provide that too. Some examples
| also appear in the arxiv paper, please see
|
| https://arxiv.org/abs/2404.05427
| joenot443 wrote:
| This is super fascinating stuff, excellent work. As most of
| us don't have the time to read the entirety of the paper, are
| you able to link directly to some issues which have been
| landed and closed? Some personal favorites would be awesome
| :)
|
| I think I speak for others when I say the best way to judge
| the efficacy of this project is some real-world, on-site
| examples of it being used in prod. I'm especially curious
| about its performance on feature-request or flaky bug-report
| type issues as opposed to reliable test failures. I expect
| the former is much tougher!
| dboreham wrote:
| fwiw the example issue highlighted in the post was already
| fixed by a human 3 years ago, so I wouldn't expect to see
| much in the way of real-life fixed issues yet.
| yuntong wrote:
| Thank you for your interest. There are some interesting
| examples in the SWE-bench-lite benchmark which are resolved
| by AutoCodeRover:
|
| - From sympy: https://github.com/sympy/sympy/issues/13643.
| AutoCodeRover's patch for it: https://github.com/nus-
| apr/auto-code-rover/blob/main/results...
|
| - Another one from scikit-learn: https://github.com/scikit-
| learn/scikit-learn/issues/13070. AutoCodeRover's patch
| (https://github.com/nus-apr/auto-code-
| rover/blob/main/results...) modified a few lines below
| (compared to the developer patch) and wrote a different
| comment.
|
| There are more examples in the results directory
| (https://github.com/nus-apr/auto-code-
| rover/tree/main/results).
| dboreham wrote:
| What's actually going on here? I watched the video of the example
| problem solution and it looks like either magic or fake. It
| doesn't produce the same PR as the real bug fix.
| abhikrc wrote:
| The entire setup is available for inspection from
|
| https://github.com/nus-apr/auto-code-rover
|
| Please try it out and send email to the contact addresses on
| that webpage if you have any questions.
| dboreham wrote:
| Ok thanks. I haven't run it yet, but this does tell me that
| it's using OpenAI.
|
| Is it expected to be able to solve arbitrary (simple) bugs,
| or only the list of bugs in the benchmark set?
| noname120 wrote:
| Author published a ready-to-use Docker image:
| https://hub.docker.com/r/yuntongzhang/auto-code-rover/tags
| rrr_oh_man wrote:
| So this works for repositories with decent unit tests.
|
| Which excludes 80+% of real-world bug and feature issues, in
| my experience...
| abhikrc wrote:
| No, it does not need a unit test to work. We responded to a
| similar question from another user.
| prmph wrote:
| Then how can you have confidence that it actually fixes the
| bug? It means you still need a human to review the fix, no?
| yuntong wrote:
| The developer-written test cases are provided in SWE-bench-
| lite, so those can be used to check the generated patches.
|
| The auto-generated patches are meant to reduce the effort of
| resolving issues. In practice, they should be reviewed and
| verified by human developers before they are integrated.
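|
| A hypothetical sketch of that checking loop: apply a generated
| patch to a clean checkout, then run the developer-written
| tests that SWE-bench-lite ships for the instance. The repo
| path, patch file, and test command here are made up:
|
|     import subprocess
|
|     def patch_resolves_issue(repo, patch_file, test_cmd):
|         # Apply the candidate patch to a clean checkout.
|         subprocess.run(["git", "-C", repo, "apply", patch_file],
|                        check=True)
|         # Developer tests pass => patch counts as plausible.
|         return subprocess.run(test_cmd, cwd=repo).returncode == 0
|
|     # e.g. patch_resolves_issue("django", "fix.diff",
|     #                           ["pytest", "tests/responses"])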
| rrr_oh_man wrote:
| Thank you for the clarification. And shame on me for talking
| out of my ass!
| helboi4 wrote:
| 22% is a hilariously low percentage to use as a tagline. I do
| hope it gets better.
| cromka wrote:
| 22% less issues for free is bad?
| lionkor wrote:
| No, but only 22% of tackled issues being resolved correctly
| hints at how bad it is. I'd guess that even within that 22%,
| most have bugs or miss edge cases, considering it completely
| failed to solve the other 78%.
|
| If someone gets 20% on an exam I don't go "great, that's 20%
| of the way there!!!", instead I go "you clearly didn't
| attend, try again next time".
| bigyikes wrote:
| > If someone gets 20% on an exam I don't go "great, that's
| 20% of the way there!!!", instead I go "you clearly didn't
| attend, try again next time".
|
| Sure, if little Bobby gets a 20% I'll whoop his ass, but if
| the inanimate hunk of metal on my desk gets a 20% I might
| start to take notice.
| helboi4 wrote:
| Sure, it's sorta cool and promising for technology in the
| future. But as of now, we don't know how badly it might
| fuck up the other 78% of cases. If it fucks up like 50%
| of cases so badly that it takes more time for the devs to
| fix than it would usually, then it's a liability.
| dboreham wrote:
| Fewer.
| adastra22 wrote:
| It's not free.
| pnathan wrote:
| How well does AutoCodeRover work with compiled languages
| such as Java, Go, or Rust?
|
| The local code search idea to get around context limits is cool.
| Have you experimented with Anthropic's models for the larger
| context limit and dropping the code search?
| bilater wrote:
| Excited to see how badly the comments here age over the next few
| months.
___________________________________________________________________
(page generated 2024-04-09 23:01 UTC)