[HN Gopher] AutoCodeRover: Autonomous Program Improvement
       ___________________________________________________________________
        
       AutoCodeRover: Autonomous Program Improvement
        
       Author : mechtaev
       Score  : 87 points
       Date   : 2024-04-09 10:56 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mechtaev wrote:
       | https://arxiv.org/abs/2404.05427
        
       | draugadrotten wrote:
       | did someone here replicate this on their own code?
        
         | wsdookadr wrote:
         | At the time of writing, their repo is 12 hours old and the
         | training time isn't stated in the paper. I'm thinking maybe
         | one of these robots can replicate it and tell us how it went.
        
         | dboreham wrote:
         | See post above. It is expected to be runnable by anyone from
         | the git repo contents.
        
       | juujian wrote:
       | And the other 78% of the time it just creates a bunch of noise
       | that someone has to sift through?
        
         | egeozcan wrote:
         | In my experience, that's better than the percentage of usually
         | well-intentioned but nevertheless unusable PRs that popular
         | repositories get.
        
         | ectopasm83 wrote:
         | The point is that the success rate is improving, paper after
         | paper.
         | 
         | > The baseline results of Magis (10%), Devin (14%) are
         | evaluated in another subset of SWE-bench, which we cannot
         | directly compare with, so we take the results from their
         | technical reports as a reference.
         | 
         | Wondering how it compares with these models.
        
           | invalidusernam3 wrote:
           | Why not use AutoCodeRover, Magis, and Devin together for 46%
           | 
           | /s
        
         | arp242 wrote:
         | Here's a list of all the successful and unsuccessful patches:
         | https://gist.github.com/arp242/0dc5dab0f7cd10e663cfc26866651...
         | 
         | Ideally, it should also include the problem statement, but
         | that's not in their JSON file and I can't be arsed to continue
         | working on it - it's just a quick script I cooked up.
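         | 
         | (For anyone who wants to redo it: the idea is roughly the
         | sketch below. The field names are hypothetical and may not
         | match the actual JSON schema.)
         | 
         |   import json
         | 
         |   # Hypothetical schema: a list of records with "instance_id"
         |   # and a boolean "resolved" flag.
         |   with open("results.json") as f:
         |       results = json.load(f)
         | 
         |   ok, fail = [], []
         |   for r in results:
         |       (ok if r.get("resolved") else fail).append(r["instance_id"])
         | 
         |   print("successful:", len(ok))
         |   print("unsuccessful:", len(fail))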
         | 
         | I find it very hard to judge the quality of most of these
         | patches because I'm not familiar with these projects.
         | 
         | However, looking at the SWE-bench dataset I don't think it's
         | representative of real-world issues, so "22% of real-world
         | GitHub issues" is not really accurate regardless.
        
           | wsdookadr wrote:
           | What makes you say it's not representative?
        
             | arp242 wrote:
             | Look at the data. Does that seem like the average bug
             | report to you?
        
               | falcor84 wrote:
               | It would help if you were to provide a specific example
               | or two
        
               | arp242 wrote:
               | You can't demonstrate whether a dataset is representative
               | or not by "an example or two". You need to look at all
               | the data.
               | 
               | And all of this is fine. It's just a benchmark suite and
               | doesn't _need_ to be fully representative. The dataset
               | itself doesn't even claim to be that as far as I can
               | find. All I'm saying is that the title wasn't really
               | accurate.
        
             | skywhopper wrote:
             | SWE-bench Lite is a subset of extremely simple issues from
             | a cherry-picked subset (SWE-bench) of a handful of large
             | (presumably well-run) Python-only projects.
             | 
             | Here are some rules they used to trim down the SWE-bench
             | Lite problems:
             | 
             | * We remove instances with images, external hyperlinks,
             | references to specific commit shas and references to other
             | pull requests or issues.
             | 
             | * We remove instances that have fewer than 40 words in the
             | problem statement.
             | 
             | * We remove instances that edit more than 1 file.
             | 
             | * We remove instances where the gold patch has more than 3
             | edit hunks (see patch).
             | 
             | See https://www.swebench.com/lite.html
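             | 
             | Roughly, those filters amount to something like the sketch
             | below (assuming each instance exposes a "problem_statement"
             | and a gold "patch" field; this is not the official Lite
             | script):
             | 
             |   import re
             | 
             |   def keep(instance):
             |       text = instance["problem_statement"]
             |       patch = instance["patch"]  # gold patch
             |       # images, hyperlinks, commit SHAs, issue/PR references
             |       if re.search(r"!\[|https?://|\b[0-9a-f]{7,40}\b|#\d+", text):
             |           return False
             |       if len(text.split()) < 40:         # short statement
             |           return False
             |       if patch.count("diff --git") > 1:  # edits > 1 file
             |           return False
             |       if patch.count("\n@@") > 3:        # > 3 edit hunks
             |           return False
             |       return True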
        
               | kevindamm wrote:
               | That's... rather limiting.
        
           | yuntong wrote:
           | The problem statement of each issue is included in each
           | result folder as `problem_statement.txt` (such as:
           | https://github.com/nus-apr/auto-code-rover/blob/main/results...).
           | 
           | The developer patch for each issue is similarly included as
           | `developer_patch.diff`.
        
         | yourapostasy wrote:
         | In short, no.
         | 
         | The arXiv paper mentions that the human developer must supply a
         | unit test that issues a pass-fail signal (the test could
         | conceivably be written with at least the assistance of an AI
         | agent, if not autonomously, but their experiment relies on the
         | former kind of unit test). So the 78% of failures are clearly
         | identified, at the cost of implementing TDD for the issue. The
         | side-effects story is punted on, but I'd still take this over
         | the nothing we have today.
         | 
         | Of course, after a relatively short time using this, I'd expect
         | the 22% success rate (or whatever the real rate is) to drop
         | asymptotically towards zero as the low-hanging fruit of the
         | approach is mined out and it becomes something like another
         | linter in our CI/CD pipelines.
         | 
         | The impact of this tooling upon staff skills development will
         | be interesting to say the least.
        
           | yuntong wrote:
           | AutoCodeRover does not require or assume a unit test to
           | generate patches. The results discussed in Section 6.1 of the
           | arXiv paper are generated without any unit test. The unit
           | tests are used by SWE-bench when evaluating the correctness
           | of AutoCodeRover-generated patches.
           | 
           | That being said, when some unit tests are available (either
           | written by developers or with assistance from other tools),
           | AutoCodeRover can make use of them to perform some analysis
           | like Spectrum-based Fault Localization (SBFL). This kind of
           | analysis output can help the agent in pinpointing locations
           | to fix. (Please see Section 6.2 for the analysis on SBFL.)
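           | 
           | (For readers unfamiliar with SBFL, a minimal sketch of the
           | idea using the common Ochiai formula is below; it illustrates
           | the general technique, not our exact implementation.)
           | 
           |   import math
           | 
           |   def ochiai(failed_cov, passed_cov, total_failed):
           |       # failed_cov / passed_cov: number of failing / passing
           |       # tests that execute a given program element.
           |       denom = math.sqrt(total_failed * (failed_cov + passed_cov))
           |       return failed_cov / denom if denom else 0.0
           | 
           |   # element -> (failing tests covering it, passing tests covering it)
           |   coverage = {"foo.py:42": (3, 1), "foo.py:57": (1, 9)}
           |   total_failed = 3
           |   ranked = sorted(coverage, reverse=True,
           |                   key=lambda e: ochiai(*coverage[e], total_failed))
           |   print(ranked)  # most suspicious program element first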
        
             | dboreham wrote:
             | > AutoCodeRover does not require or assume a unit test to
             | generate patches.
             | 
             | You have this backwards: it's traditional (at least in the
             | past 15 years or so) to have a test to go along with every
             | code change. The idea is that the test proves a) the bug
             | existed prior to the fix and b) the bug is not there after
             | the fix is applied. Commenters here are noting that ACR
             | generates fixes but not tests.
        
               | yuntong wrote:
               | The previous comment was describing the experimental
               | setup. AutoCodeRover currently generates patches.
               | Auto-generating high-quality tests can be a parallel
               | effort and another direction to explore. These efforts
               | can eventually be used together.
        
               | dboreham wrote:
               | The point is that a patch without a test is not generally
               | a useful thing. How do we know the AI generated patches
               | are valid?
        
               | nfm wrote:
               | I agree in principle, but if it also generated a test,
               | how would you know that was valid?
               | 
               | The value I get from Copilot is the ability to code
               | _faster_, not the ability _to_ code.
        
           | abhikrc wrote:
           | The short answer is that unit tests are not needed in
           | AutoCodeRover. The technique proceeds by a sophisticated code
           | search starting from the GitHub issue; tests are not needed.
           | The code search helps in setting the context for the LLM
           | agents, which in turn helps in the patch construction.
           | 
           | If tests are available, they can give additional help in
           | setting the code context. But tests are not needed, and most
           | of the GitHub issues are solved without tests.
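           | 
           | (As a rough illustration of the idea only, not our actual
           | implementation: a keyword-driven search over class and
           | function names in the repository might look like this.)
           | 
           |   import ast, pathlib
           | 
           |   def search_code(repo_dir, keywords):
           |       # Collect classes/functions whose names match keywords
           |       # drawn from the issue text, to build context for the LLM.
           |       hits = []
           |       for path in pathlib.Path(repo_dir).rglob("*.py"):
           |           try:
           |               tree = ast.parse(path.read_text(encoding="utf-8"))
           |           except (SyntaxError, UnicodeDecodeError):
           |               continue
           |           for node in ast.walk(tree):
           |               if not isinstance(node, (ast.ClassDef, ast.FunctionDef)):
           |                   continue
           |               if any(k.lower() in node.name.lower() for k in keywords):
           |                   hits.append((str(path), node.name, node.lineno))
           |       return hits
           | 
           |   # e.g. keywords extracted from the GitHub issue title/body
           |   print(search_code(".", ["serializer", "timezone"]))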
           | 
           | All experimental numbers appear in the arxiv paper. Please
           | let us know if you have more questions.
        
             | dboreham wrote:
             | > tests are not needed
             | 
             | Strong words!
        
         | wsdookadr wrote:
         | There are actual SWE jobs where humans sift through this kind
         | of noise. Someone told me they worked such a job recently.
         | It's a good tool to add pressure and raise expectations. Maybe
         | this is the future...
        
           | skywhopper wrote:
           | They only know the 22% number because unit tests to check for
           | a fix are included in the benchmark. In other words, in a
           | real-world situation, the human would still need to
           | double-check. The patches this tool generates do not include
           | appropriate tests or explanations and would never pass code
           | review by a qualified human.
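           | 
           | (Concretely, that benchmark-style check boils down to the
           | shape below; this is a sketch, not SWE-bench's actual
           | harness.)
           | 
           |   import subprocess
           | 
           |   def patch_resolves_issue(repo_dir, patch_file, test_ids):
           |       # Apply the candidate patch, then run the issue's tests.
           |       apply = subprocess.run(["git", "apply", patch_file],
           |                              cwd=repo_dir)
           |       if apply.returncode != 0:
           |           return False  # patch does not even apply
           |       tests = subprocess.run(["python", "-m", "pytest", *test_ids],
           |                              cwd=repo_dir)
           |       return tests.returncode == 0  # relevant tests now pass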
        
         | DalasNoin wrote:
         | Just about a week ago, OpenDevin got about 13% on this
         | benchmark. Just give it a few more weeks.
         | 
         | edit: apparently it's not the exact same benchmark but a
         | similar one.
        
           | moolcool wrote:
           | If it continues at this pace, then it'll solve 108% of GitHub
           | issues in just 3 months
        
         | Morelesshell wrote:
         | If a ticket is open and AutoCodeRover just says "was unable to
         | find something", it's still better to have 22% fixed
         | automatically.
        
           | skywhopper wrote:
           | But it doesn't say that. It submits a patch that doesn't
           | solve the problem instead.
        
           | maleldil wrote:
           | LLMs are unable to say that they don't know something. They
           | just generate nonsense.
        
         | skywhopper wrote:
         | Yes, and to be clear, the benchmark used here is merely the 300
         | simplest problems in the larger benchmark suite, which itself
         | is only a tiny subset of issues from a dozen large (and
         | presumably well-curated) Python projects.
         | 
         | Not to mention that making the code fix is only a tiny part of
         | resolving an issue. There should also be explanations and added
         | test cases. In other words, I doubt the 22% of "fixes" would
         | pass review by the project owner if a human submitted them.
        
       | mtlynch wrote:
       | > _As an example, AutoCodeRover successfully fixed issue #32347
       | of Django._
       | 
       | This bug was fixed three years ago in a one-line change.[0]
       | Presumably the fix was already in the training data.
       | 
       | [0] https://github.com/django/django/pull/13933
        
         | dboreham wrote:
         | I wondered that too, but the fix it produces is not the same.
         | 
         | Another thing that seemed odd is the English style used in the
         | responses (watch the video full screen and you can read it).
        
         | lewhoo wrote:
         | My understanding is that all of the issues on SWE-bench have
         | at least a corresponding pull request.
        
           | skywhopper wrote:
           | The important detail is when the problem was solved. If it
           | was three years ago, then it was likely captured in the
           | training data for the model.
        
       | withinboredom wrote:
       | I would be interested to see how it performs on end-user software
       | where bug reports are nebulous at best, ridiculous at worst.
       | Furthermore, most of those fixes tend to be for upstream bugs and
       | very rarely have anything to do with the actual software.
        
         | abhikrc wrote:
         | The entire setup is available for inspection; please see
         | 
         | https://github.com/nus-apr/auto-code-rover
         | 
         | If you need example bugs, we can provide those too. Some
         | examples also appear in the arXiv paper; please see
         | 
         | https://arxiv.org/abs/2404.05427
        
           | joenot443 wrote:
           | This is super fascinating stuff, excellent work. As most of
           | us don't have the time to read the entirety of the paper, are
           | you able to directly link to some issues which have been
           | landed and closed? Some personal favorites would be awesome
           | :)
           | 
           | I think I speak for others when I say the best way to judge
           | the efficacy of this project is some real-world, on-site
           | examples of it being used in prod. I'm especially curious
           | about its performance in feature-request or flaky bug-report
           | type issues as opposed to reliable test failures. I expect the
           | former is much tougher!
        
             | dboreham wrote:
             | FWIW, the example issue highlighted in the post was already
             | fixed by a human 3 years ago, so I wouldn't expect to see
             | much in the way of real-life fixed issues yet.
        
             | yuntong wrote:
             | Thank you for your interest. There are some interesting
             | examples in the SWE-bench-lite benchmark which are resolved
             | by AutoCodeRover:
             | 
             | - From sympy: https://github.com/sympy/sympy/issues/13643.
             | AutoCodeRover's patch for it:
             | https://github.com/nus-apr/auto-code-rover/blob/main/results...
             | 
             | - Another one from scikit-learn:
             | https://github.com/scikit-learn/scikit-learn/issues/13070.
             | AutoCodeRover's patch
             | (https://github.com/nus-apr/auto-code-rover/blob/main/results...)
             | modified a few lines below the developer patch's change and
             | wrote a different comment.
             | 
             | There are more examples in the results directory
             | (https://github.com/nus-apr/auto-code-rover/tree/main/results).
        
       | dboreham wrote:
       | What's actually going on here? I watched the video of the example
       | problem solution and it looks like either magic or fake. It
       | doesn't produce the same PR as the real bug fix.
        
         | abhikrc wrote:
         | The entire setup is available for inspection from
         | 
         | https://github.com/nus-apr/auto-code-rover
         | 
         | Please try it out and email the contact addresses on that
         | page if you have any questions.
        
           | dboreham wrote:
           | Ok thanks. I haven't run it yet, but this does tell me that
           | it's using OpenAI.
           | 
           | Is it expected to be able to solve arbitrary (simple) bugs,
           | or only the list of bugs in the benchmark set?
        
       | noname120 wrote:
       | The author published a ready-to-use Docker image:
       | https://hub.docker.com/r/yuntongzhang/auto-code-rover/tags
        
       | rrr_oh_man wrote:
       | So this works for repositories with decent unit tests.
       | 
       | Which excludes 80+% of real world bug and feature issues, in my
       | experience...
        
         | abhikrc wrote:
         | No, it does not need a unit test to work. We responded to a
         | similar question from another user.
        
           | prmph wrote:
           | Then how can you have confidence that it actually fixes the
           | bug? It means you still need a human to review the fix, no?
        
             | yuntong wrote:
             | The developer-written test cases are provided in
             | SWE-bench-lite, so those can be used to check the generated
             | patches.
             | 
             | The auto-generated patches are meant to reduce the effort of
             | resolving issues. In practice, they should be reviewed and
             | verified by human developers before they are integrated.
        
           | rrr_oh_man wrote:
           | Thank you for the clarification. And shame on me for talking
           | out of my ass!
        
       | helboi4 wrote:
       | 22% is a hilariously low percentage to use as a tagline. I do
       | hope it gets better.
        
         | cromka wrote:
         | 22% less issues for free is bad?
        
           | lionkor wrote:
           | No, but only 22% of tackled issues being resolved correctly
           | hints at how bad it is. I'd guess that within that 22%, most
           | of them have bugs or miss edge cases, considering it
           | completely failed to solve the other ~80%.
           | 
           | If someone gets 20% on an exam, I don't go "great, that's 20%
           | of the way there!!!"; instead I go "you clearly didn't attend,
           | try again next time".
        
             | bigyikes wrote:
             | > If someone gets 20% on an exam, I don't go "great, that's
             | 20% of the way there!!!"; instead I go "you clearly didn't
             | attend, try again next time".
             | 
             | Sure, if little Bobby gets a 20% I'll whoop his ass, but if
             | the inanimate hunk of metal on my desk gets a 20% I might
             | start to take notice.
        
               | helboi4 wrote:
               | Sure, it's sorta cool and promising for technology in the
               | future. But as of now, we don't know how badly it might
               | fuck up the other 78% of cases. If it fucks up like 50%
               | of cases so badly that it takes more time for the devs to
               | fix than it usually would, then it's a liability.
        
           | dboreham wrote:
           | Fewer.
        
           | adastra22 wrote:
           | It's not free.
        
       | pnathan wrote:
       | How well does AutoCodeRover work in relation to compiled
       | languages such as Java, Go, or Rust?
       | 
       | The local code search idea to get around context limits is cool.
       | Have you experimented with Anthropic's models for the larger
       | context limit and dropping the code search?
        
       | bilater wrote:
       | Excited to see how badly the comments here age over the next few
       | months.
        
       ___________________________________________________________________
       (page generated 2024-04-09 23:01 UTC)