[HN Gopher] Princeton group open sources "SWE-agent", with 12% f...
       ___________________________________________________________________
        
       Princeton group open sources "SWE-agent", with 12% fix rate for
       GitHub issues
        
       Author : asteroidz
       Score  : 140 points
       Date   : 2024-04-02 16:16 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | bwestergard wrote:
        | Friendly suggestion to the authors: success rates aren't
        | meaningful to anyone but a handful of researchers. They should
        | add a few examples of tests SWE-agent passed and failed to the
        | README.
        
         | nyrikki wrote:
         | Yes please, the code quality on Devin was incredibly poor in
         | all examples I traced down.
         | 
         | At least from a maintainability perspective.
         | 
          | I would like to see whether this implementation is less
          | destructive, or at least more suitable for a red-green-refactor
          | workflow.
        
         | NegativeLatency wrote:
         | Unless you weren't actually that successful but need to publish
         | a "successful" result
        
       | mdaniel wrote:
       | I think that "Demo" link is just an extremely annoying version of
       | an HTML presentation, so they could save me a shitload of
       | clicking if they just dumped their presentation out to a PDF or
       | whatever so I could read faster than watching it type out text as
       | if it was live. It also whines a lot on the console about its
       | inability to connect to a websocket server on 3000 but I don't
       | know what it would do with a websocket connection if had it
        
       | matthewaveryusa wrote:
       | Very neat. Uses the langchain method, here are some of the
       | prompts:
       | 
       | https://github.com/princeton-nlp/SWE-agent/blob/main/config/...
        
         | toddmorey wrote:
         | I'm always fascinated to read the system prompts & I always
         | wonder what sort of gains can be made optimizing them further.
         | 
         | Once I'm back on desktop I want to look at the gut history of
         | this file.
        
           | clement_b wrote:
           | I have a git feeling this comment was written on mobile.
        
       | iLoveOncall wrote:
       | And creates how many new ones?
       | 
       | This and Devin generate garbage code that will make any codebase
       | worse.
       | 
       | It's a joke that 12.5% is even associated with the word
       | "success".
        
         | 1letterunixname wrote:
         | Do spaces and spelling fixes count?
         | 
          | Copilot, so far, is only good for predicting the next bit of
          | code that follows similar patterns.
        
       | noncoml wrote:
        | If you are afraid that LLMs will replace you at your job, ask an
        | LLM to write Rust code that reads a UTF-8 file character by
        | character.
       | 
        | Edit: Yes, it does write some code that is "close" enough, but
        | in some cases it is wrong, and in others it doesn't do exactly
        | what was asked. I.e. it needs supervision from someone who
        | understands the requirements, the code, and the problems that
        | may arise from the naive line the LLM is taking. Mind you, the
        | more popular the issue, the better the line the LLM takes. So in
        | other words, IMHO it is a glorified Stack Overflow. Just as
        | there are engineers who copy-paste from SO without having any
        | idea what the code does, there will be engineers who just copy-
        | paste from an LLM. Their work will be much better than if they
        | used SO, but I think it's still nowhere near the mark of a
        | senior SWE and above.
        
         | iwontberude wrote:
         | Hypothetically, which ticker symbols would you buy put
         | contracts on, at what strike prices, and at what expiration
         | dates? As far as I can tell, a lot of people are betting a lot
         | of money that you are wrong, but actually I think you are
         | right.
        
           | jeremyjh wrote:
           | The most relevant companies focused on this aren't publicly
           | traded. The ones that are publicly traded like MSFT have way
           | too many other factors affecting their value - not to mention
            | the fact that they'll make money on generative AI that has
            | nothing to do with coding, regardless of whether an SWE-agent
            | ever works.
        
           | noncoml wrote:
            | Ugh, I am not claiming that LLMs are not a great innovation.
            | Just that they are not going to replace SWE jobs in our
            | (maybe my) lifetime.
        
         | vineyardmike wrote:
          | Yeah, the problem with that is the control group: grab any SWE
          | and ask them the same thing. I don't think most would pass.
          | Unless you want to give the SWE time to learn... then it's
          | hardly fair. And I vaguely trust the LLM to be able to learn
          | it too.
          | 
          | Also, I just asked Claude and Gemini, and they both provided
          | an implementation that matches the "bytes to UTF-8" Rust docs.
          | Assuming those are right, LLMs can do this (but I haven't
          | tested the code).
         | 
         | https://doc.rust-lang.org/std/string/struct.String.html
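          | 
          | (For reference, a minimal use of that std API: String::from_utf8
          | converts a byte vector into a String, erroring on invalid
          | UTF-8. The byte values below are the example from the std
          | docs:)
          | 
          |       fn main() {
          |           // UTF-8 encoding of the sparkle heart emoji, per the std docs.
          |           let sparkle_heart = vec![240, 159, 146, 150];
          |           let s = String::from_utf8(sparkle_heart).expect("invalid UTF-8");
          |           println!("{}", s); // prints the sparkle heart
          |       }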
        
         | raggi wrote:
          | it does an ok job with this task:
          | 
          |       use std::fs::File;
          |       use std::io::{self, BufReader, Read};
          | 
          |       fn read_file_character_by_character(path: &str) -> io::Result<()> {
          |           // Open the file in read-only mode.
          |           let file = File::open(path)?;
          |           // Create a buffered reader to read the file more efficiently.
          |           let reader = BufReader::new(file);
          | 
          |           // `chars` method returns an iterator over the characters of the input.
          |           // Note that it returns a Result<(char, usize), io::Error>, where usize
          |           // is the byte length of the char.
          |           for char_result in reader.chars() {
          |               match char_result {
          |                   Ok(c) => print!("{}", c),
          |                   Err(e) => return Err(e),
          |               }
          |           }
          |           Ok(())
          |       }
          | 
          |       fn main() {
          |           let path = "path/to/your/file.txt";
          | 
          |           if let Err(e) = read_file_character_by_character(path) {
          |               eprintln!("Error reading file: {}", e);
          |           }
          |       }
        
           | woodruffw wrote:
           | If we're being sticklers, this isn't reading character-by-
           | character: it's performing a buffered read, which then gets
           | iterated over.
        
             | raggi wrote:
              | here's what it does if i feed your nit pick into it
              | verbatim:
              | 
              |       use std::fs::File;
              |       use std::io::{self, Read};
              |       use std::str::from_utf8;
              | 
              |       fn read_file_character_by_character(path: &str) -> io::Result<()> {
              |           // Open the file in read-only mode.
              |           let mut file = File::open(path)?;
              |           let mut buffer = [0; 4]; // UTF-8 character max byte length
              |           let mut temp_buffer = vec![];
              | 
              |           while let Ok(bytes_read) = file.read(&mut buffer[..1]) { // Read byte by byte
              |               if bytes_read == 0 { // End of file
              |                   break;
              |               }
              |               temp_buffer.push(buffer[0]);
              | 
              |               if let Ok(s) = from_utf8(&temp_buffer) {
              |                   if s.chars().count() == 1 { // Complete character read
              |                       print!("{}", s);
              |                       temp_buffer.clear(); // Clear temp buffer for next character
              |                   }
              |               }
              |           }
              | 
              |           Ok(())
              |       }
              | 
              |       fn main() {
              |           let path = "path/to/your/file.txt";
              | 
              |           if let Err(e) = read_file_character_by_character(path) {
              |               eprintln!("Error reading file: {}", e);
              |           }
              |       }
        
               | comex wrote:
               | Unlike the original version, this version compiles and
               | seems to basically work correctly. However, the design is
               | misleading: `buffer` is declared as an array of 4 bytes
               | but only the first byte is ever used. The code also has
               | suboptimal performance and error handling, though that's
               | not the end of the world.
        
               | raggi wrote:
               | all true, as I said in another fork of the thread, this
               | comes down to part of what humans will still be valuable
               | for in this loop: distilling poor requirements into
               | better requirements.
        
             | noncoml wrote:
              | I wouldn't say it's a nit. The file may be tens of GB; do
              | you want to read all of that into a string?
        
             | deathanatos wrote:
              | The original prompt is a bit underspecified. (But hey, that
              | certainly matches the real world!)
              | 
              | You're going to have to buffer at least a little to figure
              | out where the USV / grapheme boundary is, depending on our
              | definition of "character". To me, a BufReader is
              | appropriate here; it avoids lots of tiny reads to the
              | kernel, which is _probably_ the right behavior in a real
              | case.
              | 
              | To me, "read character by character" vaguely implies
              | something that's going to yield a stream of characters.
              | (Again, for some definition thereof.)
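              | 
              | (A tiny illustration of why the definition matters: one
              | user-perceived character can be several chars, i.e. Unicode
              | scalar values:)
              | 
              |       fn main() {
              |           let s = "e\u{301}"; // 'e' + combining acute accent; renders as one glyph
              |           println!("{}", s.chars().count()); // 2 chars (scalar values)
              |           println!("{}", s.len());           // 3 bytes
              |           // Counting grapheme clusters needs an external crate,
              |           // e.g. unicode-segmentation:
              |           // use unicode_segmentation::UnicodeSegmentation;
              |           // assert_eq!(s.graphemes(true).count(), 1);
              |       }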
        
           | raggi wrote:
           | fwiw, the benchmark that matters really has nothing to do
           | with authoring code.
           | 
            | the typing of code is the easy part, even though it's the
            | part a lot of folks are somewhat addicted to.
           | 
           | the things which have far more value are applying value
           | judgements to requirements, correlating and incorporating
           | sparse and inaccurate diagnostic information into a coherent
           | debugging strategy, and so on. there will come a time when it
           | can assist with these too, probably first on requirements
           | distillation, but for more complex debugging tasks that's a
           | novel problem solving area that we've yet to see substantial
           | movement on.
           | 
           | so if you want to stave off the robots coming for you, get
           | good at debugging hard problems, and learn to make really
           | great use of tools that accelerate the typing out of
           | solutions to baseline product requirements.
        
           | comex wrote:
           | Only problem is that the critical `chars` method doesn't
           | actually exist. Rust's standard library has a `chars` method
           | for strings, but not for Readers.
           | 
           | (Also, the comment about the iterator element type is
           | inconsistent with the code following it. Based on the
           | comment, `c` would be of type `(char, usize)`, but then
           | trying to print it with {} would fail because tuples don't
           | implement Display.)
        
             | raggi wrote:
              | good catch. feeding it the error output of rustc it then
              | produces:
              | 
              |       use std::fs::File;
              |       use std::io::{self, Read};
              | 
              |       fn read_file_character_by_character(path: &str) -> io::Result<()> {
              |           let mut file = File::open(path)?;
              |           let mut contents = String::new();
              | 
              |           file.read_to_string(&mut contents)?;
              | 
              |           for c in contents.chars() {
              |               println!("{}", c);
              |           }
              | 
              |           Ok(())
              |       }
              | 
              |       fn main() {
              |           let path = "path/to/your/file.txt";
              | 
              |           if let Err(e) = read_file_character_by_character(path) {
              |               eprintln!("Error reading file: {}", e);
              |           }
              |       }
        
               | phaer wrote:
                | But this doesn't read the file char by char; it buffers
                | the whole thing into a string.
        
               | scottlamb wrote:
               | What would you expect? There's no OS API for "read one
               | character", except in say ASCII where 1 byte = 1 code
               | point = 1 character. And it'd be hideously inefficient
               | anyway. So you either loop over getting the next N bytes
               | and getting all complete characters so far (with some
               | extra complexity around characters that cross chunk
               | boundaries) or you read the whole thing into a single
               | buffer and iterate the characters. It does the latter. If
               | it doesn't have the ability to respond by asking
               | requirements questions, I'd consider either choice valid.
               | 
               | Of course, in real life, I do expect to get requirements
               | questions back from an engineer when I assign a task.
               | Seems more practical than anticipating everything up-
               | front into the perfect specification/prompt. Why
               | shouldn't I expect the same from an LLM-based tool? Are
               | any of them set up to do that?
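                | 
                | (For the curious, a minimal sketch of the chunked
                | approach just described; `stream_chars` is a hypothetical
                | helper, and the 4 KiB chunk size is an arbitrary choice:)
                | 
                |       use std::fs::File;
                |       use std::io::{self, Read};
                | 
                |       fn stream_chars(path: &str) -> io::Result<()> {
                |           let mut file = File::open(path)?;
                |           let mut buf = [0u8; 4096];
                |           let mut pending: Vec<u8> = Vec::new(); // partial char from previous chunk
                | 
                |           loop {
                |               let n = file.read(&mut buf)?;
                |               if n == 0 {
                |                   break; // EOF
                |               }
                |               pending.extend_from_slice(&buf[..n]);
                | 
                |               // valid_up_to() marks where a trailing incomplete sequence
                |               // begins; error_len() == Some(_) means genuinely invalid
                |               // bytes rather than a char cut off by the chunk boundary.
                |               let valid = match std::str::from_utf8(&pending) {
                |                   Ok(s) => s.len(),
                |                   Err(e) if e.error_len().is_none() => e.valid_up_to(),
                |                   Err(_) => return Err(io::Error::new(
                |                       io::ErrorKind::InvalidData, "invalid UTF-8")),
                |               };
                |               for c in std::str::from_utf8(&pending[..valid]).unwrap().chars() {
                |                   print!("{}", c);
                |               }
                |               pending.drain(..valid); // keep only the trailing partial bytes
                |           }
                | 
                |           if pending.is_empty() {
                |               Ok(())
                |           } else {
                |               Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated UTF-8"))
                |           }
                |       }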
        
               | deathanatos wrote:
               | On errors, it exits with success.
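                | 
                | (One way to propagate the failure, as a sketch reusing
                | the read_file_character_by_character function above;
                | std::process::ExitCode has been in std since Rust 1.61:)
                | 
                |       use std::process::ExitCode;
                | 
                |       fn main() -> ExitCode {
                |           let path = "path/to/your/file.txt";
                |           if let Err(e) = read_file_character_by_character(path) {
                |               eprintln!("Error reading file: {}", e);
                |               return ExitCode::FAILURE; // nonzero exit status
                |           }
                |           ExitCode::SUCCESS
                |       }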
        
         | DabbyDabberson wrote:
          | The way I see it, it's undetermined whether generative AI will
          | be able to fully do a SWE job.
          | 
          | But for most of the debates I've seen, I don't think the
          | answer matters all that much.
          | 
          | Once we have models that can act as full senior SWEs, the
          | models can engineer the models. And then we've hit the
          | recursive case.
          | 
          | Once models can engineer models better and faster than humans,
          | all bets are off. It's the foggy future. It's the singularity.
        
           | dvt wrote:
           | > Once we have models that can act as full senior SWEs.. the
           | models can engineer the models.
           | 
            | This is such an extremely bullish case, I'm not sure why
            | you'd think this is even remotely possible. A Google search
            | is usually more valuable than ChatGPT. For example, the Rust
            | UTF-8 example is already solved verbatim on Reddit:
            | https://www.reddit.com/r/rust/comments/l5m1rw/how_can_i_effi...
        
       | lispisok wrote:
        | Their demo is so similar to the Devin one that I had to go look
        | up the Devin demo to check I wasn't watching the same one. I
        | feel like there might be a reason they both picked SymPy. Also,
        | I rarely put weight on demos; they are usually cherry-picked at
        | best and outright fabricated at worst. I want to hear what third
        | parties have to say after trying these things.
        
         | lewhoo wrote:
         | Maybe that's the point of this research. Hey look, we
         | reproduced the way to game the stats a bit. I really can't tell
         | anymore.
        
       | Frummy wrote:
        | Interesting idea to provide the Agent-Computer Interface for it
        | to scroll and such, so it can interact more easily from its
        | perspective.
        
         | aussieguy1234 wrote:
          | Similar to how early computers didn't have enough RAM to
          | display a whole text file, so old programmers had to work with
          | parts of the file at a time. It's not a bad way to get around
          | the context window problem, which is a similar constraint.
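          | 
          | (A hypothetical sketch of that kind of windowed view: render
          | only a numbered slice of the file so a bounded context can
          | still page through it. `view_window` and its parameters are
          | invented for illustration:)
          | 
          |       fn view_window(text: &str, offset: usize, window: usize) -> String {
          |           text.lines()
          |               .enumerate()
          |               .skip(offset)
          |               .take(window)
          |               .map(|(i, line)| format!("{}: {}\n", i + 1, line))
          |               .collect()
          |       }
          | 
          |       fn main() {
          |           let text = "alpha\nbeta\ngamma\ndelta";
          |           print!("{}", view_window(text, 1, 2)); // "2: beta" and "3: gamma"
          |       }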
        
       | unit_circle wrote:
       | A 1/8 chance of fixing a bug at the cost of a careful review and
       | some corrections is not bad.
       | 
        | 0% -> 12% improvement is not bad for two years either (I'm
        | somewhat arbitrarily picking the release date of ChatGPT). If this
       | can be kept up for a few years we will have some extremely useful
       | tooling. The cost can be relatively high as well, since
       | engineering time is currently orders of magnitude more expensive
       | than these tools.
        
         | golergka wrote:
         | It's still abysmal from POV of actually using it in production,
         | but it's a very impressive rate of improvement. Given what
         | happened with LLMs and image generation in the last few years,
         | we can probably assume that these systems will be able to fix
         | most trivial bugs pretty soon.
        
         | stefan_ wrote:
         | These ,,benchmark" are tuned around reporting some exciting
         | result, once you look inside, all the ,,fixes" are trash.
        
         | blharr wrote:
          | I still don't know. I feel like there are many cases where GPT
          | will write some code or fix a bug in a way that makes it
          | significantly harder to debug. Even for relatively simple
          | tasks, it produces the kind of machine-generated code that I
          | would not want to touch.
        
           | WanderPanda wrote:
            | It is a bit worrisome, but we manage to deal with subpar
            | human code as well. Often the boilerplate generated by
            | ChatGPT is already better than what an inexperienced coder
            | would string together. I'm sure it will not be a free lunch,
            | but the benefits will probably outweigh the downsides.
            | 
            | Interesting scalability questions will arise wrt security
            | when scaling the already unmanageably large code bases by
            | another order of magnitude (or two), though.
        
       | anotherpaulg wrote:
       | Very cool project!
       | 
        | I've experimented in this direction previously, but found that
        | agentic behavior is often chaotic and leads to long, expensive
        | sessions that go down the wrong rabbit hole and ultimately fail.
       | 
       | It's great that you succeed on 12% of swe-bench, but what happens
       | the other 88% of the time? Is it useless wasted work and token
       | costs? Or does it make useful progress that can be salvaged?
       | 
       | Also, I think swe-bench is from your group, right? Have you made
       | any attempt to characterize a "skilled human upper bound" score?
       | 
        | I randomly sampled a dozen swe-bench tasks myself, and found that
        | many were basically impossible for a skilled human to "solve",
        | mainly because the tasks were underspecified wrt the hidden test
        | cases that determine passing. The tests were checking
        | implementation-specific details from the repo's PR that weren't
        | actually stated requirements of the task.
        
         | a_wild_dandan wrote:
         | Personally, I'd just use one of my local MacBook models (e.g.
         | Mixtral 8x7b) and forget about any wasted branches & cents. My
         | debugging time costs many orders of magnitude more than SWE-
         | agent, so even a 5% backlog savings would be spectacular!
        
           | ein0p wrote:
            | I've tried this with another, similar system. FOSS LLMs,
            | including Mixtral, are currently too weak to handle something
            | like this. For me they run out of steam after only a few
            | turns and start going in circles unproductively.
        
           | swatcoder wrote:
           | > My debugging time costs many orders of magnitude more than
           | SWE-agent
           | 
           | Unless your job is primarily to clean up somebody else's
           | mess, your debugging time is a key part of a career-long
           | feedback loop that improves your craft. Be careful not to
            | shrug it off as something less. Many, many people are
            | spending a lot of money to let you forget it, and once you
            | do, you'll be right there in the ranks of the cheaply
            | replaceable.
           | 
           | (And on the odd chance that cleaning up other people's mess
           | _is_ your job, you should probably be the one doing it; and
           | for largely the same reasons)
        
           | Aperocky wrote:
            | That's assuming the other 95% stays the same with this new
            | agent (vs. creating more work for you, since you now also
            | have to parse what the model is saying).
        
       | trebligdivad wrote:
        | So this issues arbitrary shell commands based on trying to
        | understand untrusted bug text? Should be fun waiting until
        | someone finds an escape.
        
       | sumeruchat wrote:
       | Once we have this fully automated, any good developer could have
       | a team of 100 robo SWEs and ship like crazy. The real competition
       | is with those devs not with the bots.
        
         | recursive wrote:
         | Shipping like crazy isn't useful by itself. Shipping non-
         | garbage and being able to maintain it still has _some_ value.
        
       | rwmj wrote:
       | Do we know how much extra work it created for the real people who
       | had to review the proposed fixes?
        
         | r0ze-at-hn wrote:
         | Ah, well let me tell you about my pull request reviewer LLM
         | project.
        
       | aussieguy1234 wrote:
       | 12% fix rate = 88% bug rate
        
         | mlcrypto wrote:
          | Yep. After xz, we don't need a bot mindlessly applying every
          | "fix" suggested by malicious actors.
        
       | paradite wrote:
        | For anyone who didn't bother looking deeper: the SWE-bench
        | benchmark contains only Python projects, so it is not
        | representative of all programming languages and frameworks.
        | 
        | I'm working on a more general SWE task eval framework in JS for
        | arbitrary languages and frameworks (starting with JS/TS, SQL and
        | Python), for my own prompt engineering product.
       | 
       | Hit me up if you are interested.
        
       | danenania wrote:
       | I'm working on a somewhat similar project:
       | https://github.com/plandex-ai/plandex
       | 
       | While the overall goal is to build arbitrarily large, complex
       | features and projects that are too much for ChatGPT or IDE-based
       | tools, another aspect that I've put a lot of focus on is how to
       | handle mistakes and corrections when the model starts going off
        | the rails. Changes are accumulated in a protected sandbox
        | separate from your project files; a diff review TUI allows bad
        | changes to be rejected; all actions are version-controlled so
        | you can easily go back and try a different approach; and
        | branches are included for trying out multiple approaches.
       | 
       | I think nailing this developer-AI feedback loop is the key to
       | getting authentic productivity gains. We shouldn't just ask how
       | well a coding tool can pass benchmarks, but what the failure case
       | looks like when things go wrong.
        
         | etheridev wrote:
         | You need to make yourself a business analyst agent to provide
         | the feedback! To make it real, perhaps a team of them with
         | conflicting personalities.
        
           | danenania wrote:
            | I think we'll get there at some point, but one thing I've
            | learned from this project is how difficult it is to stack AI
            | interactions. Each new bit of AI-based logic tends to fail
            | terribly at first; only after a long period of intense
            | testing and iteration does it become remotely usable. And
            | the more kinds of tasks you combine, the more difficult it
            | gets.
        
       | dimal wrote:
       | The demo shows a very clearly written bug report about a matrix
       | operation that's producing an unexpected output. Umm... no. Most
        | bug reports you get in the wild are more along the lines of "I
        | clicked on X and Y happened"; if you're lucky, they'll add "and
        | I expected Z". Usually the Z expectation is left for the reader
        | to fill in, because as human users we understand the
        | expectations.
       | 
       | The difficulty in fixing a bug is in figuring out what's causing
       | the bug. If you know it's caused by an incorrect operation, and
       | we know that LLMs can fix simple defects like this, what does
       | this prove?
       | 
       | Has anyone dug through the paper yet to see what the rest of the
       | issues look like? And what the diffs look like? I suppose I'll
       | try when I have a sec.
        
         | drcode wrote:
         | > Most bug reports you get in the wild are more along the lines
         | of
         | 
          | Since this fixes 12% of the bugs, the authors of the paper
          | probably agree with you that 100 - 12 = 88%, and hence that
          | "most bugs" don't have nicely written bug reports.
        
       | readthenotes1 wrote:
       | I made a lot of money as I was paid hourly while working with a
       | cadre of people I called "the defect generators".
       | 
       | I'm kind of sad that future generations will not have that
       | experience...
        
       ___________________________________________________________________
       (page generated 2024-04-02 23:00 UTC)