[HN Gopher] Princeton group open sources "SWE-agent", with 12% f...
___________________________________________________________________
Princeton group open sources "SWE-agent", with 12% fix rate for
GitHub issues
Author : asteroidz
Score : 140 points
Date : 2024-04-02 16:16 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bwestergard wrote:
| Friendly suggestion to the authors: success rates aren't
| meaningful to anyone but a handful of researchers. They should
| add a few examples of tests SWE-agent passed and did not pass to
| the README.
| nyrikki wrote:
| Yes please, the code quality on Devin was incredibly poor in
| all the examples I traced through.
|
| At least from a maintainability perspective.
|
| I would like to see if this implementation is less destructive
| or at least more suitable for a red-green-refactor workflow.
| NegativeLatency wrote:
| Unless you weren't actually that successful but need to publish
| a "successful" result
| mdaniel wrote:
| I think that "Demo" link is just an extremely annoying version of
| an HTML presentation, so they could save me a shitload of
| clicking if they just dumped their presentation out to a PDF or
| whatever so I could read faster than watching it type out text as
| if it was live. It also whines a lot on the console about its
| inability to connect to a websocket server on 3000 but I don't
| know what it would do with a websocket connection if had it
| matthewaveryusa wrote:
| Very neat. Uses the langchain method; here are some of the
| prompts:
|
| https://github.com/princeton-nlp/SWE-agent/blob/main/config/...
| toddmorey wrote:
| I'm always fascinated to read the system prompts & I always
| wonder what sort of gains can be made optimizing them further.
|
| Once I'm back on desktop I want to look at the gut history of
| this file.
| clement_b wrote:
| I have a git feeling this comment was written on mobile.
| iLoveOncall wrote:
| And creates how many new ones?
|
| This and Devin generate garbage code that will make any codebase
| worse.
|
| It's a joke that 12.5% is even associated with the word
| "success".
| 1letterunixname wrote:
| Do spaces and spelling fixes count?
|
| Copilot, so far, is only good for predicting the next bit of
| code in patterns similar to what it has already seen.
| noncoml wrote:
| If you are afraid that LLMs will replace you at your job, ask an
| LLM to write Rust code for reading a UTF-8 file character by
| character.
|
| Edit: Yes, it does write some code that is "close" enough, but in
| some cases it is wrong, and in others it doesn't do exactly what
| was asked. I.e., it needs supervision from someone who
| understands the requirements, the code, and the problems that may
| arise from the naive line the LLM is taking. Mind you, the more
| popular the issue, the better the line the LLM takes. So in other
| words, IMHO it is a glorified Stack Overflow. Just as there are
| engineers who copy-paste from SO without having any idea what the
| code does, there will be engineers who will just copy-paste from
| the LLM. Their work will be much better than if they used SO, but
| I think it's still nowhere near the mark of a senior SWE and
| above.
| iwontberude wrote:
| Hypothetically, which ticker symbols would you buy put
| contracts on, at what strike prices, and at what expiration
| dates? As far as I can tell, a lot of people are betting a lot
| of money that you are wrong, but actually I think you are
| right.
| jeremyjh wrote:
| The most relevant companies focused on this aren't publicly
| traded. The ones that are publicly traded, like MSFT, have way
| too many other factors affecting their value - not to mention
| the fact that they'll make money on generative AI that has
| nothing to do with coding, regardless of whether an SWE-agent
| ever works.
| noncoml wrote:
| Ugh, I am not claiming that LLMs are not a great innovation.
| Just that they are not going to replace SWE jobs in our (maybe
| my) lifetime.
| vineyardmike wrote:
| Yea, the problem with that is the control group - grab any SWE
| and ask them the same thing. I don't think most would pass.
| Unless you want to give an SWE time to learn... then it's
| hardly fair. And I vaguely trust the LLM to be able to learn it
| too.
|
| Also, I just asked Claude and Gemini, and they both provided an
| implementation that matches the "bytes to UTF-8" Rust docs.
| Assuming those are right, LLMs can do this (but I haven't tested
| the code).
|
| https://doc.rust-lang.org/std/string/struct.String.html
| raggi wrote:
| it does an ok job with this task:
|
|     use std::fs::File;
|     use std::io::{self, BufReader, Read};
|
|     fn read_file_character_by_character(path: &str) -> io::Result<()> {
|         // Open the file in read-only mode.
|         let file = File::open(path)?;
|         // Create a buffered reader to read the file more efficiently.
|         let reader = BufReader::new(file);
|         // `chars` method returns an iterator over the characters of the input.
|         // Note that it returns a Result<(char, usize), io::Error>,
|         // where usize is the byte length of the char.
|         for char_result in reader.chars() {
|             match char_result {
|                 Ok(c) => print!("{}", c),
|                 Err(e) => return Err(e),
|             }
|         }
|         Ok(())
|     }
|
|     fn main() {
|         let path = "path/to/your/file.txt";
|         if let Err(e) = read_file_character_by_character(path) {
|             eprintln!("Error reading file: {}", e);
|         }
|     }
| woodruffw wrote:
| If we're being sticklers, this isn't reading character-by-
| character: it's performing a buffered read, which then gets
| iterated over.
| raggi wrote:
| here's what it does if i feed your nit pick into it verbatim:
|
|     use std::fs::File;
|     use std::io::{self, Read};
|     use std::str::from_utf8;
|
|     fn read_file_character_by_character(path: &str) -> io::Result<()> {
|         // Open the file in read-only mode.
|         let mut file = File::open(path)?;
|         let mut buffer = [0; 4]; // UTF-8 character max byte length
|         let mut temp_buffer = vec![];
|         while let Ok(bytes_read) = file.read(&mut buffer[..1]) { // Read byte by byte
|             if bytes_read == 0 { // End of file
|                 break;
|             }
|             temp_buffer.push(buffer[0]);
|             if let Ok(s) = from_utf8(&temp_buffer) {
|                 if s.chars().count() == 1 { // Complete character read
|                     print!("{}", s);
|                     temp_buffer.clear(); // Clear temp buffer for next character
|                 }
|             }
|         }
|         Ok(())
|     }
|
|     fn main() {
|         let path = "path/to/your/file.txt";
|         if let Err(e) = read_file_character_by_character(path) {
|             eprintln!("Error reading file: {}", e);
|         }
|     }
| comex wrote:
| Unlike the original version, this version compiles and
| seems to basically work correctly. However, the design is
| misleading: `buffer` is declared as an array of 4 bytes
| but only the first byte is ever used. The code also has
| suboptimal performance and error handling, though that's
| not the end of the world.
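|
| For illustration, a version without the misleading array, and
| with read errors propagated instead of silently ending the loop,
| might look something like this untested sketch (still unbuffered,
| so still slow):
|
|     use std::fs::File;
|     use std::io::{self, Read};
|     use std::str::from_utf8;
|
|     fn read_chars(path: &str) -> io::Result<()> {
|         let mut file = File::open(path)?;
|         let mut byte = [0u8; 1];      // only one byte is ever read at a time
|         let mut pending = Vec::new(); // bytes of a partially decoded char
|         while file.read(&mut byte)? > 0 { // `?` surfaces I/O errors
|             pending.push(byte[0]);
|             if let Ok(s) = from_utf8(&pending) {
|                 print!("{}", s); // a complete character has accumulated
|                 pending.clear();
|             }
|             // NB: a robust version would bail out once `pending`
|             // exceeds 4 bytes, since that would mean invalid UTF-8.
|         }
|         Ok(())
|     }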
| raggi wrote:
| all true; as I said in another fork of the thread, this
| comes down to part of what humans will still be valuable
| for in this loop: distilling poor requirements into
| better requirements.
| noncoml wrote:
| I wouldn't say it's a nit. The file may be 10s of GB. Do
| you want to read it to a string?
| deathanatos wrote:
| The original prompt is a bit under-specified. (But hey,
| that certainly matches the real world!)
|
| You're going to have to buffer at least a little, to figure
| out where the USV / grapheme boundary is, depending on our
| definition of "character". To me, a BufReader is
| appropriate here; it avoids lots of tiny reads to the
| kernel, which is _probably_ the right behavior in a real
| case.
|
| To me, "read character by character" vaguely implies
| something that's going to yield a stream of characters.
| (Again, for some definition there.)
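|
| Roughly the shape I have in mind, as an untested sketch (taking
| "character" to mean a Rust `char`, i.e. a USV):
|
|     use std::io::{self, BufReader, Read};
|
|     /// Yields chars one at a time; the BufReader underneath
|     /// keeps the actual reads to the kernel chunked.
|     struct Chars<R: Read> {
|         inner: BufReader<R>,
|     }
|
|     impl<R: Read> Iterator for Chars<R> {
|         type Item = io::Result<char>;
|
|         fn next(&mut self) -> Option<Self::Item> {
|             let mut bytes = [0u8; 4]; // a UTF-8 char is at most 4 bytes
|             for i in 0..4 {
|                 match self.inner.read(&mut bytes[i..i + 1]) {
|                     Ok(0) if i == 0 => return None, // clean EOF
|                     Ok(0) => break,                 // EOF mid-character
|                     Ok(_) => {
|                         // A valid prefix means a whole char has arrived.
|                         if let Ok(s) = std::str::from_utf8(&bytes[..i + 1]) {
|                             return s.chars().next().map(Ok);
|                         }
|                         // Otherwise keep collecting continuation bytes.
|                     }
|                     Err(e) => return Some(Err(e)),
|                 }
|             }
|             Some(Err(io::Error::new(
|                 io::ErrorKind::InvalidData,
|                 "invalid or truncated UTF-8",
|             )))
|         }
|     }
|
| Construct it as `Chars { inner: BufReader::new(File::open(path)?) }`
| and you get both the chunked syscalls and a per-character stream.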
| raggi wrote:
| fwiw, the benchmark that matters really has nothing to do
| with authoring code.
|
| the typing of code is the easy part even though it's a part a
| lot of folks are somewhat addicted to.
|
| the things which have far more value are applying value
| judgements to requirements, correlating and incorporating
| sparse and inaccurate diagnostic information into a coherent
| debugging strategy, and so on. there will come a time when it
| can assist with these too, probably first on requirements
| distillation, but for more complex debugging tasks that's a
| novel problem solving area that we've yet to see substantial
| movement on.
|
| so if you want to stave off the robots coming for you, get
| good at debugging hard problems, and learn to make really
| great use of tools that accelerate the typing out of
| solutions to baseline product requirements.
| comex wrote:
| Only problem is that the critical `chars` method doesn't
| actually exist. Rust's standard library has a `chars` method
| for strings, but not for Readers.
|
| (Also, the comment about the iterator element type is
| inconsistent with the code following it. Based on the
| comment, `c` would be of type `(char, usize)`, but then
| trying to print it with {} would fail because tuples don't
| implement Display.)
| raggi wrote:
| good catch. feeding it the error output of rustc it then
| produces:
|
|     use std::fs::File;
|     use std::io::{self, Read};
|
|     fn read_file_character_by_character(path: &str) -> io::Result<()> {
|         let mut file = File::open(path)?;
|         let mut contents = String::new();
|         file.read_to_string(&mut contents)?;
|         for c in contents.chars() {
|             println!("{}", c);
|         }
|         Ok(())
|     }
|
|     fn main() {
|         let path = "path/to/your/file.txt";
|         if let Err(e) = read_file_character_by_character(path) {
|             eprintln!("Error reading file: {}", e);
|         }
|     }
| phaer wrote:
| But this doesn't read the file char by char; it uses
| buffering to read it into a string.
| scottlamb wrote:
| What would you expect? There's no OS API for "read one
| character", except in say ASCII where 1 byte = 1 code
| point = 1 character. And it'd be hideously inefficient
| anyway. So you either loop over getting the next N bytes
| and getting all complete characters so far (with some
| extra complexity around characters that cross chunk
| boundaries) or you read the whole thing into a single
| buffer and iterate the characters. It does the latter. If
| it doesn't have the ability to respond by asking
| requirements questions, I'd consider either choice valid.
|
| Of course, in real life, I do expect to get requirements
| questions back from an engineer when I assign a task.
| Seems more practical than anticipating everything up-
| front into the perfect specification/prompt. Why
| shouldn't I expect the same from an LLM-based tool? Are
| any of them set up to do that?
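|
| For concreteness, the first option (read a chunk, emit every
| complete character, carry leftover bytes into the next read)
| might look like this untested sketch:
|
|     use std::io::{self, Read};
|     use std::str;
|
|     fn stream_chars<R: Read>(mut input: R, mut f: impl FnMut(char)) -> io::Result<()> {
|         let mut buf = [0u8; 8192];
|         let mut carry = 0; // leftover bytes of a char split across chunks
|         loop {
|             let n = input.read(&mut buf[carry..])?;
|             if n == 0 {
|                 break; // EOF (carry > 0 here would mean truncated UTF-8)
|             }
|             let filled = carry + n;
|             // Take the longest valid UTF-8 prefix of what we have.
|             let valid_len = match str::from_utf8(&buf[..filled]) {
|                 Ok(_) => filled,
|                 // NB: a robust version should check e.error_len() to
|                 // distinguish a split character from invalid bytes.
|                 Err(e) => e.valid_up_to(),
|             };
|             for c in str::from_utf8(&buf[..valid_len]).unwrap().chars() {
|                 f(c);
|             }
|             // Move the incomplete tail to the front for the next read.
|             buf.copy_within(valid_len..filled, 0);
|             carry = filled - valid_len;
|         }
|         Ok(())
|     }
|
| Called as e.g. `stream_chars(File::open(path)?, |c| print!("{}", c))`,
| it buffers reads but still hands back one character at a time.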
| deathanatos wrote:
| On errors, it exits with success.
| DabbyDabberson wrote:
| The way I see it, it's undetermined whether generative AI will
| be able to fully do a SWE job.
|
| But for most of the debates I've seen, I don't think the
| answer matters all that much.
|
| Once we have models that can act as full senior SWEs, the
| models can engineer the models. And then we've hit the
| recursive case.
|
| Once models can engineer models better and faster than humans,
| all bets are off. It's the foggy future. It's the singularity.
| dvt wrote:
| > Once we have models that can act as full senior SWEs.. the
| models can engineer the models.
|
| This is such an extremely bullish case, I'm not sure why
| you'd think this is even remotely possible. A Google search
| is usually more valuable than ChatGPT. For example, the rust
| utf-8 example is already verbatim solved on reddit:
| https://www.reddit.com/r/rust/comments/l5m1rw/how_can_i_effi...
| lispisok wrote:
| Their demo is so similar to the Devin one that I had to go look
| up the Devin one to check I wasn't watching the same demo. I feel
| like there might be a reason they both picked Sympy. Also, I
| rarely put weight into demos. They are usually cherry-picked at
| best and outright fabricated at worst. I want to hear what 3rd
| parties have to say after trying these things.
| lewhoo wrote:
| Maybe that's the point of this research. Hey look, we
| reproduced the way to game the stats a bit. I really can't tell
| anymore.
| Frummy wrote:
| Interesting idea to provide the Agent-Computer Interface so it
| can scroll and such, and interact more easily from its
| perspective.
| aussieguy1234 wrote:
| Similar to how early computers didn't have enough RAM to
| display the whole text file, so old programmers had to work
| with parts of the file at a time. It's not a bad way to get
| around the context window problem, which is kind of similar.
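|
| As a sketch of the idea (the window size and function shape here
| are my guesses, not SWE-agent's actual ACI):
|
|     use std::fs;
|
|     /// Render a fixed-size window of a file with line numbers,
|     /// the way an agent might see it before issuing a "scroll"
|     /// command to move the window.
|     fn show_window(path: &str, start: usize, window: usize) -> std::io::Result<String> {
|         let text = fs::read_to_string(path)?;
|         Ok(text
|             .lines()
|             .enumerate()
|             .skip(start)
|             .take(window)
|             .map(|(i, line)| format!("{:>5}: {}", i + 1, line))
|             .collect::<Vec<_>>()
|             .join("\n"))
|     }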
| unit_circle wrote:
| A 1/8 chance of fixing a bug at the cost of a careful review and
| some corrections is not bad.
|
| 0% -> 12% improvement is not bad for two years either (I'm
| somewhat arbitrarily picking the release date of ChatGPT). If this
| can be kept up for a few years we will have some extremely useful
| tooling. The cost can be relatively high as well, since
| engineering time is currently orders of magnitude more expensive
| than these tools.
| golergka wrote:
| It's still abysmal from the POV of actually using it in production,
| but it's a very impressive rate of improvement. Given what
| happened with LLMs and image generation in the last few years,
| we can probably assume that these systems will be able to fix
| most trivial bugs pretty soon.
| stefan_ wrote:
| These ,,benchmark" are tuned around reporting some exciting
| result, once you look inside, all the ,,fixes" are trash.
| blharr wrote:
| I still don't know. I feel like there are many cases where GPT
| will write some code or fix a bug in a way that makes it
| significantly harder to debug. Even for relatively simple
| tasks, it's kind of like machine-generated code that I would
| not want to touch.
| WanderPanda wrote:
| It is a bit worrisome, but we manage to deal with subpar human
| code as well. Often the boilerplate generated by ChatGPT is
| already better than what an inexperienced coder would string
| together. I'm sure it will not be a free lunch, but the
| benefits will probably outweigh the downsides.
|
| Interesting scalability questions will arise with respect to
| security when scaling the already unmanageably large code bases
| by another magnitude (or two), though.
| anotherpaulg wrote:
| Very cool project!
|
| I've experimented in this direction previously, but found agentic
| behavior is often chaotic and leads to long expensive sessions
| that go down a wrong rabbit hole and ultimately fail.
|
| It's great that you succeed on 12% of swe-bench, but what happens
| the other 88% of the time? Is it useless wasted work and token
| costs? Or does it make useful progress that can be salvaged?
|
| Also, I think swe-bench is from your group, right? Have you made
| any attempt to characterize a "skilled human upper bound" score?
|
| I randomly sampled a dozen swe-bench tasks myself, and found that
| many were basically impossible for a skilled human to "solve",
| mainly because the tasks were under-specified with respect to the
| hidden test cases that determine passing. The tests were checking
| implementation-specific details from the repo's PR that weren't
| actually stated requirements of the task.
| a_wild_dandan wrote:
| Personally, I'd just use one of my local MacBook models (e.g.
| Mixtral 8x7b) and forget about any wasted branches & cents. My
| debugging time costs many orders of magnitude more than SWE-
| agent, so even a 5% backlog savings would be spectacular!
| ein0p wrote:
| I've tried this with another similar system. FOSS LLMs,
| including Mixtral, are currently too weak to handle something
| like this. For me they run out of steam after only a few
| turns and start going in circles unproductively.
| swatcoder wrote:
| > My debugging time costs many orders of magnitude more than
| SWE-agent
|
| Unless your job is primarily to clean up somebody else's
| mess, your debugging time is a key part of a career-long
| feedback loop that improves your craft. Be careful not to
| shrug it off as something less. Many, many people are spending
| a lot of money to let you forget it, and once you do, you'll
| be right there in the ranks of the cheaply replaceable.
|
| (And on the odd chance that cleaning up other people's mess
| _is_ your job, you should probably be the one doing it; and
| for largely the same reasons.)
| Aperocky wrote:
| That's assuming that the other 95% stays the same with this
| new agent (vs. creating more work for you, since you now also
| have to parse what the model is saying).
| trebligdivad wrote:
| So this issues arbitrary shell commands based on trying to
| understand the untrusted bug text? Should be fun waiting until
| someone finds an escape.
| sumeruchat wrote:
| Once we have this fully automated, any good developer could have
| a team of 100 robo SWEs and ship like crazy. The real competition
| is with those devs, not with the bots.
| recursive wrote:
| Shipping like crazy isn't useful by itself. Shipping non-
| garbage and being able to maintain it still has _some_ value.
| rwmj wrote:
| Do we know how much extra work it created for the real people who
| had to review the proposed fixes?
| r0ze-at-hn wrote:
| Ah, well let me tell you about my pull request reviewer LLM
| project.
| aussieguy1234 wrote:
| 12% fix rate = 88% bug rate
| mlcrypto wrote:
| Yep. After xz we don't need a bot mindlessly fixing all
| suggestions from malicious actors
| paradite wrote:
| For anyone who didn't bother looking deeper, the SWE-bench
| benchmark contains only Python projects, so it is not
| representative of all programming languages and frameworks.
|
| I'm working on a more general SWE task eval framework in JS for
| arbitrary languages and frameworks now (for starters JS/TS, SQL,
| and Python), for my own prompt engineering product.
|
| Hit me up if you are interested.
| danenania wrote:
| I'm working on a somewhat similar project:
| https://github.com/plandex-ai/plandex
|
| While the overall goal is to build arbitrarily large, complex
| features and projects that are too much for ChatGPT or IDE-based
| tools, another aspect that I've put a lot of focus on is how to
| handle mistakes and corrections when the model starts going off
| the rails. Changes are accumulated in a protected sandbox
| separate from your project files, a diff review TUI is included
| that allows for bad changes to be rejected, all actions are
| version-controlled so you can easily go backwards and try a
| different approach, and branches are also included for trying out
| multiple approaches.
|
| I think nailing this developer-AI feedback loop is the key to
| getting authentic productivity gains. We shouldn't just ask how
| well a coding tool can pass benchmarks, but what the failure case
| looks like when things go wrong.
| etheridev wrote:
| You need to make yourself a business analyst agent to provide
| the feedback! To make it real, perhaps a team of them with
| conflicting personalities.
| danenania wrote:
| I think we'll get there at some point, but one thing I've
| learned from this project is how difficult it is to stack AI
| interactions. Each little bit of AI-based logic that gets
| added tends to fail terribly at first. Only after a long
| period of intense testing and iteration does it become
| remotely usable. The more you are combining different kinds
| of tasks, the more difficult it gets.
| dimal wrote:
| The demo shows a very clearly written bug report about a matrix
| operation that's producing an unexpected output. Umm... no. Most
| bug reports you get in the wild are more along the lines of "I
| clicked on X and Y happened", and then, if you're lucky, they'll
| say "and I expected Z". Usually the Z expectation is left for the
| reader to fill in, because as human users we understand the
| expectations.
|
| The difficulty in fixing a bug is in figuring out what's causing
| the bug. If you know it's caused by an incorrect operation, and
| we know that LLMs can fix simple defects like this, what does
| this prove?
|
| Has anyone dug through the paper yet to see what the rest of the
| issues look like? And what the diffs look like? I suppose I'll
| try when I have a sec.
| drcode wrote:
| > Most bug reports you get in the wild are more along the lines
| of
|
| Since this fixes 12% of the bugs, the authors of the paper
| probably agree with you that 100 - 12 = 88%, and hence "most
| bugs" don't have nicely written bug reports.
| readthenotes1 wrote:
| I made a lot of money as I was paid hourly while working with a
| cadre of people I called "the defect generators".
|
| I'm kind of sad that future generations will not have that
| experience...
___________________________________________________________________
(page generated 2024-04-02 23:00 UTC)