[HN Gopher] OpenDevin: An Open Platform for AI Software Develope...
___________________________________________________________________
OpenDevin: An Open Platform for AI Software Developers as
Generalist Agents
Author : geuds
Score : 116 points
Date : 2024-08-11 12:02 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| ai4ever wrote:
| i don't like to discourage or be a naysayer. but,
|
| don't build a platform for software on something inherently
| unreliable. if there is one lesson i have learnt, it is that
| systems and abstractions are built on interfaces which are
| reliable and deterministic.
|
| focus on LLM use cases where accuracy is not paramount - there
| are tons of them: OCR, summarization, reporting, recommendations.
| viraptor wrote:
| People are already unreliable and non-deterministic. Looking at
| that aspect, we're not losing anything.
| ai4ever wrote:
| people may be unreliable, but the software they produce
| needs to work reliably.
|
| a software system is like legos: it forms a system of
| dependencies. each component in the chain has interfaces
| which other components depend on. 99% reliability doesn't
| cut it for software components.
| viraptor wrote:
| I'm not sure, but you may be misunderstanding the project,
| or trying to make some point I'm missing. This project just
| automates some code tasks. The developer is still
| responsible for the design / reliability / component
| interfaces. If the result doesn't match your expectations,
| you can either finish it yourself, or send this tool for
| another loop with new instructions.
| ai4ever wrote:
| let me test it out, and then provide better feedback.
| falcor84 wrote:
| >the software they produce needs to work reliably
|
| The word "need" is an extreme overstatement here. The vast
| majority of software out there is unreliable. If anything,
| I believe it is AI that can finally bring formally verified
| software into the industry, because us regular human devs
| definitely aren't doing that.
| ai4ever wrote:
| that's fair to say: humans cannot be the gatekeepers for
| accuracy or reliability.
|
| but why should the solution involve AI (that's just the
| latest bandwagon)? formal verification of software has a
| long history which has nothing to do with AI.
| cma wrote:
| Probably because of Google's recent math olympiad results
| using AI-directed search in formal proof systems.
| stale2002 wrote:
| > but why should the solution involve AI
|
| Because AI is able to produce lots of results, covering a
| wide range of domains, and it can do so cheaply.
|
| Sure, there are some quality issues. But that is the case
| for most software.
| dartos wrote:
| What part of "AI" implies "formally verified?"
| ekianjo wrote:
| And that's precisely why we don't use people to do tests and
| to ensure that things work reliably. We use code instead.
| ben_w wrote:
| I've had trouble trying to convince a few different people
| of this over the years.
|
| In one case, the other dev refused to allow a commit (fine)
| because some function had known flaws and should no longer
| be used for new code (good reason). This fact wasn't
| documented anywhere (raising flags), so I tried to add a
| deprecation tag as well as changing the thing; they refused
| to allow any deprecation tags "because committed code
| should not generate warnings" (putting the cart before the
| horse) -- and even refused to accept that such a warning
| might be a useful thing for anyone. So, they became a human
| compiler in the mode of all-warnings-are-errors... but only
| they knew what the warnings were, because they refused to
| allow them to be entered into the code. No sense of irony.
| And of course, they didn't like it when someone else
| approved a commit before they could get in and say "no,
| because ${thing nobody else knew}".
|
| In a different case, _years_ after Apple had switched ObjC
| to use ARC, the other dev was refusing to update despite
| the semi-automated tool Apple provided to help with the ARC
| transition. The C++ parts of their codebase were even
| worse, as they didn't know anything about smart pointers
| and were using raw pointers, new, and delete everywhere --
| I still don't count myself as a C++ dev despite having
| occasionally used it in a few workplaces, and yet I knew
| about them even then.
|
| And, as I'm sure everyone here has experienced, I've seen
| a few too many places that rely on manual testing.
| viraptor wrote:
| That's not universal. QA teams exist for things which are
| not easy to automatically test. We also continuously test
| subjective areas like "does this website look good".
| usrbinbash wrote:
| Yes, they are, and that's precisely why we use computers and
| deterministic code for many tasks instead of people.
| ben_w wrote:
| As a result of human unreliability, we had to invent
| bureaucracy and qualifications for society at large, and
| design patterns and automated testing for software engineers
| in particular.
|
| I have a suspicion that there's a "best design pattern" and
| "best architecture" for getting the most out of existing LLMs
| (and some equivalents for non-software usage of LLMs and also
| non-LLM AI), but I'm not sure it's worth the trouble to find
| out what that is rather than just wait for AI models to get
| better.
| atemerev wrote:
| > systems and abstractions are built on interfaces which are
| reliable and deterministic.
|
| Are you sure we live in the same world? The world where there
| is Crowdstrike and a new zero day every week?
|
| Software engineering is beautifully chaotic, I like it like
| that.
| wongarsu wrote:
| GitHub repo of the project: https://github.com/OpenDevin/OpenDevin
| candiddevmike wrote:
| Why isn't this integrated with an IDE? Or am I missing that?
| Onawa wrote:
| I don't believe so; it's meant to run in its own Docker
| container sandbox. If you're looking for something that is
| integrated with an IDE, my current favorite plugin is
| https://www.continue.dev/. Apache 2.0 license, local or remote
| LLM integration, automatic documentation scraping (with a hefty
| list of docs preinstalled), and the ability to selectively add
| context to your prompts (@docs, @codebase, @terminal, etc.). I
| haven't seen any great human-in-the-loop-in-the-IDE options
| quite yet.
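|
| For example (going from memory, so treat the exact syntax as
| approximate): typing "@codebase where is the retry logic?" pulls
| relevant snippets from your repo into the prompt, while "@docs
| ..." does the same against the scraped documentation.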
| rambocoder wrote:
| Last time I used Continue, it was still phoning home by
| default; you had to opt out of telemetry.
| rbren wrote:
| It's on the roadmap! Stay tuned...
| eterps wrote:
| Does it have different goals than: https://aider.chat ?
| adamgordonbell wrote:
| Yeah, it has more agency: it looks up docs, installs
| dependencies, writes and runs tests.
|
| Aider is more understandable to me, doing small chunks of work,
| but it won't do a Google search to find usage examples, etc. It
| depends on you to choose which files to put in context, and so
| on.
|
| I wish aider had a bit more of the self-directedness of this,
| but API calls and token usage would be greatly increased.
|
| Edit: or maybe an agency loop like this, steering aider based
| on a larger goal, would be useful?
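|
| Something like this, maybe (hand-wavy sketch: it leans on
| aider's --message and --yes flags, and in practice the step
| list would come from an LLM planner rather than being
| hardcoded):
|
|     import subprocess
|
|     # hypothetical outer "agency" loop: split a big goal into
|     # steps, then hand each step to aider non-interactively
|     steps = [
|         "add a --verbose flag to cli.py",
|         "write a unit test covering the new flag",
|     ]
|     for step in steps:
|         # --message runs one instruction and exits; --yes
|         # auto-confirms aider's prompts
|         subprocess.run(["aider", "--yes", "--message", step],
|                        check=True)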
| danenania wrote:
| My project Plandex[1] fits somewhere between aider and
| opendevin in terms of autonomy, so you might find it
| interesting. It attempts to complete a task autonomously in
| terms of implementing all the code, regardless of how many
| steps that takes, but it doesn't yet try to auto-select
| context, execute code, or debug its own errors. Though it
| does have a syntax validation step and a general verification
| step that can auto-fix common issues.
|
| 1 - https://plandex.ai
| bearjaws wrote:
| Probably to be fully autonomous, vs guided like aider.
|
| I still think a tool like aider is where AI is heading; these
| "agents" are built upon systems that are 15% error-prone and
| just compound errors, with little ability to _actually_
| correct them.
| bearjaws wrote:
| So does arxiv.org just let _anyone_ publish a paper now? It
| seems to be used by AI researchers a lot more now, instead of
| just writing a blog post.
| fswd wrote:
| yes, that's the whole point of arXiv: to allow anyone to publish.
| mr_mitm wrote:
| They've always let anyone publish a paper, as long as the
| submitter has an email address from a known institution OR an
| endorsement from someone who does. Any .edu email may actually
| suffice, if I'm not mistaken.
| aDyslecticCrow wrote:
| arxiv.org is not a peer-reviewed publication but an archive of
| scientific documents. Notably, it includes preprints,
| conference papers, and a fair number of bachelor's and
| master's projects.
|
| The best way to use arxiv.org is to find a paper you want to
| read in a "real" publication and get the PDF from arxiv.org,
| so you can read it without a subscription to the publication.
|
| That is not to say arxiv.org is all horseshit, though. Plenty
| of good stuff gets added there; you just need to keep your
| bullshit radar active while reading. Even some stuff published
| in Nature or IEEE smells like unwashed feet once you read it,
| let alone what arxiv.org accepts.
|
| A good citation count and decent writing are often better
| indicators than a reputable publication.
| yunohn wrote:
| The exact same thing happened with crypto and "whitepapers". I
| think it's because both these fields have so many grifters that
| believe an arxiv paper provides them much-needed legitimacy. A
| blog post doesn't have the same aura to it...
| adamgordonbell wrote:
| I tried OpenDevin for a sort of one-off script that did some
| file processing.
|
| It was a bit inscrutable what it did, but it worked no problem.
| Much like the ChatGPT code interpreter looping on Python errors
| until it has a working solution, including pip-installing the
| right libs and reading the libs' docs after usage errors.
|
| N of 1, and a small freestanding task I had done myself
| already, but I was impressed.
| yeldarb wrote:
| Tried it a few weeks ago for a task (had a few dozen files in an
| open source repo I wanted to write tests for in a similar way to
| each other).
|
| I gave it one example and then asked it to do the work for the
| other files.
|
| It was able to do about half the files correctly. But it ended
| up taking an hour, costing >$50 in OpenAI credits, and it took
| me longer to debug, fix, and verify the work than it would have
| taken to do the work manually.
|
| My take: good glimpse of the future after a few more Moore's Law
| doublings and model improvement cycles make it 10x better, 10x
| faster, and 10x cheaper. But probably not yet worth trying to use
| for real work vs playing with it for curiosity, learning, and
| understanding.
|
| Edit: writing the tests in this PR given the code + one test as
| an example was the task:
| https://github.com/roboflow/inference/pull/533
|
| This commit was the manual example:
| https://github.com/roboflow/inference/pull/533/commits/93165...
|
| This commit adds the partially OpenDevin-written ones:
| https://github.com/roboflow/inference/pull/533/commits/65f51...
| rbren wrote:
| OpenDevin maintainer here. This is a reasonable take.
|
| I have found it immensely useful for a handful of one-off
| tasks, but it's not yet a mission-critical part of my workflow
| (the way e.g. Copilot is).
|
| Core model improvements (better, faster, cheaper) will
| definitely be a tailwind for us. But there are also many things
| we can do in the abstraction layer _above_ the LLM to drive
| these things forward. And there's also a lot we can do from a
| UX perspective (e.g. IDE integrations, better human-in-the-loop
| experiences, etc.)
|
| So even if models never get better (doubtful!) I'd continue to
| watch this space--it's getting better every day.
| anotherpaulg wrote:
| As a comparison, I use aider every day to develop aider.
|
| Aider wrote 61% of the new code in its last release. It's
| been averaging about 50% since the new Sonnet came out.
|
| Data and graphs about aider's contribution to its own code
| base:
|
| https://aider.chat/HISTORY.html
| Lerc wrote:
| How heavy are the API costs for that?
|
| For a project like yours I guess you should be given free
| credits. I hope that happens, but so far nobody has even
| given Karpathy a good standalone mic.
| jijji wrote:
| Instead of using the OpenAI API, can it use the locally hosted
| Ollama HTTP API?
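|
| Ollama does expose an OpenAI-compatible endpoint, so I'd hope
| you could just point an OpenAI client at it, something like
| this (untested sketch; the model name is only an example):
|
|     from openai import OpenAI
|
|     # Ollama's OpenAI-compatible API lives under /v1 on its
|     # default local port; the api_key is required by the
|     # client but ignored by Ollama
|     client = OpenAI(base_url="http://localhost:11434/v1",
|                     api_key="ollama")
|     resp = client.chat.completions.create(
|         model="llama3",
|         messages=[{"role": "user", "content": "Say hello."}],
|     )
|     print(resp.choices[0].message.content)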
| strangescript wrote:
| Guessing you used 4o and not 4o-mini. For stuff like this you
| are better off letting it use mini, which is practically free,
| and then having it double- and triple-check everything.
| MattDaEskimo wrote:
| It doesn't work like that. You're more likely to end up with
| a fractal pattern of token waste, potentially veering off
| into hallucinations, than to make actual progress by "double"
| or "triple checking everything".
| threeseed wrote:
| This assumes that the model knows it is wrong. It doesn't.
|
| It only knows statistically what is the most likely sequence
| of words to match your query.
|
| It falls apart on rarer datasets: e.g., when I had
| Claude/OpenAI help out with an IntelliJ plugin, they would
| continually invent methods for classes that never existed.
| And they could never articulate why.
| popinman322 wrote:
| This is where supporting machinery & RAG are very useful.
|
| You can auto-lint and test code before you set eyes on it,
| then re-run the prompt with either more context or an
| altered prompt (see the sketch at the end of this comment).
| With local models there are options like steering vectors,
| fine-tuning, and constrained decoding as well.
|
| There's also evidence that multiple models of different
| lineages, when their outputs are rated and you take the
| best one at each step, can surpass the performance of
| better models. So if one model knows something the others
| don't, you can automatically fail over to the one that can
| actually handle the problem, and typically, once the
| knowledge is in the chat, the other models will pick it up.
|
| Not saying we have the solution to your specific problem in
| any readily available software, but that there are
| approaches specific to your problem that go beyond current
| methods.
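|
| As a concrete (if simplified) sketch of the auto-lint-and-retry
| idea, assuming ruff as the linter and with the model call
| stubbed out:
|
|     import subprocess
|     import tempfile
|
|     def generate(prompt: str) -> str:
|         raise NotImplementedError("plug your LLM call in here")
|
|     prompt = "Write a function that parses ISO 8601 dates."
|     for attempt in range(3):
|         code = generate(prompt)
|         with tempfile.NamedTemporaryFile(
|                 "w", suffix=".py", delete=False) as f:
|             f.write(code)
|         lint = subprocess.run(["ruff", "check", f.name],
|                               capture_output=True, text=True)
|         if lint.returncode == 0:
|             break  # clean: now it's worth a human's eyes
|         # otherwise re-run, with the linter's complaints as
|         # added context
|         prompt += ("\n\nThe previous attempt failed lint:\n"
|                    + lint.stdout)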
| __loam wrote:
| This is a really complicated (and more expensive) setup
| that doesn't fundamentally fix any of the problems with
| these systems.
| threeseed wrote:
| It doesn't make sense that the solution here is to put
| more load on the user to continually adjust the prompt or
| try different models.
|
| I asked Claude and OpenAI models over 30 times to
| generate code. Both failed every time.
| OutOfHere wrote:
| 4o-mini is cheap, but it is not practically free. At scale it
| will still rack up a cost, although I acknowledge that we are
| currently in the honeymoon phase with it. Computing is the
| kind of thing that we just do more of when it becomes
| cheaper, with the budget being constant.
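|
| For a rough sense of scale: at ~$0.15 per million input tokens
| (4o-mini's launch price, if I remember right), an agent
| pipeline chewing through a billion tokens a day still runs
| ~$150/day.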
| threeseed wrote:
| > 10x better, 10x faster, and 10x cheaper
|
| Which is the elephant in the room.
|
| There is no roadmap for any of these to happen and a strong
| possibility that we will start to see diminishing returns with
| the current LLM implementation and available datasets. At which
| point all of the hype and money will come out of the industry.
| Which in turn will cause a lull in research until the next big
| breakthrough and the cycle repeats.
| Sysreq2 wrote:
| While we have started seeing diminishing returns on rote data
| ingestion, especially with synthetic data leading to
| collapse, there is plenty of other work being done to suggest
| that the field will continue to thrive. Moore's law isn't
| going anywhere for at least a decade - so as we get more
| computing power, faster memory interconnects, and purpose
| built processors, there is no reason to suspect AI is going
| to stagnate. Right now the bottleneck is arguably more
| algorithmic than compute bound anyways. No one will ever need
| more than 640kb of RAM, right?
| thwarted wrote:
| I feel like the GP and this response are a common exchange
| right before the next AI Winter hits.
| __loam wrote:
| https://cap.csail.mit.edu/death-moores-law-what-it-means-
| and...
| threeseed wrote:
| a) It's been widely acknowledged that we are approaching a
| limit on useful datasets.
|
| b) Synthetic data sets have been shown to not be a
| substitute.
|
| c) I have no idea why you are linking Moore's Law with AI.
| Especially when it has never applied to GPUs and we are in
| a situation where we have a single vendor not subject to
| normal competition.
| Agentus wrote:
| I just watched an Andrew Ng video (he's the guy I tend to
| learn the latest prompting, agentic, and visual-agentic
| practices from) saying that hardware companies as well as
| software companies are working on making these improvements
| happen, especially at the inference stage.
| GregOz wrote:
| Can you include a link to Andrew Ng's video, please?
| __loam wrote:
| Strong chance Moore's law stops this decade due to the
| physical limits on the size of atoms lol.
| andreasmetsala wrote:
| I've been hearing that for at least a decade.
| czk wrote:
| I used this to scaffold out 5 HTML pages for a web app, having it
| iterate on building the UX. Did a pretty good job and took about
| 10 minutes of iterating with it, but cost me about $10 in API
| credits which was more than I expected.
| rbren wrote:
| Cost is one of our biggest issues right now. There's a lot we
| can do to mitigate, but we've been focused on getting something
| that works well before optimizing for efficiency.
| orzig wrote:
| I think that's correct - even at a "high" cost (relative to
| what? a random SaaS app, or an hour of a moderately competent
| full-stack dev?), the ROI will already be there for some
| projects. And as prices naturally improve, a larger and
| larger portion of projects will make sense, while we also
| build economies of scale with inference infrastructure.
| skywhopper wrote:
| Please don't give any tools, AI or not, the freedom to run away
| like this. You're inviting a new era of runaway worm-style
| viruses by giving such autonomy to easily manipulated programs.
|
| To what end anyway? This is massively resource heavy, and the end
| goal seems to be to build a program that would end your career.
| Please work on something that will actually make coding easier
| and safer rather than building tools to run roughshod over
| civilization.
| causal wrote:
| I suspect that the pursuit of LLM agents is rooted in falling for
| the illusion of a mind which LLMs so easily weave.
|
| So much of the stuff being built on LLMs in general seems fixated
| on making that illusion more believable.
| rbren wrote:
| This is an interesting take, but I don't think it quite
| captures the idea of "agents".
|
| I prefer to think of agents as _feedback loops_, with an LLM as
| the engine. An agent takes an action in the world, sees the
| results, then takes another action. This is what makes them so
| much more powerful than a raw LLM.
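|
| A minimal sketch of that loop (the llm() stub and a bare shell
| as the whole action space are stand-ins for a real model call
| and real tools):
|
|     import subprocess
|
|     def llm(history: list[str]) -> str:
|         raise NotImplementedError("model call goes here")
|
|     history = ["GOAL: make the test suite pass"]
|     while True:
|         # the model proposes the next shell command
|         action = llm(history)
|         if action == "DONE":
|             break
|         result = subprocess.run(action, shell=True,
|                                 capture_output=True, text=True)
|         # feed the observation back in -- this loop is the "agent"
|         history.append(f"$ {action}\n{result.stdout}{result.stderr}")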
| bofadeez wrote:
| That works only if the LLM has adequate external feedback
| from a terminal and browser in context, along with the past
| trials, etc.
|
| It can't self-correct its own reasoning:
| https://arxiv.org/abs/2310.01798
| causal wrote:
| I think "sees the results" also embeds the idea of a mind. An
| LLM doesn't have a mind to see or plan or think with.
|
| An LLM in a loop creates agency much like a car rolling
| downhill is self-driving.
| Animats wrote:
| Nice.
|
| The "Browsing agent" is a bit worrisome. That can reach outside
| the sandboxed environment. _" At each step, the agent prompts the
| LLM with the task description, browsing action space description,
| current observation of the browser using accessibility tree,
| previous actions, and an action prediction example with chain-of-
| thought reasoning. The expected response from the LLM will
| contain chain-of-thought reasoning plus the predicted next
| actions, including the option to finish the task and convey the
| result to the user."_
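|
| From that description, the prompt presumably gets assembled
| something like this (field names and wording are my guesses,
| not the paper's):
|
|     # rough shape of the browsing agent's prompt, per the
|     # quoted passage; all values here are made up
|     task = "Find the cheapest flight to Tokyo"
|     actions = "click(id), type(id, text), goto(url), finish(answer)"
|     axtree = "[1] link 'Flights' [2] textbox 'Destination' ..."
|     past = "goto('https://flights.example.com')"
|     example = ("Thought: I should search first. "
|                "Action: type(2, 'Tokyo')")
|     prompt = (f"TASK: {task}\nACTIONS: {actions}\n"
|               f"PAGE (accessibility tree): {axtree}\n"
|               f"PREVIOUS: {past}\nEXAMPLE: {example}\n"
|               "Think step by step, then output the next action.")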
|
| How much can that do? Is it smart enough to navigate login and
| signup pages? Can it sign up for a social media account? Buy
| things on Amazon?
___________________________________________________________________
(page generated 2024-08-11 23:00 UTC)