[HN Gopher] OpenDevin: An Open Platform for AI Software Develope...
___________________________________________________________________
OpenDevin: An Open Platform for AI Software Developers as
Generalist Agents
Author : geuds
Score : 116 points
Date : 2024-08-11 12:02 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| ai4ever wrote:
| i don't like to discourage or be a naysayer. but,
|
| don't build a platform for software on something inherently
| unreliable. if there is one lesson i have learnt, it is that
| systems and abstractions are built on interfaces which are
| reliable and deterministic.
|
| focus on LLM use cases where accuracy is not paramount - there
| are tons of them: OCR, summarization, reporting, recommendations.
| viraptor wrote:
| People are already unreliable and non-deterministic. Looking at
| that aspect, we're not losing anything.
| ai4ever wrote:
| people may be unreliable, but the software they produce
| needs to work reliably.
|
| a software system is like legos: it forms a system of
| dependencies. each component in the chain has interfaces
| which other components depend on. 99% reliability doesn't
| cut it for software components.
| viraptor wrote:
| I'm not sure, but you may be misunderstanding the project,
| or trying to make some point I'm missing. This project just
| automates some code tasks. The developer is still
| responsible for the design / reliability / component
| interfaces. If the result doesn't match your expectations,
| you can either finish it yourself, or send this tool for
| another loop with new instructions.
| ai4ever wrote:
| let me test it out, and then provide better feedback.
| falcor84 wrote:
| >the software they produce needs to work reliably
|
| The word "need" is an extreme overstatement here. The vast
| majority of software out there is unreliable. If anything,
| I believe it is AI that can finally bring formally verified
| software into the industry, because us regular human devs
| definitely aren't doing that.
| ai4ever wrote:
| that's fair to say: humans cannot be the gatekeepers for
| accuracy or reliability.
|
| but why should the solution involve AI (that's just the
| latest bandwagon)? formal verification of software has a
| long history which has nothing to do with AI.
| cma wrote:
| Probably because of Google's recent math olympiad results
| using AI-directed search in formal proof systems.
| stale2002 wrote:
| > but why should the solution involve AI
|
| Because AI is able to produce lots of results, covering a
| wide range of domains, and it can do so cheaply.
|
| Sure, there are some quality issues. But that is the case
| for most software.
| dartos wrote:
| What part of "AI" implies "formally verified?"
| ekianjo wrote:
| And that's precisely why we don't use people to do tests and
| to ensure that things work reliably. We use code instead.
| ben_w wrote:
| I've had trouble trying to convince a few different people
| of this over the years.
|
| In one case, the other dev refused to allow a commit (fine)
| because some function had known flaws and should no longer
| be used for new code (good reason). This fact wasn't
| documented anywhere (raising flags), so I tried to add a
| deprecation tag as well as changing the thing; they refused
| to allow any deprecation tags "because committed code
| should not generate warnings" (putting the cart before the
| horse) -- and even refused to accept that such a warning
| might be a useful thing for anyone. So, they became a human
| compiler in the mode of all-warnings-are-errors... but only
| they knew what the warnings were, because they refused to
| allow them to be entered into the code. No sense of irony.
| And of course, they didn't like it when someone else
| approved a commit before they could get in and say "no,
| because ${thing nobody else knew}".
|
| In a different case, _years_ after Apple had switched ObjC
| to use ARC, the other dev was refusing to update despite
| the semi-automated tool Apple provided to help with the ARC
| transition. The C++ parts of their codebase were even
| worse, as they didn't know anything about smart pointers
| and were using raw pointers, new, and delete everywhere --
| I still don't count myself as a C++ dev despite having
| occasionally used it in a few workplaces, and yet I knew
| about them even then.
|
| And, as I'm sure everyone here has experienced, I've seen
| a few too many places that rely on manual testing.
| viraptor wrote:
| That's not universal. QA teams exist for things which are
| not easy to automatically test. We also continuously test
| subjective areas like "does this website look good".
| usrbinbash wrote:
| Yes, they are, and that's precisely why we use computers and
| deterministic code for many tasks instead of people.
| ben_w wrote:
| As a result of human unreliability, we had to invent
| bureaucracy and qualifications for society at large, and
| design patterns and automated testing for software engineers
| in particular.
|
| I have a suspicion that there's a "best design pattern" and
| "best architecture" for getting the most out of existing LLMs
| (and some equivalents for non-software usage of LLMs and also
| non-LLM AI), but I'm not sure it's worth the trouble to find
| out what that is rather than just wait for AI models to get
| better.
| atemerev wrote:
| > systems and abstractions are built on interfaces which are
| reliable and deterministic.
|
| Are you sure we live in the same world? The world where there
| is Crowdstrike and a new zero day every week?
|
| Software engineering is beautifully chaotic, I like it like
| that.
| wongarsu wrote:
| GitHub repo of the project: https://github.com/OpenDevin/OpenDevin
| candiddevmike wrote:
| Why isn't this integrated with an IDE? Or am I missing that?
| Onawa wrote:
| I don't believe so; it's meant to run in its own Docker
| container sandbox. If you're looking for something that is
| integrated with an IDE, my current favorite plugin is
| https://www.continue.dev/. Apache 2.0 license, local or remote
| LLM integration, automatic documentation scraping (with a hefty
| list of docs preinstalled), and the ability to selectively add
| context to your prompts (@docs, @codebase, @terminal, etc.). I
| haven't seen any great human-in-the-loop-in-the-IDE options
| quite yet.
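|
| For example (going from memory, so treat the exact syntax as
| approximate): typing "@codebase where is the retry logic?" pulls
| relevant snippets from your repo into the prompt, while "@docs
| ..." does the same against the scraped documentation.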
| rambocoder wrote:
| Last time I used Continue, it was still phoning home by
| default; you had to opt out of telemetry.
| rbren wrote:
| It's on the roadmap! Stay tuned...
| eterps wrote:
| Does it have different goals than: https://aider.chat ?
| adamgordonbell wrote:
| Yeah, it has more agency: it looks up docs, installs
| dependencies, writes and runs tests.
|
| Aider is more understandable to me, doing small chunks of work,
| but it won't do a Google search to find usage examples, etc. It
| depends on you to choose which files to put in context, and so
| on.
|
| I wish aider had a bit more of the self-directedness of this,
| but API calls and token usage would be greatly increased.
|
| Edit: or maybe an agency loop like this, steering aider based
| on a larger goal, would be useful?
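|
| Something like this, maybe (hand-wavy sketch: it leans on
| aider's --message and --yes flags, and in practice the step
| list would come from an LLM planner rather than being
| hardcoded):
|
|     import subprocess
|
|     # hypothetical outer "agency" loop: split a big goal into
|     # steps, then hand each step to aider non-interactively
|     steps = [
|         "add a --verbose flag to cli.py",
|         "write a unit test covering the new flag",
|     ]
|     for step in steps:
|         # --message runs one instruction and exits; --yes
|         # auto-confirms aider's prompts
|         subprocess.run(["aider", "--yes", "--message", step],
|                        check=True)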
| danenania wrote:
| My project Plandex[1] fits somewhere between aider and
| opendevin in terms of autonomy, so you might find it
| interesting. It attempts to complete a task autonomously in
| terms of implementing all the code, regardless of how many
| steps that takes, but it doesn't yet try to auto-select
| context, execute code, or debug its own errors. Though it
| does have a syntax validation step and a general verification
| step that can auto-fix common issues.
|
| 1 - https://plandex.ai
| bearjaws wrote:
| Probably to be fully autonomous, vs guided like aider.
|
| I still think a tool like aider is where AI is heading; these
| "agents" are built upon systems that are 15% error-prone and
| just compound errors, with little ability to _actually_
| correct them.
| bearjaws wrote:
| So does arxiv.org just let _anyone_ publish a paper now? It
| seems to be used by AI researchers a lot more now, instead of
| just writing a blog post.
| fswd wrote:
| yes, that's the whole point of arXiv: to allow anyone to publish.
| mr_mitm wrote:
| They've always let anyone publish a paper, as long as the
| submitter has an email address from a known institution OR an
| endorsement from someone who does. Any .edu email may actually
| suffice, if I'm not mistaken.
| aDyslecticCrow wrote:
| arxiv.org is not a peer-reviewed publication but an archive of
| scientific documents. Notably, it includes preprints,
| conference papers, and a fair number of bachelor's and
| master's projects.
|
| The best way to use arxiv.org is to find a paper you want to
| read in a "real" publication and get the PDF from arxiv.org,
| so you can read it without a subscription to the publication.
|
| That is not to say arxiv.org is all horseshit, though. Plenty
| of good stuff gets added there; you just need to keep your
| bullshit radar active while reading. Even some stuff published
| in Nature or IEEE smells like unwashed feet once you read it,
| let alone what arxiv.org accepts.
|
| A good citation count and decent writing are often better
| indicators than a reputable publication.
| yunohn wrote:
| The exact same thing happened with crypto and "whitepapers". I
| think it's because both these fields have so many grifters that
| believe an arxiv paper provides them much-needed legitimacy. A
| blog post doesn't have the same aura to it...
| adamgordonbell wrote:
| I tried OpenDevin for a sort of one-off script that did some
| file processing.
|
| It was a bit inscrutable what it did, but it worked no problem.
| Much like the ChatGPT code interpreter looping on Python errors
| until it has a working solution, including pip-installing the
| right libs and reading the libs' docs after usage errors.
|
| N of 1, and a small freestanding task I had done myself
| already, but I was impressed.
| yeldarb wrote:
| Tried it a few weeks ago for a task (had a few dozen files in an
| open source repo I wanted to write tests for in a similar way to
| each other).
|
| I gave it one example and then asked it to do the work for the
| other files.
|
| It was able to do about half the files correctly. But it ended
| up taking an hour, costing >$50 in OpenAI credits, and it took
| me longer to debug, fix, and verify the work than it would have
| taken to do the work manually.
|
| My take: good glimpse of the future after a few more Moore's Law
| doublings and model improvement cycles make it 10x better, 10x
| faster, and 10x cheaper. But probably not yet worth trying to use
| for real work vs playing with it for curiosity, learning, and
| understanding.
|
| Edit: writing the tests in this PR given the code + one test as
| an example was the task:
| https://github.com/roboflow/inference/pull/533
|
| This commit was the manual example:
| https://github.com/roboflow/inference/pull/533/commits/93165...
|
| This commit adds the partially OpenDevin-written ones:
| https://github.com/roboflow/inference/pull/533/commits/65f51...
| rbren wrote:
| OpenDevin maintainer here. This is a reasonable take.
|
| I have found it immensely useful for a handful of one-off
| tasks, but it's not yet a mission-critical part of my workflow
| (the way e.g. Copilot is).
|
| Core model improvements (better, faster, cheaper) will
| definitely be a tailwind for us. But there are also many things
| we can do in the abstraction layer _above_ the LLM to drive
| these things forward. And there's also a lot we can do from a
| UX perspective (e.g. IDE integrations, better human-in-the-loop
| experiences, etc.)
|
| So even if models never get better (doubtful!) I'd continue to
| watch this space--it's getting better every day.
| anotherpaulg wrote:
| As a comparison, I use aider every day to develop aider.
|
| Aider wrote 61% of the new code in its last release. It's
| been averaging about 50% since the new Sonnet came out.
|
| Data and graphs about aider's contribution to its own code
| base:
|
| https://aider.chat/HISTORY.html
| Lerc wrote:
| How heavy are the API costs for that?
|
| For a project like yours I guess you should be given free
| credits. I hope that happens, but so far nobody has even
| given Karpathy a good standalone mic.
| jijji wrote:
| Instead of using the OpenAI API, can it use the locally hosted
| Ollama HTTP API?
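|
| Ollama does expose an OpenAI-compatible endpoint, so I'd hope
| you could just point an OpenAI client at it, something like
| this (untested sketch; the model name is only an example):
|
|     from openai import OpenAI
|
|     # Ollama's OpenAI-compatible API lives under /v1 on its
|     # default local port; the api_key is required by the
|     # client but ignored by Ollama
|     client = OpenAI(base_url="http://localhost:11434/v1",
|                     api_key="ollama")
|     resp = client.chat.completions.create(
|         model="llama3",
|         messages=[{"role": "user", "content": "Say hello."}],
|     )
|     print(resp.choices[0].message.content)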
| strangescript wrote:
| Guessing you used 4o and not 4o-mini. For stuff like this you
| are better off letting it use mini, which is practically free,
| and then having it double- and triple-check everything.
| MattDaEskimo wrote:
| It doesn't work like that. You're more likely to end up with
| a fractal pattern of token waste, potentially veering off
| into hallucinations, than to make actual progress by "double"
| or "triple checking everything".
| threeseed wrote:
| This assumes that the model knows it is wrong. It doesn't.
|
| It only knows statistically what is the most likely sequence
| of words to match your query.
|
| It falls apart on rarer datasets: e.g., when I had
| Claude/OpenAI help out with an IntelliJ plugin, they would
| continually invent methods for classes that never existed.
| And they could never articulate why.
| popinman322 wrote:
| This is where supporting machinery & RAG are very useful.
|
| You can auto-lint and test code before you set eyes on it,
| then re-run the prompt with either more context or an
| altered prompt (see the sketch at the end of this comment).
| With local models there are options like steering vectors,
| fine-tuning, and constrained decoding as well.
|
| There's also evidence that multiple models of different
| lineages, when their outputs are rated and you take the
| best one at each step, can surpass the performance of
| better models. So if one model knows something the others
| don't, you can automatically fail over to the one that can
| actually handle the problem, and typically, once the
| knowledge is in the chat, the other models will pick it up.
|
| Not saying we have the solution to your specific problem in
| any readily available software, but that there are
| approaches specific to your problem that go beyond current
| methods.
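|
| As a concrete (if simplified) sketch of the auto-lint-and-retry
| idea, assuming ruff as the linter and with the model call
| stubbed out:
|
|     import subprocess
|     import tempfile
|
|     def generate(prompt: str) -> str:
|         raise NotImplementedError("plug your LLM call in here")
|
|     prompt = "Write a function that parses ISO 8601 dates."
|     for attempt in range(3):
|         code = generate(prompt)
|         with tempfile.NamedTemporaryFile(
|                 "w", suffix=".py", delete=False) as f:
|             f.write(code)
|         lint = subprocess.run(["ruff", "check", f.name],
|                               capture_output=True, text=True)
|         if lint.returncode == 0:
|             break  # clean: now it's worth a human's eyes
|         # otherwise re-run, with the linter's complaints as
|         # added context
|         prompt += ("\n\nThe previous attempt failed lint:\n"
|                    + lint.stdout)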
| __loam wrote:
| This is a really complicated (and more expensive) setup
| that doesn't fundamentally fix any of the problems with
| these systems.
| threeseed wrote:
| It doesn't make sense that the solution here is to put
| more load on the user to continually adjust the prompt or
| try different models.
|
| I asked Claude and OpenAI models over 30 times to
| generate code. Both failed every time.
| OutOfHere wrote:
| 4o-mini is cheap, but it is not practically free. At scale it
| will still rack up a cost, although I acknowledge that we are
| currently in the honeymoon phase with it. Computing is the
| kind of thing that we just do more of when it becomes
| cheaper, with the budget being constant.
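|
| For a rough sense of scale: at ~$0.15 per million input tokens
| (4o-mini's launch price, if I remember right), an agent
| pipeline chewing through a billion tokens a day still runs
| ~$150/day.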
| threeseed wrote:
| > 10x better, 10x faster, and 10x cheaper
|
| Which is the elephant in the room.
|
| There is no roadmap for any of these to happen and a strong
| possibility that we will start to see diminishing returns with
| the current LLM implementation and available datasets. At which
| point all of the hype and money will come out of the industry.
| Which in turn will cause a lull in research until the next big
| breakthrough and the cycle repeats.
| Sysreq2 wrote:
| While we have started seeing diminishing returns on rote data
| ingestion, especially with synthetic data leading to
| collapse, there is plenty of other work being done to suggest
| that the field will continue to thrive. Moore's law isn't
| going anywhere for at least a decade - so as we get more
| computing power, faster memory interconnects, and purpose
| built processors, there is no reason to suspect AI is going
| to stagnate. Right now the bottleneck is arguably more
| algorithmic than compute bound anyways. No one will ever need
| more than 640kb of RAM, right?
| thwarted wrote:
| I feel like the GP and this response are a common exchange
| right before the next AI Winter hits.
| __loam wrote:
| https://cap.csail.mit.edu/death-moores-law-what-it-means-
| and...
| threeseed wrote:
| a) It's been widely acknowledged that we are approaching a
| limit on useful datasets.
|
| b) Synthetic data sets have been shown to not be a
| substitute.
|
| c) I have no idea why you are linking Moore's Law with AI.
| Especially when it has never applied to GPUs and we are in
| a situation where we have a single vendor not subject to
| normal competition.
| Agentus wrote:
| I just watched an Andrew Ng video (he's the guy I tend to
| learn the latest prompting, agentic, and visual-agentic
| practices from) saying that hardware companies as well as
| software companies are working on making these improvements
| happen, especially at the inference stage.
| GregOz wrote:
| Can you include a link to Andrew Ng's video, please?
| __loam wrote:
| Strong chance Moore's law stops this decade due to the
| physical limits on the size of atoms lol.
| andreasmetsala wrote:
| I've been hearing that for at least a decade.
| czk wrote:
| I used this to scaffold out 5 HTML pages for a web app, having it
| iterate on building the UX. Did a pretty good job and took about
| 10 minutes of iterating with it, but cost me about $10 in API
| credits which was more than I expected.
| rbren wrote:
| Cost is one of our biggest issues right now. There's a lot we
| can do to mitigate, but we've been focused on getting something
| that works well before optimizing for efficiency.
| orzig wrote:
| I think that's correct - even at a "high" cost (relative to
| what? a random SaaS app, or an hour of a moderately competent
| full-stack dev?), the ROI will already be there for some
| projects. And as prices naturally improve, a larger and
| larger portion of projects will make sense, while we also
| build economies of scale with inference infrastructure.
| skywhopper wrote:
| Please don't give any tools, AI or not, the freedom to run away
| like this. You're inviting a new era of runaway worm-style
| viruses by giving such autonomy to easily manipulated programs.
|
| To what end anyway? This is massively resource heavy, and the end
| goal seems to be to build a program that would end your career.
| Please work on something that will actually make coding easier
| and safer rather than building tools to run roughshod over
| civilization.
| causal wrote:
| I suspect that the pursuit of LLM agents is rooted in falling for
| the illusion of a mind which LLMs so easily weave.
|
| So much of the stuff being built on LLMs in general seems fixated
| on making that illusion more believable.
| rbren wrote:
| This is an interesting take, but I don't think it quite
| captures the idea of "agents".
|
| I prefer to think of agents as _feedback loops_, with an LLM as
| the engine. An agent takes an action in the world, sees the
| results, then takes another action. This is what makes them so
| much more powerful than a raw LLM.
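|
| A minimal sketch of that loop (the llm() stub and a bare shell
| as the whole action space are stand-ins for a real model call
| and real tools):
|
|     import subprocess
|
|     def llm(history: list[str]) -> str:
|         raise NotImplementedError("model call goes here")
|
|     history = ["GOAL: make the test suite pass"]
|     while True:
|         # the model proposes the next shell command
|         action = llm(history)
|         if action == "DONE":
|             break
|         result = subprocess.run(action, shell=True,
|                                 capture_output=True, text=True)
|         # feed the observation back in -- this loop is the "agent"
|         history.append(f"$ {action}\n{result.stdout}{result.stderr}")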
| bofadeez wrote:
| That works only if the LLM has adequate external feedback
| from a terminal and browser in context, along with the past
| trials, etc.
|
| It can't self-correct its own reasoning:
| https://arxiv.org/abs/2310.01798
| causal wrote:
| I think "sees the results" also embeds the idea of a mind. An
| LLM doesn't have a mind to see or plan or think with.
|
| An LLM in a loop creates agency much like a car rolling
| downhill is self-driving.
| Animats wrote:
| Nice.
|
| The "Browsing agent" is a bit worrisome. That can reach outside
| the sandboxed environment. _" At each step, the agent prompts the
| LLM with the task description, browsing action space description,
| current observation of the browser using accessibility tree,
| previous actions, and an action prediction example with chain-of-
| thought reasoning. The expected response from the LLM will
| contain chain-of-thought reasoning plus the predicted next
| actions, including the option to finish the task and convey the
| result to the user."_
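|
| From that description, the prompt presumably gets assembled
| something like this (field names and wording are my guesses,
| not the paper's):
|
|     # rough shape of the browsing agent's prompt, per the
|     # quoted passage; all values here are made up
|     task = "Find the cheapest flight to Tokyo"
|     actions = "click(id), type(id, text), goto(url), finish(answer)"
|     axtree = "[1] link 'Flights' [2] textbox 'Destination' ..."
|     past = "goto('https://flights.example.com')"
|     example = ("Thought: I should search first. "
|                "Action: type(2, 'Tokyo')")
|     prompt = (f"TASK: {task}\nACTIONS: {actions}\n"
|               f"PAGE (accessibility tree): {axtree}\n"
|               f"PREVIOUS: {past}\nEXAMPLE: {example}\n"
|               "Think step by step, then output the next action.")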
|
| How much can that do? Is it smart enough to navigate login and
| signup pages? Can it sign up for a social media account? Buy
| things on Amazon?
___________________________________________________________________
(page generated 2024-08-11 23:00 UTC)