[HN Gopher] Wasting Inferences with Aider
___________________________________________________________________
Wasting Inferences with Aider
Author : Stwerner
Score : 102 points
Date : 2025-04-13 13:36 UTC (9 hours ago)
(HTM) web link (worksonmymachine.substack.com)
(TXT) w3m dump (worksonmymachine.substack.com)
| evertedsphere wrote:
| love to see "Why It Matters" turn into the heading equivalent of
| "delve" in body text (although different in that the latter is a
| legitimate word while the former is a "we need to talk
| about..."-level turn of phrase)
| DeathArrow wrote:
| I don't really think having an agent fleet is a much better
| solution than having a single agent.
|
| We would like to think that having 10 agents working on the same
| task will improve the chances of success 10x.
|
| But I would argue that some classes of problems are hard for LLMs,
| and where one agent fails, 10 agents or 100 agents will fail too.
|
| As an easy example I suggest leetcode hard problems.
| adhamsalama wrote:
| We need The Mythical Man-Month: LLM Edition.
| skeledrew wrote:
| The fleet approach can work well particularly because: 1)
| different models are trained differently, even though using
| mostly the same data (think someone who studied SWE at MIT, vs one
| who studied at Harvard), 2) different agents can be given
| different prompts, which specializes their focus (think coder
| vs reviewer), and 3) the context window content influences the
| result (think someone who's seen the history of implementation
| attempts, vs one seeing a problem for the first time). Put
| those traits in various combinations and the results will be
| very different from a single agent.
| regularfry wrote:
| Nit: it doesn't 10x the chance of success; it turns the chance of
| failure into (the chance of failure)^10.
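|
| (A quick worked check of that, assuming the 10 attempts are
| independent and each fails with probability f:)
|
|   f = 0.4                    # assumed per-attempt failure probability
|   p_any_success = 1 - f**10  # ~0.9999 -- not 10x the single-shot rate
|
| So the fleet helps exactly when individual attempts fail for
| independent, random reasons - and not at all when they share the
| same systematic blind spot.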
| eMPee584 wrote:
| neither, probably
| ghuntley wrote:
| I'm authoring a self-compiling compiler with custom lexical
| tokens via LLM. I'm almost at stage 2, and approximately 50
| "stdlib" concerns have specifications authored for them.
|
| The idea of doing them individually in the IDE is very
| unappealing. Now that the object system, AST, lexer, parser,
| and garbage collection have stabilized, the codebase is at a
| point where fanning out agents makes sense.
|
| As stage 3 nears, it won't make sense to fan out until the
| fundamentals are ready again/stabilised, but at that point,
| I'll need to fan out again.
|
| https://x.com/GeoffreyHuntley/status/1911031587028042185
| joshstrange wrote:
| This is a very interesting idea and I really should consider
| Aider in the "scriptable" sense more; I only use it interactively.
|
| I might add another step after each PR is created where another
| agent (or agents?) reviews and compares the results (maybe have
| the other 2 agents review the first agent's code?).
| Stwerner wrote:
| Thanks, and having another step for reviewing each other's code
| is a really cool extension to this, I'll give it a shot :)
| Whether it works or not, it could be really interesting
| for a future post!
| brookst wrote:
| Wonder if you could have the reviewer characterize any
| mistakes and feed those back into the coding prompt: "be sure
| to... be sure not to..."
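|
| (A minimal sketch of that loop; coding_agent and review_agent are
| hypothetical stand-ins for whatever agent calls you'd actually
| wire up:)
|
|   def build_with_review(task, max_rounds=3):
|       notes = []  # reviewer guidance accumulated across rounds
|       patch = None
|       for _ in range(max_rounds):
|           prompt = task + "".join(f"\nBe sure to: {n}" for n in notes)
|           patch = coding_agent(prompt)          # hypothetical coder
|           critique = review_agent(task, patch)  # hypothetical reviewer
|           if critique.approved:
|               break
|           notes.append(critique.summary)  # the "be sure to..." hint
|       return patch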
| IshKebab wrote:
| We're going to have no traditional programming in 2 years?
| Riiight.
|
| It would also be nice to see a demo where the task was something
| that I couldn't have done myself in essentially no time. Like,
| what happens if you say "tasks should support tags, and you
| should be able to filter/group tasks by tag"?
| Stwerner wrote:
| Gave it a shot real quick, looks like I need to fix something
| up about automatically running the migrations either in the CI
| script or locally...
|
| But if you're curious, task was this:
|
| ----
|
| Title: Bug: Users should be able to add tags to a task to
| categorize them
|
| Description: Users should be able to add multiple tags to a
| task but aren't currently able to.
|
| Given I am a user with multiple tasks When I select one Then I
| should be able to add one or many tags to it
|
| Given I am a user with multiple tasks each with multiple tags
| When I view the list of tasks Then I should be able to see the
| tags associated with each task
|
| ----
|
| And then we ended up with:
|
| GPT-4o ($0.05):
| https://github.com/sublayerapp/buggy_todo_app/pull/51
|
| Claude 3.5 Sonnet ($0.09):
| https://github.com/sublayerapp/buggy_todo_app/pull/52
|
| Gemini 2.0 Flash ($0.0018):
| https://github.com/sublayerapp/buggy_todo_app/pull/53
|
| One thing I've found - I know you had the "...and
| you should be able to filter/group tasks by tag" in the request
| - usually when you have a request that is "feature A AND
| feature B" you get better results when you break it down into
| smaller pieces and apply them one by one. I'm pretty confident
| that if I spent time to get the migrations running, we'd be
| able to build that request out story-by-story as long as we
| break it out into bite-sized pieces.
| IanCal wrote:
| You can have a larger model split things out into more
| manageable steps and create new tickets - marked as blocked
| or not on each other, then have the whole thing run.
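|
| (Roughly what that orchestration could look like; the ticket names
| and run_agent call below are made up for illustration:)
|
|   tickets = {
|       "add-tag-model": {"blocked_by": []},
|       "tag-ui":        {"blocked_by": ["add-tag-model"]},
|       "filter-by-tag": {"blocked_by": ["add-tag-model", "tag-ui"]},
|   }
|   done = set()
|   while len(done) < len(tickets):
|       ready = [t for t, v in tickets.items()
|                if t not in done and set(v["blocked_by"]) <= done]
|       if not ready:
|           break  # cycle or missing dependency
|       for t in ready:
|           run_agent(t)  # hypothetical: hand the ticket to an agent
|           done.add(t)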
| victorbjorklund wrote:
| Wouldn't AI be perfect for those easy tasks? They still take
| time if you wanna do it "properly" with a new branch etc. I get
| lots of "can you change the padding for that component". And
| that is all. Is it easy? Sure. But still takes time to open the
| project, create a new branch, make the change, push the change,
| create a merge request, etc. That probably takes me 10 min.
|
| If I could just let the AI do all of them and just go in and
| check the merge requests and approve them it would save me
| time.
| emorning3 wrote:
| I see 'Waste Inferences' as a form of abductive reasoning.
|
| I see LLMs as a form of inductive reasoning, and so I can see how
| WI could extend LLMs.
|
| Also, I have no doubt that there are problems that can't be
| solved with just an LLM but would need abductive extensions.
|
| Same comments apply to deductive (logical) extensions to LLMs.
| phamilton wrote:
| Sincere question: Has anyone figured out how we're going to code
| review the output of an agent fleet?
| jsheard wrote:
| Insincere answer that will probably be attempted sincerely
| nonetheless: throw even more agents at the problem by having
| them do code review as well. The solution to problems caused by
| AI is always more AI.
| brookst wrote:
| s/AI/tech
| regularfry wrote:
| Technically that's known as "LLM-as-judge" and it's all over
| the literature. The intuition would be that the capability to
| choose between two candidates doesn't exactly overlap with
| the ability to generate either one of them from scratch. It's
| a bit like how (half of) generative adversarial networks
| work.
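|
| (The usual shape of an LLM-as-judge call, sketched with a
| hypothetical complete() helper rather than any particular API:)
|
|   def judge(task, patch_a, patch_b):
|       prompt = (
|           f"Task:\n{task}\n\n"
|           f"Patch A:\n{patch_a}\n\nPatch B:\n{patch_b}\n\n"
|           "Which patch better solves the task? Answer 'A' or 'B' "
|           "and give one sentence of justification."
|       )
|       return complete(prompt)  # hypothetical LLM call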
| fxtentacle wrote:
| You just don't. Choose randomly and then try to quickly sell
| the company. /s
| lsllc wrote:
| Simple, just ask an(other) AI! But seriously, different models
| are better/worse at different tasks, so if you can figure out
| which model is best at evaluating changes, use that for the
| review.
| nchmy wrote:
| sincere question: why would you not be able to code review it
| in the same way you would for humans?
| sensanaty wrote:
| Most of the people pushing this want to just sell an MVP and
| get a big exit before everything collapses, so code review is
| irrelevant.
| danenania wrote:
| Plandex[1] uses a similar "wasteful" approach for file edits
| (note: I'm the creator). It orchestrates a race between diff-
| style replacements plus validation, writing the whole file with
| edits incorporated, and (on the cloud service) a specialized
| model plus validation.
|
| While it sounds wasteful, the calls are all very cheap since most
| of the input tokens are cached, and once a valid result is
| achieved, other in-flight requests are cancelled. It's working
| quite well, allowing for quick results on easy edits with
| fallbacks for more complex changes/large files that don't feel
| incredibly slow.
|
| 1 - https://github.com/plandex-ai/plandex
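|
| (A toy version of that race pattern in asyncio; the strategy
| callables are placeholders, not Plandex's actual implementation:)
|
|   import asyncio
|
|   async def first_valid(strategies, filename, edit):
|       # Start every edit strategy concurrently.
|       tasks = [asyncio.create_task(s(filename, edit)) for s in strategies]
|       try:
|           for fut in asyncio.as_completed(tasks):
|               result = await fut
|               if result is not None:  # non-None means it passed validation
|                   return result
|       finally:
|           for t in tasks:             # cancel whatever is still in flight
|               t.cancel()
|       return None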
| billmalarky wrote:
| I was lucky enough to have a few conversations with Scott a
| month or so ago and he is doing some really compelling work
| around the AISDLC and creating a factory line approach to
| building software. Seriously folks, I recommend following this
| guy closely.
|
| There's another guy in this space I know who's doing similarly
| incredible things but he doesn't really speak about it publicly,
| so I don't want to discuss it w/o his permission. I'm happy to
| make an introduction for those interested - just hmu (check my
| profile for how).
|
| Really excited to see you on the FP of HN Scott!
| fxtentacle wrote:
| For me, a team of junior developers that refuse to learn from
| their mistakes is the fuel of nightmares. I'm stuck in a loop
| where every day I need to explain to a new hire why they made the
| exact same beginner's mistake as the last person on the last day.
| Eventually, I'd rather spend half an hour of my own time than
| explain the problem once more...
|
| Why anyone thinks having 3 different PRs for each Jira ticket
| might boost productivity is beyond me.
|
| Related anime: I May Be a Guild Receptionist, But I'll Solo Any
| Boss to Clock Out on Time
| simonw wrote:
| One of the (many) differences between junior developers and LLM
| assistance is that humans can learn from their mistakes,
| whereas with LLMs it's up to you as the prompter to learn from
| their mistakes.
|
| If an LLM screws something up you can often adjust their prompt
| to avoid that particular problem in the future.
| abc-1 wrote:
| Darn I wonder if systems could be modified so that common
| mistakes become less common or if documentation could be
| written once and read multiple times by different people.
| danielbln wrote:
| We feed it conventions that are automatically loaded for
| every LLM task, so that the LLM adheres to coding style,
| comment style, common project tooling, architecture etc.
|
| These systems don't do online learning, but that doesn't mean
| you can't spoon-feed them what they should know and mutate that
| knowledge over time.
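|
| (With Aider specifically, the common pattern is a conventions file
| pinned into every session - roughly like this, with the contents
| purely illustrative and the flag worth double-checking against the
| docs for your version:)
|
|   # CONVENTIONS.md
|   - Prefer httpx over requests; always pass explicit timeouts.
|   - No one-letter names outside comprehensions.
|   - New modules get a docstring and type hints on public functions.
|
|   $ aider --read CONVENTIONS.md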
| noodletheworld wrote:
| It may not be as stupid as it sounds.
|
| Randomising LLM outputs (temperature) results in outputs that
| will always have some degree of hallucination.
|
| That's just math. You can't mix a random factor in and
| magically expect it to not exist. There will always be
| p(generates random crap) > 0.
|
| However, in any probabilistic system, you can run a function k
| times and you'll get an output distribution that is meaningful
| if k is high enough.
|
| 3 is not high enough.
|
| At 3, this is stupid; all you're observing is random variance.
|
| ...but, _in general_ , running the same prompt multiple times
| and taking some kind of general solution from the distribution
| isn't totally meaningless, I guess.
|
| The thing with LLMs is they scale in a way that actually allows
| this to be possible, in a way that scaling with humans can't.
|
| ... like the monkeys and Shakespeare, there's probably a limit to
| the value it can offer; but it's not totally meaningless to try
| it.
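|
| (The best-of-k version of that idea in a couple of lines; sample()
| and score() are hypothetical stand-ins for an LLM call at nonzero
| temperature and whatever check you trust, e.g. the test suite:)
|
|   def best_of_k(prompt, k=10):
|       candidates = [sample(prompt, temperature=0.8) for _ in range(k)]
|       return max(candidates, key=score)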
| horsawlarway wrote:
| I think this is an interesting idea, but I also somewhat
| suspect you've replaced a tedious problem with a harder, more
| tedious problem.
|
| Take your idea further. Now I've got 100 agents, and 100 PRs,
| and some small percentage of them are decent. The task went
| from "implement a feature" to "review 100 PRs and select the
| best one".
|
| Even assuming you can ditch 50 percent right off the bat as
| trash... Reviewing 50 potentially buggy implementations of a
| feature and selecting the best genuinely sounds worse than
| just writing the solution.
|
| Worse... If you haven't solved the problem before anyways,
| you're woefully unqualified as a reviewer.
| skeledrew wrote:
| There should be test cases run and coverage ensured. This is
| trivially automated. LLMs should also review the PRs, at
| least initially, using the test results as part of the
| input.
| horsawlarway wrote:
| So now either the agent is writing the tests, in which
| case you're right back to the same issue (which tests are
| actually worth running?) or your job is now just writing
| tests (bleh...).
|
| And for the LLM review of the PR... Why do you assume
| it'll be worth any more than the original implementation?
| Or are we just recursing down a level again (if 100 llms
| review each of the 100 PRs... To infinity and beyond!)
|
| This by definition is _not_ trivially automated.
| skeledrew wrote:
| The LLMs can help with the writing of the tests, but you
| should verify that they're testing critical aspects and
| that known edge cases are covered. A single review-prompted
| LLM can then utilize those across the PRs and provide a
| summary recommending acceptance of the best. Or discard them
| all and do it manually; that initial process should only have
| taken a few minutes, so it's minimal wastage in the grand
| scheme of things, given that over time there is a decent
| number of acceptances, compared to the alternative of 100%
| manual effort and the associated time sunk.
| lolinder wrote:
| Who tests the tests? How do you know that the LLM-
| generated tests are actually asserting anything
| meaningful and cover the relevant edge cases?
|
| The tests are part of the code that needs to be reviewed
| in the PR by a human. They don't solve the problem, they
| just add more lines to the reviewer's job.
| CognitiveLens wrote:
| The linked article from Steve Yegge
| (https://sourcegraph.com/blog/revenge-of-the-junior-
| developer) provides a 'solution', which he thinks is also
| imminent - supervisor AI agents, where you might have 100+
| coding agents creating PRs, but then a layer of supervisors
| that are specialized on evaluating quality, and the only
| PRs that a human being would see would be the 'best', as
| determined by the supervisor agent layer.
|
| From my experience with AI agents, this feels intuitively
| possible - current agents seem to be ok (though not yet
| 'great') at critiquing solutions, and such supervisor
| agents could help keep the broader system in alignment.
| regularfry wrote:
| Three is high enough, in my eyes. Two might be. Remember that
| we don't care about any but the best solution. With one
| sample you've only got one 50/50 shot to get above the
| median. With three, the odds of the best of the three being
| above the median are 87.5%.
|
| Of course picking the median as the random crap boundary is
| entirely arbitrary, but it'll do until there's a
| justification for a better number.
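|
| (For the record, the 87.5% is just 1 - 0.5^3: each independent
| sample has a 50% chance of landing below the median, and all three
| must do so for the best of the three to miss.)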
| fn-mote wrote:
| > taking some kind of general solution from the distribution
|
| My instinct is that this should be the temperature 0K
| response (no randomness).
| nico wrote:
| > For me, a team of junior developers that refuse to learn from
| their mistakes is the fuel of nightmares. I'm stuck in a loop
| where every day I need to explain to a new hire why they made
| the exact same
|
| This is a huge opportunity, maybe the next big breakthrough in
| AI when someone figures out how to solve it
|
| Instead of having a model that knows everything, have a model
| that can learn on the go from the feedback it gets from the
| user
|
| Ideally a local model too. So something that runs on my
| computer that I train with my own feedback so that it gets
| better at the tasks I need it to perform
|
| You could also have one at team level, a model that learns from
| the whole team to perform the tasks the team needs it to
| perform
| freeone3000 wrote:
| Continual feedback means continual training. No way around
| it. So you'd have to scope down the functional unit to a
| fairly small LoRA in order to get reasonable re-training
| costs here.
| nico wrote:
| Or maybe figure out a different architecture
|
| Either way, the end user experience would be vastly
| improved
| regularfry wrote:
| That's not quite true. The system prompt is state that you
| can use for "training" in a way that fits the problem here.
| It's not differentiable so you're in slightly alien
| territory, but it's also more comprehensible than gradient-
| descending a bunch of weights.
| sdesol wrote:
| > This is a huge opportunity, maybe the next big breakthrough
| in AI when someone figures out how to solve it
|
| I am not saying I solved it, but I believe we are going to
| experience a paradigm shift in how we program and teach and
| for some, they are really going to hate it. With AI, we can
| now easily capture the thought process for how we solve
| problems, but there is a catch. For this to work, senior
| developers will need to come to terms with the fact that their
| value is not in writing code, but in solving problems.
|
| I would say 98% of my code is now AI generated and I have 0%
| fear that it will make me dumber. I will 100% become less
| proficient in writing code, but my problem solving skills
| will not go away and will only get sharper. In the example
| below, 100% of the code/documentation was AI generated, but I
| still needed to guide Gemini 2.5 Pro
|
| https://app.gitsense.com/?chat=c35f87c5-5b61-4cab-873b-a3988.
| ..
|
| After reviewing the code, it was clear what the problem was
| and since I didn't want to waste tokens and time, I literally
| suggested the implementation and told it to not generate any
| code, but asked it to explain the problem and the solution, as
| shown below.
|
| > The bug is still there. Why does it not use states to to
| track the start @@ and end @@? If you encounter @@ , you can
| do an if else on the line by asking if the line ends with @@.
| If so, you can change the state to expect replacement start
| delimiter. If it does not end with @@ you can set the state
| to expect line to end with @@ and not start with @@. Do you
| understand? Do not generate any code yet.
|
| How I see things evolving over time is, senior developers
| will start to code less and less and the role for junior
| developers will not only be to code but to review
| conversations. As we add new features and fix bugs, we will
| start to link to conversations that Junior developers can
| learn from. The doomsday scenario is, obviously, that with enough
| conversations, we may reach the point where AI can solve most
| problems in one shot.
|
| Full Disclosure: This is my tool
| ghuntley wrote:
| > For this to work, senior developers will need to come to
| terms that their value is not in writing code, but solving
| problems.
|
| This is the key reason behind authoring
| https://ghuntley.com/ngmi - developers that come to terms
| with the new norm will flourish, while the developers who
| don't will struggle in corporate...
| sdesol wrote:
| Nice write-up. I don't necessarily think it is "if you
| adopt it, you will flourish" as much as "if you have this
| type of personality you will easily become 10x, if you
| have this type of personality you will become 2x, and if
| you have this type, you will become 0.5x".
|
| I'm obviously biased, but I believe developers with a
| technical entrepreneur mindset, will see the most
| benefit. This paradigm shift requires the ability to
| properly articulate your thoughts and be able to create
| problem statements for every action. And honestly, not
| everybody can do this.
|
| Obviously, a lot depends on the problems being solved and
| how well trained the LLM is in that person's domain. I
| had Claude and a bunch of other models write my GitSense
| Chat Bridge code which makes it possible to bring Git's
| history into my chat app and it is slow as hell. It works
| most of the time, but it was obvious that the design
| pattern was based on simple CRUD apps. And this is where
| LLMs will literally slow you down and I know this because
| I already solved this problem. The LLM generated chat
| bridge code will be free and open sourced but I will
| charge for my optimized indexing engine.
| bbatchelder wrote:
| Even with human junior devs, ideally you'd maintain some
| documentation about common mistakes/gotchas so that when you
| onboard new people to the team they can read that instead of
| you having to hold their hand manually.
|
| You can do the same thing for LLMs by keeping a file with those
| details available and included in their context.
|
| You can even set up evaluation loops so that entries can be
| made by other agents.
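|
| (A bare-bones version of that evaluation loop, assuming a
| hypothetical review_agent that returns a short lesson string when
| it rejects a patch:)
|
|   from pathlib import Path
|
|   GOTCHAS = Path("LESSONS.md")  # included in every agent's context
|
|   def record_lesson(task, patch):
|       verdict = review_agent(task, patch)  # hypothetical reviewer
|       if not verdict.approved:
|           with GOTCHAS.open("a") as f:
|               f.write(f"- {verdict.lesson}\n")
|       return verdict.approved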
| ghuntley wrote:
| correct.
| wrs wrote:
| I've been using Cursor and Code regularly for a few months now
| and the idea of letting three of them run free on the codebase
| seems insane. The reason for the chat interface is that the agent
| goes off the rails on a regular basis. At least 25% of the time I
| have to hit the stop button and go back to a checkpoint because
| the automatic lawnmower has started driving through the flowerbed
| again. And paradoxically, the more capable the model gets, the
| more likely it seems to get random ideas of how to fix things
| that aren't broken.
| barrell wrote:
| Had a similar experience with Claude Code lately. I got a
| notice some credits were expiring, so I opened up Claude Code
| and asked it to fix all the credo errors in an elixir project
| (style guide enforcement).
|
| I gave it incredibly clear steps of what to run in what
| process, maybe 6 steps, 4 of which were individual severity
| levels.
|
| Within a few minutes it would ask to commit code, create
| branches, run tests, start servers -- always something new,
| none of which were in my instructions. It would also often run
| mix credo, get a list of warnings, deem them unimportant, then
| try to go do its own thing.
|
| It was really cool, I basically worked through 1000 formatting
| errors in 2 hours with $40 of credits (that I would have had no
| use for otherwise).
|
| But man, I can't imagine letting this thing run a single
| command without checking the output
| tekacs wrote:
| So... I know that people frame these sorts of things as if
| it's some kind of quantization conspiracy, but as someone who
| started using Claude Code the _moment_ that it came out, it
| felt particularly strong. Then, it feels like they... tweaked
| something, whether in CC or Sonnet 3.7 and it went a little
| downhill. It's still very impressive, but something was lost.
|
| I've found Gemini 2.5 Pro to be extremely impressive and much
| more able to run in an extended fashion by itself, although
| I've found very high variability in how well 'agent mode'
| works between different editors. Cursor has been very very
| weak in this regard for me, with Windsurf working a little
| better. Claude Code is excellent, but at the moment does feel
| let down by the model.
|
| I've been using Aider with Gemini 2.5 Pro and found that it's
| very much able to 'just go' by itself. I shipped a mode for
| Aider that lets it do so (sibling comment here) and I've had
| it do some huge things that run for an hour or more, but
| assuredly it does get stuck and act stupidly on other tasks
| as well.
|
| My point, more than anything, is that... I'd try different
| editors and different (stronger) models and see - and that
| small tweaks to prompt and tooling are making a big
| difference to these tools' effectiveness right now. Also,
| different models seem to excel at different problems, so
| switching models is often a good choice.
| barrell wrote:
| Eh, I am happy waiting many years before any of that. If it
| only works right with the right model for the right job, and
| it's very fuzzy which models work for which tasks, and the
| models change all the time (oftentimes silently)... at
| some point it's just easier to do the easy task I'm trying
| to offload than to juggle all of this.
|
| If and when I go about trying these tools in the future,
| I'll probably look for an open source TUI, so keep up the
| great work on Aider!
| sdesol wrote:
| > I've had it do some huge things that run for an hour or
| more,
|
| Can you clarify this? If I am reading this right, you let
| the llm think/generate output for an hour? This seems
| bonkers to me.
| tekacs wrote:
| Over the last two days, I've built out support for autonomy in
| Aider (a lot like Claude Code) that hybridizes with the rest of
| the app:
|
| https://github.com/Aider-AI/aider/pull/3781
|
| Edit: In case anyone wants to try it, I uploaded it to PyPI as
| `navigator-mode`, until (and if!) the PR is accepted. By I, I
| mean that it uploaded itself. You can see the session where it
| did that here: https://asciinema.org/a/9JtT7DKIRrtpylhUts0lr3EfY
|
| Edit 2: And as a Show HN, too:
| https://news.ycombinator.com/item?id=43674180
|
| and, because Aider's already an amazing platform without the
| autonomy, it's very easy to use the rest of Aider's options, like
| using `/ask` first, using `/code` or `/architect` for specific
| tasks [1], but if you start in `/navigator` mode (which I built,
| here), you can just... ask for a particular task to be done
| and... wait and it'll often 'just get done'.
|
| It's... decidedly expensive to run an LLM this way right now
| (Gemini 2.5 Pro is your best bet), but if it's $N today, I don't
| doubt that it'll be $0.N by next year.
|
| I don't mean to speak in meaningless hype, but I think that a lot
| of folks who are speaking to LLMs' 'inability' to do things are
| also spending relatively cautiously on them, when tomorrow's
| capabilities are often here, just pricey.
|
| I'm definitely still intervening as it goes (as in the Devin
| demos, say), but I'm also having LLMs relatively autonomously
| build out large swathes of functionality, the kind that I would
| put off or avoid without them. I wouldn't call it a programmer-
| replacement any time soon (it feels far from that), but I'm solo
| finishing architectures now that I know how to build, but where
| delegating them to a team of senior devs would've resulted in
| chaos.
|
| [1]: also for anyone who hasn't tried it and doesn't like TUI, do
| note that Aider has a web mode and a 'watch mode', where you can
| use your normal editor and if you leave a comment like '# make
| this darker ai!', Aider will step in and apply the change. This
| is even fancier with navigator/autonomy.
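|
| (For anyone who hasn't seen watch mode: you start Aider watching
| the repo and keep editing in your own editor; the flag below is
| from Aider's docs, but double-check it against your version:)
|
|   $ aider --watch-files
|
|   # then, in e.g. theme.py, leave a marker comment:
|   SIDEBAR_BG = "#f0f0f0"  # make this darker, AI!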
| nico wrote:
| > It's... decidedly expensive to run an LLM this way right now
|
| Does it work ok with local models? Something like the quantized
| deepseeks, gemma3 or llamas?
| tekacs wrote:
| It does for me, yes -- models seem to be pretty capable of
| adhering to the tool call format, which is really all that
| they 'need' in order to do a good job.
|
| I'm still tweaking the prompts (and I've introduced a new,
| tool-call based edit format as a primary replacement to
| Aider's usual SEARCH/REPLACE, which is both easier and harder
| for LLMs to use - but it allows them to better express e.g.
| 'change the name of this function').
|
| So... if you have any trouble with it, I would adjust the
| prompts (in `navigator_prompts.py` and
| `navigator_legacy_prompts.py` for non-tool-based editing). In
| particular when I adopted more 'terseness and proactively
| stop' prompting, weaker LLMs started stopping prematurely
| more often. It's helpful for powerful thinking models (like
| Sonnet and Gemini 2.5 Pro), but for smaller models I might
| need to provide an extra set of prompts that let them roam
| more.
| regularfry wrote:
| Since you've got the aider hack session going...
|
| One thing I've had in the back of my brain for a few days is
| the idea of LLM-as-a-judge over a multi-armed bandit, testing
| out local models. Locally, if you aren't too fussy about how
| long things take, you can spend all the tokens you want.
| Running head-to-head comparisons is slow, but with a MAB you're
| not doing so for every request. Nine times out of ten it's the
| normal request cycle. You could imagine having new models get
| mixed in as and when they become available, able to take over
| if they're genuinely better, entirely behind the scenes. You
| don't need to manually evaluate them at that point.
|
| I don't know how well that gels with aider's modes; it feels
| like you want to be able to specify a judge model but then have
| it control the other models itself. I don't know if that's
| better within aider itself (so it's got access to the added
| files to judge a candidate solution against, and can directly
| see the evaluation) or as an API layer between aider and the
| vllm/ollama/llama-server/whatever service, with the
| complication of needing to feed scores out of aider to stoke
| the MAB.
|
| You could extend the idea to generating and comparing system
| prompts. That might be worthwhile but it feels more like
| tinkering at the edges.
|
| Does any of that sound feasible?
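|
| (Sketching just the bandit side of that and leaving the judge
| abstract - plain epsilon-greedy over model names, nothing
| Aider-specific:)
|
|   import random
|
|   class ModelBandit:
|       def __init__(self, models, eps=0.1):
|           self.stats = {m: [0.0, 0] for m in models}  # [reward sum, pulls]
|           self.eps = eps
|
|       def pick(self):
|           if random.random() < self.eps:
|               return random.choice(list(self.stats))
|           return max(self.stats, key=lambda m: (
|               self.stats[m][0] / self.stats[m][1]
|               if self.stats[m][1] else 1.0))
|
|       def update(self, model, reward):  # judge's score, 0..1
|           s = self.stats[model]
|           s[0] += reward
|           s[1] += 1
|
| Nine requests out of ten you'd just call pick() and serve the
| answer; on the tenth you'd run the head-to-head and feed the
| judge's score back in via update().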
| tekacs wrote:
| It's funny you say this! I was adding a tool just earlier
| (that I haven't yet pushed) that allows the model to...
| switch model.
|
| Aider can also have multiple models active at any time (the
| architect, editor and weak model is the standard set) and use
| them for different aspects. I could definitely imagine
| switching one model whilst leaving another active.
|
| So yes, this definitely seems feasible.
|
| Aider had a fairly coherent answer to this question, I think:
| https://gist.github.com/tekacs/75a0e3604bc10ea88f9df9a909b5d.
| ..
|
| This was navigator mode + Gemini 2.5 Pro's attempt at
| implementing it, based only on pasting in your comment:
|
| https://asciinema.org/a/EKhno9vQlqk9VkYizIxsY8mIr
|
| https://github.com/tekacs/aider/commit/6b8b76375a9b43f9db785.
| ..
|
| I think it did a fairly good job! It took just a couple of
| minutes and it effectively just switches the main model based
| on recent input, but I don't doubt that this could become
| really robust if I had poked or prompted it further with
| preferences, ideas, beliefs and pushback! I imagine that you
| could very quickly get it there if you wished.
|
| It's definitely not showing off the most here, because it's
| almost all direct-coding, very similar to ordinary Aider. :)
| gandalfgeek wrote:
| Very cool. Even cooler to see it upload itself!!
| aqme28 wrote:
| It's cute but I don't see the benefit. In my experience, if one
| LLM fails to solve a problem, the other ones won't be too
| different.
|
| If you picked a problem where LLMs are good, now you have to
| review 3 PRs instead of just 1. If you picked a problem where
| they're bad, now you have 3 failures.
|
| I think there are not many cases where throwing more attempts at
| the problem is useful.
| denidoman wrote:
| The current challenge is not to create a patch, but to verify it.
|
| Testing a fix in a big application is a very complex task. First
| of all, you have to reproduce the issue, to verify the steps (or
| create them, because many issues don't contain a clear
| description). Then you should switch to the fixed version and
| make sure that the issue doesn't exist. Finally, you should
| apply a little exploratory testing to make sure that the fix
| hasn't corrupted neighbouring logic (deep application knowledge
| is required to perform this).
|
| To perform these steps you have to deploy staging with the
| original/fixed versions or run everything locally and do pre-
| setup (create users, entities, etc. to achieve the corrupted
| state).
|
| This is a very challenging area for the current agents. Right now
| they just can't do these steps - their mental models just aren't
| ready for such a level of integration into the app and infra. And
| the creation of 3/5/10/100 unverified pull requests just slows
| down the software development process.
| gandalfgeek wrote:
| There is no fundamental blocker to agents doing all those
| things. Mostly a matter of constructing the right tools and
| grounding, which can be a fair amount of up-front work. Arming
| LLMs with the right tools and documentation got us this far.
| There's no reason to believe that path is exhausted.
| dimitri-vs wrote:
| Have you tried building agents? They will go from PhD level
| smart to making mistakes a middle schooler would find
| obvious, even on models like gemini-2.5 and o1-pro. It's
| almost like building a sandcastle where once you get a prompt
| working you become afraid to make any changes because
| something else will break.
| sdesol wrote:
| > Have you tried building agents?
|
| I think the issue right now is so many people want to
| believe in the moonshot and are investing heavily in it,
| when the reality is we should be focusing on the home runs.
| LLMs are a game changer, but there is still A LOT of
| tooling that can be created to make it easier to integrate
| humans in the loop.
| tough wrote:
| you can just even tell cursor to use any cli tools you use
| normally in your development, like git, gh, railway, vercel,
| node debugging, etc etc
| denidoman wrote:
| Tools aren't the problem. Knowledge is.
| ghuntley wrote:
| Correct! Over at https://ghuntley.com/mcp I propose that each
| company develop their own tools for their particular
| codebase that shape how LLMs act when working with that
| codebase.
| denidoman wrote:
| Look at this 18-year-old Django ticket:
| https://code.djangoproject.com/ticket/4140
|
| It wasn't impossible to fix, but it required some experiments
| and deep research into very specific behaviors.
|
| Or this ticket: https://code.djangoproject.com/ticket/35289
|
| The author proposed a one-line solution, but the following
| discussion includes analysis of the RFC, potential negative
| outcomes, and different ways to fix it.
|
| And without a deep understanding of the project, it's not
| clear how to fix it properly without damaging backward
| compatibility and neighboring functionality.
|
| Also, such a fix must be properly tested manually, because
| even well-designed autotests do not 100% match the actual
| flow.
|
| You can explore other open and closed issues and
| corresponding discussions. And this is the complexity level
| of real software, not pet projects or simple apps.
|
| I guess the existing attention mechanism is the fundamental
| blocker, because it's barely able to process all the context
| required for a fix.
|
| And feature requests are much, much more complex.
| ghurtado wrote:
| All the things you describe are already being done by any team
| with a modern CI/CD workflow, and none of it requires AI.
|
| At my last job, all of those steps were automated and required
| exactly zero human input.
| denidoman wrote:
| Are you sure about "all"? Because I mentioned not only env
| deployment, but also functional issue reproduction using the
| UI/API, which also requires the necessary pre-setup.
|
| Automated tests partially solve the case, but in the real world
| no one writes tests blindly. It's always manual work, and only
| when the failing trajectory is clear is the test written.
|
| Theoretically an agent can interact with the UI or API. But that
| requires deep project understanding, gathered from code,
| documentation, git history, tickets, Slack. And obtaining
| this context, building an easily accessible knowledge base,
| and pouring only the necessary parts into the agent context is
| still an unsolved task.
| lherron wrote:
| I love this! I have a similar automation for moving a feature
| through ideation/requirements/technical design, but I usually
| dump the result into Cursor for last mile and to save on
| inference. Seeing the cost analysis is eye opening.
|
| There's probably also some upside to running the same model
| multiple times. I find Sonnet will sometimes fail, I'll roll back
| and try again with the same prompt but a clean context, and it
| will succeed.
| ghuntley wrote:
| re: cost analysis
|
| There's something cooked about Windsurf's/Cursor's go-to-market
| pricing - there's no way they are turning a profit at
| $50/month. $50/month gets you a happy meal experience. If you
| want more power, you gotta ditch snacking at McDonald's.
|
| In the future, companies should budget $100 USD to $500 USD per
| day, per dev, on tokens as the new normal for business, which
| is circa $25k USD (low end) to $50k USD (likely) to $127k USD
| (highest) per year.
|
| Above from https://ghuntley.com/redlining/
|
| This napkin math is based upon my current spend in bringing a
| self-compiled compiler to life.
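|
| (At roughly 250 working days a year, that napkin math checks out:
| $100/day is ~$25k, $200/day is ~$50k, and $500/day lands a little
| over $125k.)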
| precompute wrote:
| Feels like a way to live with a bad decision rather than getting
| rid of it.
| pton_xd wrote:
| The trend with LLMs so far has been: if you have an issue with
| the AI, wait 6 months for a more advanced model. Cobbling
| together workarounds for their deficiencies is basically a waste
| of effort.
| KTibow wrote:
| I wonder if using thinking models would work better here. They
| generally have less variance and consider more options, which
| could achieve the same goal.
| canterburry wrote:
| I wouldn't be surprised if someone tries to leverage this with
| their customer feature request tool.
|
| Imagine having your customers write feature requests for your
| SaaS that immediately trigger code generation and a PR. A
| virtual environment with that PR is spun up and served to that
| customer for feedback and refinement. Loop until the customer has
| implemented the feature they would like to see in your product.
|
| Enterprise plan only, obviously.
| kgeist wrote:
| I've noticed that large models from different vendors often end
| up converging on more or less the same ideas (probably because
| they're trained on more or less the same data). A few days ago, I
| asked both Grok and ChatGPT to produce several stories with an
| absurd twist, and they consistently generated the same twists,
| differing only in minor details. Often, they even used identical
| wording!
|
| Is there any research into this phenomenon? Is code generation
| any different? Isn't there a chance that several "independent"
| models might produce the same (say, faulty) result?
___________________________________________________________________
(page generated 2025-04-13 23:00 UTC)