[HN Gopher] Wasting Inferences with Aider
___________________________________________________________________
Wasting Inferences with Aider
Author : Stwerner
Score : 102 points
Date : 2025-04-13 13:36 UTC (9 hours ago)
(HTM) web link (worksonmymachine.substack.com)
(TXT) w3m dump (worksonmymachine.substack.com)
| evertedsphere wrote:
| love to see "Why It Matters" turn into the heading equivalent of
| "delve" in body text (although different in that the latter is a
| legitimate word while the former is a "we need to talk
| about..."-level turn of phrase)
| DeathArrow wrote:
| I don't really think having an agent fleet is a much better
| solution than having a single agent.
|
| We would like to think that having 10 agents working on the same
| task will improve the chances of success 10x.
|
| But I would argue that some classes of problems are hard for LLMs,
| and where one agent fails, 10 agents or 100 agents will fail too.
|
| As an easy example I suggest leetcode hard problems.
| adhamsalama wrote:
| We need The Mythical Man-Month: LLM Edition.
| skeledrew wrote:
| The fleet approach can work well particularly because: 1)
| different models are trained differently, even though using
| mostly the same data (think someone who studied SWE at MIT, vs one
| who studied at Harvard), 2) different agents can be given
| different prompts, which specializes their focus (think coder
| vs reviewer), and 3) the context window content influences the
| result (think someone who's seen the history of implementation
| attempts, vs one seeing a problem for the first time). Put
| those traits in various combinations and the results will be
| very different from a single agent.
| regularfry wrote:
| Nit: it doesn't 10x the chance of success; it turns the chance of
| failure into (the chance of failure)^10.
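|
| (A quick worked check of that, assuming the 10 attempts are
| independent and each fails with probability f:)
|
|   f = 0.4                    # assumed per-attempt failure probability
|   p_any_success = 1 - f**10  # ~0.9999 -- not 10x the single-shot rate
|
| So the fleet helps exactly when individual attempts fail for
| independent, random reasons - and not at all when they share the
| same systematic blind spot.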
| eMPee584 wrote:
| neither, probably
| ghuntley wrote:
| I'm authoring a self-compiling compiler with custom lexical
| tokens via LLM. I'm almost at stage 2, and approximately 50
| "stdlib" concerns have specifications authored for them.
|
| The idea of doing them individually in the IDE is very
| unappealing. Now that the object system, AST, lexer, parser,
| and garbage collection have stabilized, the codebase is at a
| point where fanning out agents makes sense.
|
| As stage 3 nears, it won't make sense to fan out until the
| fundamentals are ready again/stabilised, but at that point,
| I'll need to fan out again.
|
| https://x.com/GeoffreyHuntley/status/1911031587028042185
| joshstrange wrote:
| This is a very interesting idea and I really should consider
| Aider in the "scriptable" sense more; I only use it interactively.
|
| I might add another step after each PR is created where another
| agent (or agents?) reviews and compares the results (maybe have
| the other 2 agents review the first agent's code?).
| Stwerner wrote:
| Thanks, and having another step for reviewing each other's code
| is a really cool extension to this, I'll give it a shot :)
| Whether it works or not, it could be really interesting
| for a future post!
| brookst wrote:
| Wonder if you could have the reviewer characterize any
| mistakes and feed those back into the coding prompt: "be sure
| to... be sure not to..."
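|
| (A minimal sketch of that loop; coding_agent and review_agent are
| hypothetical stand-ins for whatever agent calls you'd actually
| wire up:)
|
|   def build_with_review(task, max_rounds=3):
|       notes = []  # reviewer guidance accumulated across rounds
|       patch = None
|       for _ in range(max_rounds):
|           prompt = task + "".join(f"\nBe sure to: {n}" for n in notes)
|           patch = coding_agent(prompt)          # hypothetical coder
|           critique = review_agent(task, patch)  # hypothetical reviewer
|           if critique.approved:
|               break
|           notes.append(critique.summary)  # the "be sure to..." hint
|       return patch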
| IshKebab wrote:
| We're going to have no traditional programming in 2 years?
| Riiight.
|
| It would also be nice to see a demo where the task was something
| that I couldn't have done myself in essentially no time. Like,
| what happens if you say "tasks should support tags, and you
| should be able to filter/group tasks by tag"?
| Stwerner wrote:
| Gave it a shot real quick, looks like I need to fix something
| up about automatically running the migrations either in the CI
| script or locally...
|
| But if you're curious, task was this:
|
| ----
|
| Title: Bug: Users should be able to add tags to a task to
| categorize them
|
| Description: Users should be able to add multiple tags to a
| task but aren't currently able to.
|
| Given I am a user with multiple tasks When I select one Then I
| should be able to add one or many tags to it
|
| Given I am a user with multiple tasks each with multiple tags
| When I view the list of tasks Then I should be able to see the
| tags associated with each task
|
| ----
|
| And then we ended up with:
|
| GPT-4o ($0.05):
| https://github.com/sublayerapp/buggy_todo_app/pull/51
|
| Claude 3.5 Sonnet ($0.09):
| https://github.com/sublayerapp/buggy_todo_app/pull/52
|
| Gemini 2.0 Flash ($0.0018):
| https://github.com/sublayerapp/buggy_todo_app/pull/53
|
| One thing I've found - I know you had the "...and
| you should be able to filter/group tasks by tag" in the request
| - usually when you have a request that is "feature A AND
| feature B" you get better results when you break it down into
| smaller pieces and apply them one by one. I'm pretty confident
| that if I spent time to get the migrations running, we'd be
| able to build that request out story-by-story as long as we
| break it out into bite-sized pieces.
| IanCal wrote:
| You can have a larger model split things out into more
| manageable steps and create new tickets - marked as blocked
| or not on each other, then have the whole thing run.
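|
| (Roughly what that orchestration could look like; the ticket names
| and run_agent call below are made up for illustration:)
|
|   tickets = {
|       "add-tag-model": {"blocked_by": []},
|       "tag-ui":        {"blocked_by": ["add-tag-model"]},
|       "filter-by-tag": {"blocked_by": ["add-tag-model", "tag-ui"]},
|   }
|   done = set()
|   while len(done) < len(tickets):
|       ready = [t for t, v in tickets.items()
|                if t not in done and set(v["blocked_by"]) <= done]
|       if not ready:
|           break  # cycle or missing dependency
|       for t in ready:
|           run_agent(t)  # hypothetical: hand the ticket to an agent
|           done.add(t)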
| victorbjorklund wrote:
| Wouldn't AI be perfect for those easy tasks? They still take
| time if you wanna do it "properly" with a new branch etc. I get
| lots of "can you change the padding for that component". And
| that is all. Is it easy? Sure. But still takes time to open the
| project, create a new branch, make the change, push the change,
| create a merge request, etc. That probably takes me 10 min.
|
| If I could just let the AI do all of them and just go in and
| check the merge requests and approve them it would save me
| time.
| emorning3 wrote:
| I see 'Waste Inferences' as a form of abductive reasoning.
|
| I see LLMs as a form of inductive reasoning, and so I can see how
| WI could extend LLMs.
|
| Also, I have no doubt that there are problems that can't be
| solved with just an LLM but would need abductive extensions.
|
| Same comments apply to deductive (logical) extensions to LLMs.
| phamilton wrote:
| Sincere question: Has anyone figured out how we're going to code
| review the output of an agent fleet?
| jsheard wrote:
| Insincere answer that will probably be attempted sincerely
| nonetheless: throw even more agents at the problem by having
| them do code review as well. The solution to problems caused by
| AI is always more AI.
| brookst wrote:
| s/AI/tech
| regularfry wrote:
| Technically that's known as "LLM-as-judge" and it's all over
| the literature. The intuition would be that the capability to
| choose between two candidates doesn't exactly overlap with
| the ability to generate either one of them from scratch. It's
| a bit like how (half of) generative adversarial networks
| work.
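|
| (The usual shape of an LLM-as-judge call, sketched with a
| hypothetical complete() helper rather than any particular API:)
|
|   def judge(task, patch_a, patch_b):
|       prompt = (
|           f"Task:\n{task}\n\n"
|           f"Patch A:\n{patch_a}\n\nPatch B:\n{patch_b}\n\n"
|           "Which patch better solves the task? Answer 'A' or 'B' "
|           "and give one sentence of justification."
|       )
|       return complete(prompt)  # hypothetical LLM call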
| fxtentacle wrote:
| You just don't. Choose randomly and then try to quickly sell
| the company. /s
| lsllc wrote:
| Simple, just ask an(other) AI! But seriously, different models
| are better/worse at different tasks, so if you can figure out
| which model is best at evaluating changes, use that for the
| review.
| nchmy wrote:
| sincere question: why would you not be able to code review it
| in the same way you would for humans?
| sensanaty wrote:
| Most of the people pushing this want to just sell an MVP and
| get a big exit before everything collapses, so code review is
| irrelevant.
| danenania wrote:
| Plandex[1] uses a similar "wasteful" approach for file edits
| (note: I'm the creator). It orchestrates a race between diff-
| style replacements plus validation, writing the whole file with
| edits incorporated, and (on the cloud service) a specialized
| model plus validation.
|
| While it sounds wasteful, the calls are all very cheap since most
| of the input tokens are cached, and once a valid result is
| achieved, other in-flight requests are cancelled. It's working
| quite well, allowing for quick results on easy edits with
| fallbacks for more complex changes/large files that don't feel
| incredibly slow.
|
| 1 - https://github.com/plandex-ai/plandex
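|
| (A toy version of that race pattern in asyncio; the strategy
| callables are placeholders, not Plandex's actual implementation:)
|
|   import asyncio
|
|   async def first_valid(strategies, filename, edit):
|       # Start every edit strategy concurrently.
|       tasks = [asyncio.create_task(s(filename, edit)) for s in strategies]
|       try:
|           for fut in asyncio.as_completed(tasks):
|               result = await fut
|               if result is not None:  # non-None means it passed validation
|                   return result
|       finally:
|           for t in tasks:             # cancel whatever is still in flight
|               t.cancel()
|       return None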
| billmalarky wrote:
| I was lucky enough to have a few conversations with Scott a
| month or so ago and he is doing some really compelling work
| around the AISDLC and creating a factory line approach to
| building software. Seriously folks, I recommend following this
| guy closely.
|
| There's another guy in this space I know who's doing similarly
| incredible things but he doesn't really speak about it publicly,
| so I don't want to discuss it w/o his permission. I'm happy to
| make an introduction for those interested - just hmu (check my
| profile for how).
|
| Really excited to see you on the FP of HN Scott!
| fxtentacle wrote:
| For me, a team of junior developers that refuse to learn from
| their mistakes is the fuel of nightmares. I'm stuck in a loop
| where every day I need to explain to a new hire why they made the
| exact same beginner's mistake as the last person on the last day.
| Eventually, I'd rather spend half an hour of my own time than
| explain the problem once more...
|
| Why anyone thinks having 3 different PRs for each Jira ticket
| might boost productivity is beyond me.
|
| Related anime: I May Be a Guild Receptionist, But I'll Solo Any
| Boss to Clock Out on Time
| simonw wrote:
| One of the (many) differences between junior developers and LLM
| assistance is that humans can learn from their mistakes,
| whereas with LLMs it's up to you as the prompter to learn from
| their mistakes.
|
| If an LLM screws something up you can often adjust their prompt
| to avoid that particular problem in the future.
| abc-1 wrote:
| Darn I wonder if systems could be modified so that common
| mistakes become less common or if documentation could be
| written once and read multiple times by different people.
| danielbln wrote:
| We feed it conventions that are automatically loaded for
| every LLM task, so that the LLM adheres to coding style,
| comment style, common project tooling, architecture etc.
|
| These systems don't do online learning, but that doesn't mean
| you can't spoon-feed them what they should know and mutate that
| knowledge over time.
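|
| (With Aider specifically, the common pattern is a conventions file
| pinned into every session - roughly like this, with the contents
| purely illustrative and the flag worth double-checking against the
| docs for your version:)
|
|   # CONVENTIONS.md
|   - Prefer httpx over requests; always pass explicit timeouts.
|   - No one-letter names outside comprehensions.
|   - New modules get a docstring and type hints on public functions.
|
|   $ aider --read CONVENTIONS.md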
| noodletheworld wrote:
| It may not be as stupid as it sounds.
|
| Randomising LLM outputs (temperature) results in outputs that
| will always have some degree of hallucination.
|
| That's just math. You can't mix a random factor in and
| magically expect it to not exist. There will always be
| p(generates random crap) > 0.
|
| However, in any probabilistic system, you can run a function k
| times and you'll get an output distribution that is meaningful
| if k is high enough.
|
| 3 is not high enough.
|
| At 3, this is stupid; all you're observing is random variance.
|
| ...but, _in general_ , running the same prompt multiple times
| and taking some kind of general solution from the distribution
| isn't totally meaningless, I guess.
|
| The thing with LLMs is they scale in a way that actually allows
| this to be possible, in a way that scaling with humans can't.
|
| ... like the monkeys and Shakespeare, there's probably a limit to
| the value it can offer; but it's not totally meaningless to try
| it.
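|
| (The best-of-k version of that idea in a couple of lines; sample()
| and score() are hypothetical stand-ins for an LLM call at nonzero
| temperature and whatever check you trust, e.g. the test suite:)
|
|   def best_of_k(prompt, k=10):
|       candidates = [sample(prompt, temperature=0.8) for _ in range(k)]
|       return max(candidates, key=score)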
| horsawlarway wrote:
| I think this is an interesting idea, but I also somewhat
| suspect you've replaced a tedious problem with a harder, more
| tedious problem.
|
| Take your idea further. Now I've got 100 agents, and 100 PRs,
| and some small percentage of them are decent. The task went
| from "implement a feature" to "review 100 PRs and select the
| best one".
|
| Even assuming you can ditch 50 percent right off the bat as
| trash... Reviewing 50 potentially buggy implementations of a
| feature and selecting the best genuinely sounds worse than
| just writing the solution.
|
| Worse... If you haven't solved the problem before anyways,
| you're woefully unqualified as a reviewer.
| skeledrew wrote:
| There should be test cases run and coverage ensured. This is
| trivially automated. LLMs should also review the PRs, at
| least initially, using the test results as part of the
| input.
| horsawlarway wrote:
| So now either the agent is writing the tests, in which
| case you're right back to the same issue (which tests are
| actually worth running?) or your job is now just writing
| tests (bleh...).
|
| And for the LLM review of the PR... Why do you assume
| it'll be worth any more than the original implementation?
| Or are we just recursing down a level again (if 100 llms
| review each of the 100 PRs... To infinity and beyond!)
|
| This by definition is _not_ trivially automated.
| skeledrew wrote:
| The LLMs can help with the writing of the tests, but you
| should verify that they're testing critical aspects and
| that known edge cases are covered. A single review-prompted
| LLM can then utilize those across the PRs and provide a
| summary recommending acceptance of the best. Or discard them
| all and do it manually; that initial process should only have
| taken a few minutes, so it's minimal wastage in the grand
| scheme of things, given that over time there is a decent
| number of acceptances, compared to the alternative of 100%
| manual effort and the associated time sunk.
| lolinder wrote:
| Who tests the tests? How do you know that the LLM-
| generated tests are actually asserting anything
| meaningful and cover the relevant edge cases?
|
| The tests are part of the code that needs to be reviewed
| in the PR by a human. They don't solve the problem, they
| just add more lines to the reviewer's job.
| CognitiveLens wrote:
| The linked article from Steve Yegge
| (https://sourcegraph.com/blog/revenge-of-the-junior-
| developer) provides a 'solution', which he thinks is also
| imminent - supervisor AI agents, where you might have 100+
| coding agents creating PRs, but then a layer of supervisors
| that are specialized on evaluating quality, and the only
| PRs that a human being would see would be the 'best', as
| determined by the supervisor agent layer.
|
| From my experience with AI agents, this feels intuitively
| possible - current agents seem to be ok (though not yet
| 'great') at critiquing solutions, and such supervisor
| agents could help keep the broader system in alignment.
| regularfry wrote:
| Three is high enough, in my eyes. Two might be. Remember that
| we don't care about any but the best solution. With one
| sample you've only got one 50/50 shot to get above the
| median. With three, the odds of the best of the three being
| above the median are 87.5%.
|
| Of course picking the median as the random crap boundary is
| entirely arbitrary, but it'll do until there's a
| justification for a better number.
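|
| (For the record, the 87.5% is just 1 - 0.5^3: each independent
| sample has a 50% chance of landing below the median, and all three
| must do so for the best of the three to miss.)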
| fn-mote wrote:
| > taking some kind of general solution from the distribution
|
| My instinct is that this should be the temperature 0K
| response (no randomness).
| nico wrote:
| > For me, a team of junior developers that refuse to learn from
| their mistakes is the fuel of nightmares. I'm stuck in a loop
| where every day I need to explain to a new hire why they made
| the exact same
|
| This is a huge opportunity, maybe the next big breakthrough in
| AI when someone figures out how to solve it
|
| Instead of having a model that knows everything, have a model
| that can learn on the go from the feedback it gets from the
| user
|
| Ideally a local model too. So something that runs on my
| computer that I train with my own feedback so that it gets
| better at the tasks I need it to perform
|
| You could also have one at team level, a model that learns from
| the whole team to perform the tasks the team needs it to
| perform
| freeone3000 wrote:
| Continual feedback means continual training. No way around
| it. So you'd have to scope down the functional unit to a
| fairly small LoRA in order to get reasonable re-training
| costs here.
| nico wrote:
| Or maybe figure out a different architecture
|
| Either way, the end user experience would be vastly
| improved
| regularfry wrote:
| That's not quite true. The system prompt is state that you
| can use for "training" in a way that fits the problem here.
| It's not differentiable so you're in slightly alien
| territory, but it's also more comprehensible than gradient-
| descending a bunch of weights.
| sdesol wrote:
| > This is a huge opportunity, maybe the next big breakthrough
| in AI when someone figures out how to solve it
|
| I am not saying I solved it, but I believe we are going to
| experience a paradigm shift in how we program and teach and
| for some, they are really going to hate it. With AI, we can
| now easily capture the thought process for how we solve
| problems, but there is a catch. For this to work, senior
| developers will need to come to terms with the fact that their
| value is not in writing code, but in solving problems.
|
| I would say 98% of my code is now AI generated and I have 0%
| fear that it will make me dumber. I will 100% become less
| proficient in writing code, but my problem solving skills
| will not go away and will only get sharper. In the example
| below, 100% of the code/documentation was AI generated, but I
| still needed to guide Gemini 2.5 Pro
|
| https://app.gitsense.com/?chat=c35f87c5-5b61-4cab-873b-a3988.
| ..
|
| After reviewing the code, it was clear what the problem was
| and since I didn't want to waste tokens and time, I literally
| suggested the implementation and told it to not generate any
| code, but asked it to explain the problem and the solution, as
| shown below.
|
| > The bug is still there. Why does it not use states to to
| track the start @@ and end @@? If you encounter @@ , you can
| do an if else on the line by asking if the line ends with @@.
| If so, you can change the state to expect replacement start
| delimiter. If it does not end with @@ you can set the state
| to expect line to end with @@ and not start with @@. Do you
| understand? Do not generate any code yet.
|
| How I see things evolving over time is, senior developers
| will start to code less and less and the role for junior
| developers will not only be to code but to review
| conversations. As we add new features and fix bugs, we will
| start to link to conversations that Junior developers can
| learn from. The doomsday scenario is, obviously, that with enough
| conversations, we may reach the point where AI can solve most
| problems in one shot.
|
| Full Disclosure: This is my tool
| ghuntley wrote:
| > For this to work, senior developers will need to come to
| terms that their value is not in writing code, but solving
| problems.
|
| This is the key reason behind authoring
| https://ghuntley.com/ngmi - developers that come to terms
| with the new norm will flourish, while the developers who
| don't will struggle in corporate...
| sdesol wrote:
| Nice write-up. I don't necessarily think it is "if you
| adopt it, you will flourish" as much as "if you have this
| type of personality you will easily become 10x, if you
| have this type of personality you will become 2x, and if
| you have this type, you will become 0.5x".
|
| I'm obviously biased, but I believe developers with a
| technical entrepreneur mindset, will see the most
| benefit. This paradigm shift requires the ability to
| properly articulate your thoughts and be able to create
| problem statements for every action. And honestly, not
| everybody can do this.
|
| Obviously, a lot depends on the problems being solved and
| how well trained the LLM is in that person's domain. I
| had Claude and a bunch of other models write my GitSense
| Chat Bridge code which makes it possible to bring Git's
| history into my chat app and it is slow as hell. It works
| most of the time, but it was obvious that the design
| pattern was based on simple CRUD apps. And this is where
| LLMs will literally slow you down and I know this because
| I already solved this problem. The LLM generated chat
| bridge code will be free and open sourced but I will
| charge for my optimized indexing engine.
| bbatchelder wrote:
| Even with human junior devs, ideally you'd maintain some
| documentation about common mistakes/gotchas so that when you
| onboard new people to the team they can read that instead of
| you having to hold their hand manually.
|
| You can do the same thing for LLMs by keeping a file with those
| details available and included in their context.
|
| You can even set up evaluation loops so that entries can be
| made by other agents.
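|
| (A bare-bones version of that evaluation loop, assuming a
| hypothetical review_agent that returns a short lesson string when
| it rejects a patch:)
|
|   from pathlib import Path
|
|   GOTCHAS = Path("LESSONS.md")  # included in every agent's context
|
|   def record_lesson(task, patch):
|       verdict = review_agent(task, patch)  # hypothetical reviewer
|       if not verdict.approved:
|           with GOTCHAS.open("a") as f:
|               f.write(f"- {verdict.lesson}\n")
|       return verdict.approved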
| ghuntley wrote:
| correct.
| wrs wrote:
| I've been using Cursor and Code regularly for a few months now
| and the idea of letting three of them run free on the codebase
| seems insane. The reason for the chat interface is that the agent
| goes off the rails on a regular basis. At least 25% of the time I
| have to hit the stop button and go back to a checkpoint because
| the automatic lawnmower has started driving through the flowerbed
| again. And paradoxically, the more capable the model gets, the
| more likely it seems to get random ideas of how to fix things
| that aren't broken.
| barrell wrote:
| Had a similar experience with Claude Code lately. I got a
| notice some credits were expiring, so I opened up Claude Code
| and asked it to fix all the credo errors in an elixir project
| (style guide enforcement).
|
| I gave it incredibly clear steps of what to run in what
| process, maybe 6 steps, 4 of which were individual severity
| levels.
|
| Within a few minutes it would ask to commit code, create
| branches, run tests, start servers -- always something new,
| none of which were in my instructions. It would also often run
| mix credo, get a list of warnings, deem them unimportant, then
| try to go do its own thing.
|
| It was really cool, I basically worked through 1000 formatting
| errors in 2 hours with $40 of credits (that I would have had no
| use for otherwise).
|
| But man, I can't imagine letting this thing run a single
| command without checking the output
| tekacs wrote:
| So... I know that people frame these sorts of things as if
| it's some kind of quantization conspiracy, but as someone who
| started using Claude Code the _moment_ that it came out, it
| felt particularly strong. Then, it feels like they... tweaked
| something, whether in CC or Sonnet 3.7 and it went a little
| downhill. It's still very impressive, but something was lost.
|
| I've found Gemini 2.5 Pro to be extremely impressive and much
| more able to run in an extended fashion by itself, although
| I've found very high variability in how well 'agent mode'
| works between different editors. Cursor has been very very
| weak in this regard for me, with Windsurf working a little
| better. Claude Code is excellent, but at the moment does feel
| let down by the model.
|
| I've been using Aider with Gemini 2.5 Pro and found that it's
| very much able to 'just go' by itself. I shipped a mode for
| Aider that lets it do so (sibling comment here) and I've had
| it do some huge things that run for an hour or more, but
| assuredly it does get stuck and act stupidly on other tasks
| as well.
|
| My point, more than anything, is that... I'd try different
| editors and different (stronger) models and see - and that
| small tweaks to prompt and tooling are making a big
| difference to these tools' effectiveness right now. Also,
| different models seem to excel at different problems, so
| switching models is often a good choice.
| barrell wrote:
| Eh, I am happy waiting many years before any of that. If it
| only works right with the right model for the right job, and
| it's very fuzzy which models work for which tasks, and the
| models change all the time (oftentimes silently)... at
| some point it's just easier to do the easy task I'm trying
| to offload than to juggle all of this.
|
| If and when I go about trying these tools in the future,
| I'll probably look for an open source TUI, so keep up the
| great work on Aider!
| sdesol wrote:
| > I've had it do some huge things that run for an hour or
| more,
|
| Can you clarify this? If I am reading this right, you let
| the llm think/generate output for an hour? This seems
| bonkers to me.
| tekacs wrote:
| Over the last two days, I've built out support for autonomy in
| Aider (a lot like Claude Code) that hybridizes with the rest of
| the app:
|
| https://github.com/Aider-AI/aider/pull/3781
|
| Edit: In case anyone wants to try it, I uploaded it to PyPI as
| `navigator-mode`, until (and if!) the PR is accepted. By I, I
| mean that it uploaded itself. You can see the session where it
| did that here: https://asciinema.org/a/9JtT7DKIRrtpylhUts0lr3EfY
|
| Edit 2: And as a Show HN, too:
| https://news.ycombinator.com/item?id=43674180
|
| and, because Aider's already an amazing platform without the
| autonomy, it's very easy to use the rest of Aider's options, like
| using `/ask` first, using `/code` or `/architect` for specific
| tasks [1], but if you start in `/navigator` mode (which I built,
| here), you can just... ask for a particular task to be done
| and... wait and it'll often 'just get done'.
|
| It's... decidedly expensive to run an LLM this way right now
| (Gemini 2.5 Pro is your best bet), but if it's $N today, I don't
| doubt that it'll be $0.N by next year.
|
| I don't mean to speak in meaningless hype, but I think that a lot
| of folks who are speaking to LLMs' 'inability' to do things are
| also spending relatively cautiously on them, when tomorrow's
| capabilities are often here, just pricey.
|
| I'm definitely still intervening as it goes (as in the Devin
| demos, say), but I'm also having LLMs relatively autonomously
| build out large swathes of functionality, the kind that I would
| put off or avoid without them. I wouldn't call it a programmer-
| replacement any time soon (it feels far from that), but I'm solo
| finishing architectures now that I know how to build, but where
| delegating them to a team of senior devs would've resulted in
| chaos.
|
| [1]: also for anyone who hasn't tried it and doesn't like TUI, do
| note that Aider has a web mode and a 'watch mode', where you can
| use your normal editor and if you leave a comment like '# make
| this darker ai!', Aider will step in and apply the change. This
| is even fancier with navigator/autonomy.
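|
| (For anyone who hasn't seen watch mode: you start Aider watching
| the repo and keep editing in your own editor; the flag below is
| from Aider's docs, but double-check it against your version:)
|
|   $ aider --watch-files
|
|   # then, in e.g. theme.py, leave a marker comment:
|   SIDEBAR_BG = "#f0f0f0"  # make this darker, AI!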
| nico wrote:
| > It's... decidedly expensive to run an LLM this way right now
|
| Does it work ok with local models? Something like the quantized
| deepseeks, gemma3 or llamas?
| tekacs wrote:
| It does for me, yes -- models seem to be pretty capable of
| adhering to the tool call format, which is really all that
| they 'need' in order to do a good job.
|
| I'm still tweaking the prompts (and I've introduced a new,
| tool-call based edit format as a primary replacement to
| Aider's usual SEARCH/REPLACE, which is both easier and harder
| for LLMs to use - but it allows them to better express e.g.
| 'change the name of this function').
|
| So... if you have any trouble with it, I would adjust the
| prompts (in `navigator_prompts.py` and
| `navigator_legacy_prompts.py` for non-tool-based editing). In
| particular when I adopted more 'terseness and proactively
| stop' prompting, weaker LLMs started stopping prematurely
| more often. It's helpful for powerful thinking models (like
| Sonnet and Gemini 2.5 Pro), but for smaller models I might
| need to provide an extra set of prompts that let them roam
| more.
| regularfry wrote:
| Since you've got the aider hack session going...
|
| One thing I've had in the back of my brain for a few days is
| the idea of LLM-as-a-judge over a multi-armed bandit, testing
| out local models. Locally, if you aren't too fussy about how
| long things take, you can spend all the tokens you want.
| Running head-to-head comparisons is slow, but with a MAB you're
| not doing so for every request. Nine times out of ten it's the
| normal request cycle. You could imagine having new models get
| mixed in as and when they become available, able to take over
| if they're genuinely better, entirely behind the scenes. You
| don't need to manually evaluate them at that point.
|
| I don't know how well that gels with aider's modes; it feels
| like you want to be able to specify a judge model but then have
| it control the other models itself. I don't know if that's
| better within aider itself (so it's got access to the added
| files to judge a candidate solution against, and can directly
| see the evaluation) or as an API layer between aider and the
| vllm/ollama/llama-server/whatever service, with the
| complication of needing to feed scores out of aider to stoke
| the MAB.
|
| You could extend the idea to generating and comparing system
| prompts. That might be worthwhile but it feels more like
| tinkering at the edges.
|
| Does any of that sound feasible?
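|
| (Sketching just the bandit side of that and leaving the judge
| abstract - plain epsilon-greedy over model names, nothing
| Aider-specific:)
|
|   import random
|
|   class ModelBandit:
|       def __init__(self, models, eps=0.1):
|           self.stats = {m: [0.0, 0] for m in models}  # [reward sum, pulls]
|           self.eps = eps
|
|       def pick(self):
|           if random.random() < self.eps:
|               return random.choice(list(self.stats))
|           return max(self.stats, key=lambda m: (
|               self.stats[m][0] / self.stats[m][1]
|               if self.stats[m][1] else 1.0))
|
|       def update(self, model, reward):  # judge's score, 0..1
|           s = self.stats[model]
|           s[0] += reward
|           s[1] += 1
|
| Nine requests out of ten you'd just call pick() and serve the
| answer; on the tenth you'd run the head-to-head and feed the
| judge's score back in via update().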
| tekacs wrote:
| It's funny you say this! I was adding a tool just earlier
| (that I haven't yet pushed) that allows the model to...
| switch model.
|
| Aider can also have multiple models active at any time (the
| architect, editor and weak model is the standard set) and use
| them for different aspects. I could definitely imagine
| switching one model whilst leaving another active.
|
| So yes, this definitely seems feasible.
|
| Aider had a fairly coherent answer to this question, I think:
| https://gist.github.com/tekacs/75a0e3604bc10ea88f9df9a909b5d.
| ..
|
| This was navigator mode + Gemini 2.5 Pro's attempt at
| implementing it, based only on pasting in your comment:
|
| https://asciinema.org/a/EKhno9vQlqk9VkYizIxsY8mIr
|
| https://github.com/tekacs/aider/commit/6b8b76375a9b43f9db785.
| ..
|
| I think it did a fairly good job! It took just a couple of
| minutes and it effectively just switches the main model based
| on recent input, but I don't doubt that this could become
| really robust if I had poked or prompted it further with
| preferences, ideas, beliefs and pushback! I imagine that you
| could very quickly get it there if you wished.
|
| It's definitely not showing off the most here, because it's
| almost all direct-coding, very similar to ordinary Aider. :)
| gandalfgeek wrote:
| Very cool. Even cooler to see it upload itself!!
| aqme28 wrote:
| It's cute but I don't see the benefit. In my experience, if one
| LLM fails to solve a problem, the other ones won't be too
| different.
|
| If you picked a problem where LLMs are good, now you have to
| review 3 PRs instead of just 1. If you picked a problem where
| they're bad, now you have 3 failures.
|
| I think there are not many cases where throwing more attempts at
| the problem is useful.
| denidoman wrote:
| The current challenge is not to create a patch, but to verify it.
|
| Testing a fix in a big application is a very complex task. First
| of all, you have to reproduce the issue, to verify the steps (or
| create them, because many issues don't contain a clear
| description). Then you should switch to the fixed version and
| make sure that the issue doesn't exist. Finally, you should
| apply a little exploratory testing to make sure that the fix
| hasn't corrupted neighbouring logic (deep application knowledge
| is required to perform this).
|
| To perform these steps you have to deploy staging with the
| original/fixed versions or run everything locally and do pre-
| setup (create users, entities, etc. to achieve the corrupted
| state).
|
| This is a very challenging area for the current agents. Right now
| they just can't do these steps - their mental models just aren't
| ready for such a level of integration into the app and infra. And
| the creation of 3/5/10/100 unverified pull requests just slows
| down the software development process.
| gandalfgeek wrote:
| There is no fundamental blocker to agents doing all those
| things. Mostly a matter of constructing the right tools and
| grounding, which can be a fair amount of up-front work. Arming
| LLMs with the right tools and documentation got us this far.
| There's no reason to believe that path is exhausted.
| dimitri-vs wrote:
| Have you tried building agents? They will go from PhD level
| smart to making mistakes a middle schooler would find
| obvious, even on models like gemini-2.5 and o1-pro. It's
| almost like building a sandcastle where once you get a prompt
| working you become afraid to make any changes because
| something else will break.
| sdesol wrote:
| > Have you tried building agents?
|
| I think the issue right now is so many people want to
| believe in the moonshot and are investing heavily in it,
| when the reality is we should be focusing on the home runs.
| LLMs are a game changer, but there is still A LOT of
| tooling that can be created to make it easier to integrate
| humans in the loop.
| tough wrote:
| you can just even tell cursor to use any cli tools you use
| normally in your development, like git, gh, railway, vercel,
| node debugging, etc etc
| denidoman wrote:
| Tools aren't the problem. Knowledge is.
| ghuntley wrote:
| Correct! Over at https://ghuntley.com/mcp I propose that each
| company develop their own tools for their particular
| codebase that shape how LLMs act when working with that
| codebase.
| denidoman wrote:
| Look at this 18-year-old Django ticket:
| https://code.djangoproject.com/ticket/4140
|
| It wasn't impossible to fix, but it required some experiments
| and deep research into very specific behaviors.
|
| Or this ticket: https://code.djangoproject.com/ticket/35289
|
| The author proposed a one-line solution, but the following
| discussion includes analysis of the RFC, potential negative
| outcomes, and different ways to fix it.
|
| And without a deep understanding of the project, it's not
| clear how to fix it properly without damaging backward
| compatibility and neighboring functionality.
|
| Also, such a fix must be properly tested manually, because
| even well-designed autotests do not 100% match the actual
| flow.
|
| You can explore other open and closed issues and
| corresponding discussions. And this is the complexity level
| of real software, not pet projects or simple apps.
|
| I guess the existing attention mechanism is the fundamental
| blocker, because it's barely able to process all the context
| required for a fix.
|
| And feature requests are much, much more complex.
| ghurtado wrote:
| All the things you describe are already being done by any team
| with a modern CI/CD workflow, and none of it requires AI.
|
| At my last job, all of those steps were automated and required
| exactly zero human input.
| denidoman wrote:
| Are you sure about "all"? Because I mentioned not only env
| deployment, but also functional issue reproduction using the
| UI/API, which also requires the necessary pre-setup.
|
| Automated tests partially solve the case, but in the real world
| no one writes tests blindly. It's always manual work, and only
| when the failing trajectory is clear is the test written.
|
| Theoretically an agent can interact with the UI or API. But that
| requires deep project understanding, gathered from code,
| documentation, git history, tickets, Slack. And obtaining
| this context, building an easily accessible knowledge base,
| and pouring only the necessary parts into the agent context is
| still an unsolved task.
| lherron wrote:
| I love this! I have a similar automation for moving a feature
| through ideation/requirements/technical design, but I usually
| dump the result into Cursor for last mile and to save on
| inference. Seeing the cost analysis is eye opening.
|
| There's probably also some upside to running the same model
| multiple times. I find Sonnet will sometimes fail, I'll roll back
| and try again with the same prompt but a clean context, and it
| will succeed.
| ghuntley wrote:
| re: cost analysis
|
| There's something cooked about Windsurf's/Cursor's go-to-market
| pricing - there's no way they are turning a profit at
| $50/month. $50/month gets you a happy meal experience. If you
| want more power, you gotta ditch snacking at McDonald's.
|
| In the future, companies should budget $100 USD to $500 USD per
| day, per dev, on tokens as the new normal for business, which
| is circa $25k USD (low end) to $50k USD (likely) to $127k USD
| (highest) per year.
|
| Above from https://ghuntley.com/redlining/
|
| This napkin math is based upon my current spend in bringing a
| self-compiled compiler to life.
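|
| (At roughly 250 working days a year, that napkin math checks out:
| $100/day is ~$25k, $200/day is ~$50k, and $500/day lands a little
| over $125k.)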
| precompute wrote:
| Feels like a way to live with a bad decision rather than getting
| rid of it.
| pton_xd wrote:
| The trend with LLMs so far has been: if you have an issue with
| the AI, wait 6 months for a more advanced model. Cobbling
| together workarounds for their deficiencies is basically a waste
| of effort.
| KTibow wrote:
| I wonder if using thinking models would work better here. They
| generally have less variance and consider more options, which
| could achieve the same goal.
| canterburry wrote:
| I wouldn't be surprised if someone tries to leverage this with
| their customer feature request tool.
|
| Imagine having your customers write feature requests for your
| SaaS that immediately trigger code generation and a PR. A
| virtual environment with that PR is spun up and served to that
| customer for feedback and refinement. Loop until the customer has
| implemented the feature they would like to see in your product.
|
| Enterprise plan only, obviously.
| kgeist wrote:
| I've noticed that large models from different vendors often end
| up converging on more or less the same ideas (probably because
| they're trained on more or less the same data). A few days ago, I
| asked both Grok and ChatGPT to produce several stories with an
| absurd twist, and they consistently generated the same twists,
| differing only in minor details. Often, they even used identical
| wording!
|
| Is there any research into this phenomenon? Is code generation
| any different? Isn't there a chance that several "independent"
| models might produce the same (say, faulty) result?
___________________________________________________________________
(page generated 2025-04-13 23:00 UTC)