[HN Gopher] Effective harnesses for long-running agents
       ___________________________________________________________________
        
       Effective harnesses for long-running agents
        
       Author : diwank
       Score  : 48 points
       Date   : 2025-11-28 19:05 UTC (3 hours ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | dangoodmanUT wrote:
       | > ... the model is less likely to inappropriately change or
       | overwrite JSON files compared to Markdown files.
       | 
       | Very interesting.
        
       | roughly wrote:
        | One of the things that makes it very difficult to have
        | reasonable conversations about what you can do with LLMs is
        | that the effort-to-outcome curve is basically exponential -
        | with almost no effort, you can get 70% of the way there. This
        | looks amazing, and so
       | people (mostly executives) look at this and think, "this changes
       | everything!"
       | 
       | The problem is the remaining 30% - the next 10-20% starts to
       | require things like multi-agent judge setups, external memory,
       | context management, and that gets you to something that's
       | probably working but you sure shouldn't ship to production. As to
       | the last 10% - I've seen agentic workflows with hundreds of
       | different agents, multiple models, and fantastically complex
       | evaluation frameworks to try to reduce the error rates past the
        | ~10% mark. At a certain point, the infrastructure and LLM
        | calls are running into several hundred dollars per run, and
        | you're still not getting guaranteed reliable output.
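        | 
        | To make "multi-agent judge setup" concrete, the core of it is
        | usually a loop like the sketch below, where a second model
        | call gates the first and the failure reason gets fed back in
        | (call_model here is just a stand-in for whatever client you
        | use) - multiply that by hundreds of agents and the call count
        | explodes:
        | 
        |     WORKER = "You are the worker. Do the task."
        |     JUDGE = "You are the judge. Reply PASS or FAIL: <reason>."
        | 
        |     def call_model(system: str, prompt: str) -> str:
        |         # Stand-in for whatever LLM client you're using.
        |         raise NotImplementedError
        | 
        |     def generate_with_judge(task: str, rounds: int = 3) -> str:
        |         feedback = ""
        |         for _ in range(rounds):
        |             draft = call_model(WORKER, task + feedback)
        |             review = call_model(
        |                 JUDGE, f"Task:\n{task}\n\nDraft:\n{draft}")
        |             if review.strip().startswith("PASS"):
        |                 return draft
        |             feedback = "\n\nLast attempt failed: " + review
        |         return draft  # still unverified after all rounds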
       | 
       | If you know what you're doing and you know where to fit the LLMs
       | (they're genuinely the best system we've ever devised for
       | interpreting and categorizing unstructured human input), they can
       | be immensely useful, but they sing a siren song of simplicity
       | that will lure you to your doom if you believe it.
        
         | morkalork wrote:
          | Just to get a frame of reference: how many people were
          | involved, and over how much time, in building a workflow
          | with hundreds of agents?
        
           | roughly wrote:
           | I've seen a couple solo efforts and a couple teams, but
           | usually a few months. It tends to evolve as a kind of whack-
           | a-mole situation - "we solved that failure
           | case/hallucination, now we're getting this one."
        
         | zephyrthenoble wrote:
          | Yes, it's essentially the Pareto principle [0]. The LLM
          | community has treated the 80% as difficult, complicated work,
          | when it was essentially boilerplate. Allegedly LLMs have saved
          | us from that drudgery, but I personally have found that
          | (without the complicated setups you mention) the 80%-done
          | project that gets one-shotted is in reality more like 50%
          | done, because it is built on an unstable foundation, and that
          | final 20% involves a lot of complicated reworking of the code.
          | There's still plenty of value, but I think it is less than
          | proponents would want you to believe.
         | 
          | Anecdotally, I have found that even if you type out paragraph
          | after paragraph describing everything you need the agent to
          | take care of, by the time you finally send the prompt off it
          | feels like you could have written a lot of the code yourself
          | with the help of a good IDE.
         | 
         | - [0] https://en.wikipedia.org/wiki/Pareto_principle
        
           | roughly wrote:
            | Yeah, my mental model at this point is that there are two
            | components to building a system: writing the code and
            | understanding the system. When you're the one writing the
            | code, you get the
           | understanding at the same time. When you're not, you still
           | need to put in that work to deeply grok the system. You can
           | do it ahead of time while writing the prompts, you can do it
           | while reviewing the code, you can do it while writing the
           | test suite, or you can do it when the system is on fire
           | during an outage, but the work to understand the system can't
           | be outsourced to the LLM.
        
       | slurrpurr wrote:
       | BDSM for LLMs
        
       | CurleighBraces wrote:
        | I wonder how well these agents would do with something like
        | Cucumber and behaviour-driven development tools?
        
       | _boffin_ wrote:
       | ...it really feels like they're attempting to reinvent a project
       | tracker and starting off from scratch in thinking about it.
       | 
       | It feels like they're a few versions behind what I'm doing, which
       | is... odd.
       | 
        | Self-hosting a plane.io instance. Added a plane MCP tool to my
        | codex. Added workflow instructions into Agents.md which cover
        | standards, documentation, related work, labels, branch names,
        | adding comments before the plan, after the plan, at various
        | steps of implementation, and a summary before moving the ticket
        | to done. Creating new tickets and relating them to current or
        | other tickets, etc...
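        | 
        | The Agents.md part is just a plain block of rules; roughly
        | this shape (the branch pattern and label names here are
        | made-up placeholders, not my exact setup):
        | 
        |     Ticket workflow:
        |     - Before planning, comment on the ticket with your
        |       intended approach; after planning, comment the plan.
        |     - Branch names: feature/<ticket-id>-<short-slug>.
        |     - Label every ticket you touch (bug, feature, chore).
        |     - Add a progress comment at each major step.
        |     - Before moving a ticket to done, add a summary of
        |       what changed and how it was tested.
        |     - If you find new work, create a new ticket and
        |       relate it to the current one.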
       | 
        | It ain't that hard. Just do inception (high- to mid-level
        | details), create epics and tasks. Add personas, details, notes,
        | acceptance criteria and more. You can add comments yourself to
        | update things. Whatever.
        | 
        | Slice tickets thin and then go wild. Add tickets as you're
        | working through things. Make modifications.
       | 
       | Why so difficult?
        
         | tomwojcik wrote:
         | Did you mean plane.so instead of plane.io?
        
           | imron wrote:
           | I assumed they meant https://github.com/makeplane/plane
        
             | _boffin_ wrote:
             | Correct. My bad
        
         | imron wrote:
         | This was my take.
         | 
          | They've made an issue tracker out of JSON files and a text
          | file.
          | 
          | Why not hook an MCP server up to an actual issue tracker?
        
       | daxfohl wrote:
        | IME a dedicated testing / QA agent sounds nice but doesn't
        | work, for the same reasons as AI / human interaction. The more
        | you try to push the dev agent away from its original approach,
        | the less chance there is that it will get to where you want it
        | to be. Far more frequently it'll get stuck in a loop between
        | two options, neither of which is what you want.
       | 
       | So adding a QA agent, while it sounds logical, just ends up being
       | even more of this. Rather than converging on a solution, they
       | just get all out of whack. Until that is solved, far better just
       | to have your dev agent be smart about doing its own QA.
       | 
       | The only way I could see the QA agent idea working now is if it
       | had the power to roll back the entire change, reset the dev
       | agent, update the task with some hints of things not to overlook,
       | and trigger the dev process from scratch. But that seems pretty
       | inefficient, and IDK if it would work any better.
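        | 
        | A sketch of that reset-and-retry shape, with run_dev_agent and
        | run_qa_agent as placeholders for whatever harness you're
        | driving (it assumes the dev agent works in a dirty checkout
        | that git can throw away):
        | 
        |     import subprocess
        | 
        |     def run_dev_agent(task: str) -> None:
        |         # Placeholder: drive the dev agent with the task text.
        |         ...
        | 
        |     def run_qa_agent(task: str) -> tuple[bool, list[str]]:
        |         # Placeholder: QA agent reviews the working tree and
        |         # returns (passed, list of issues it found).
        |         ...
        | 
        |     def dev_with_qa(task: str, attempts: int = 3) -> bool:
        |         hints: list[str] = []
        |         for _ in range(attempts):
        |             notes = "".join(f"\n- Note: {h}" for h in hints)
        |             run_dev_agent(task + notes)
        |             passed, found = run_qa_agent(task)
        |             if passed:
        |                 return True
        |             hints.extend(found)
        |             # Throw away the whole change; the dev agent
        |             # starts from scratch with the extra hints.
        |             subprocess.run(["git", "reset", "--hard"])
        |             subprocess.run(["git", "clean", "-fd"])
        |         return False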
        
       | awayto wrote:
       | > Run pwd to see the directory you're working in. You'll only be
       | able to edit files in this directory.
       | 
        | If you're using the agent to produce any kind of code that has
        | access to manipulate the filesystem, you may as well have it
        | understand its own abilities as covering the entirety of CRUD,
        | not just updates. I could easily see the agent talking itself
        | into working around "only be able to edit" with its other
        | knowledge that it can just write a script to do whatever it
        | wants. This also reinforces to devs that they basically
        | shouldn't trust the agent when it comes to the filesystem.
       | 
        | As for pwd: for existing projects, I start each session by
        | running tree, scoped to the part of the project filesystem I
        | want worked on.
        
       | rancar2 wrote:
        | Having done this with Gemini CLI several months ago, to get a
        | well-behaved non-coding LLM CLI at no cost, I can attest that
        | these tips work well across CLIs.
        
       ___________________________________________________________________
       (page generated 2025-11-28 23:00 UTC)