[HN Gopher] Effective harnesses for long-running agents
___________________________________________________________________
Effective harnesses for long-running agents
Author : diwank
Score : 48 points
Date : 2025-11-28 19:05 UTC (3 hours ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| dangoodmanUT wrote:
| > ... the model is less likely to inappropriately change or
| overwrite JSON files compared to Markdown files.
|
| Very interesting.
| roughly wrote:
| One of the things that makes it very difficult to have
| reasonable conversations about what you can do with LLMs is
| that the effort-to-outcome curve is basically exponential -
| with almost no effort, you can get 70% of the way there. This
| looks amazing, and so people (mostly executives) look at it
| and think, "this changes everything!"
|
| The problem is the remaining 30% - the next 10-20% starts to
| require things like multi-agent judge setups, external memory,
| and context management, and that gets you to something that's
| probably working but that you sure shouldn't ship to
| production. As to the last 10% - I've seen agentic workflows
| with hundreds of different agents, multiple models, and
| fantastically complex evaluation frameworks trying to push the
| error rate below the ~10% mark. At a certain point the cost of
| infrastructure and LLM calls is running into several hundred
| dollars per run, and you're still not getting guaranteed
| reliable output.
|
| If you know what you're doing and you know where to fit the LLMs
| (they're genuinely the best system we've ever devised for
| interpreting and categorizing unstructured human input), they can
| be immensely useful, but they sing a siren song of simplicity
| that will lure you to your doom if you believe it.
| morkalork wrote:
| Just for getting a frame of reference, how many people were
| involved over how much time building a workflow with hundreds
| of agents?
| roughly wrote:
| I've seen a couple of solo efforts and a couple of teams, but
| it usually takes a few months. It tends to evolve as a kind of
| whack-a-mole situation - "we solved that failure
| case/hallucination, now we're getting this one."
| zephyrthenoble wrote:
| Yes, it's essentially the Pareto principle [0]. The LLM
| community has treated the first 80% as difficult, complicated
| work when it was essentially boilerplate. Allegedly LLMs have
| saved us from that drudgery, but I personally have found that
| (without the complicated setups you mention) the 80%-done
| project that gets one-shotted is in reality more like 50%
| done, because it is built on an unstable foundation, and that
| final 20% involves a lot of complicated reworking of the code.
| There's still plenty of value, but I think it is less than
| proponents would want you to believe.
|
| Anecdotally, I have found that even if you type out paragraph
| after paragraph describing everything you need the agent to
| take care of, by the time you finally send your prompt off it
| feels like you could have written a lot of the code yourself
| with the help of a good IDE.
|
| - [0] https://en.wikipedia.org/wiki/Pareto_principle
| roughly wrote:
| Yeah, my mental model at this point is that there are two
| components to building a system: writing the code and
| understanding the system. When you're the one writing the
| code, you get the understanding at the same time. When you're
| not, you still need to put in that work to deeply grok the
| system. You can do it ahead of time while writing the prompts,
| you can do it while reviewing the code, you can do it while
| writing the test suite, or you can do it when the system is on
| fire during an outage, but the work to understand the system
| can't be outsourced to the LLM.
| slurrpurr wrote:
| BDSM for LLMs
| CurleighBraces wrote:
| I wonder how well these agents would do with something like
| Cucumber and behaviour-driven development tools?
| _boffin_ wrote:
| ...it really feels like they're attempting to reinvent a
| project tracker and are starting from scratch in how they
| think about it.
|
| It feels like they're a few versions behind what I'm doing,
| which is... odd.
|
| Self-hosting a plane.io instance. Added a Plane MCP tool to my
| codex. Added workflow instructions into Agents.md covering
| standards, documentation, related work, labels, branch names,
| adding comments before the plan, after the plan, at varying
| steps of implementation, and a summary before moving a ticket
| to done. Creating new tickets and being able to relate them to
| the current one or to others, etc...
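|
| Roughly the shape of that Agents.md section, paraphrased from
| memory (the exact fields are from my setup, so treat them as
| examples):
|
|     ## Ticket workflow
|     - Before planning: comment on the ticket with intent and
|       related work
|     - After planning: comment with the plan; apply labels
|     - Branch name: <ticket-id>-<short-slug>
|     - During implementation: comment at each major step
|     - Before Done: comment with a summary of what changed
|     - New tickets: relate them to the current epic or to any
|       blocking tickets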
|
| It ain't that hard. Just do inception (high- to mid-level
| details), create epics and tasks. Add personas, details,
| notes, acceptance criteria and more. Can add comments yourself
| to update. Whatever.
|
| Slice tickets thin and then go wild. Add tickets as you're
| working through things. Make modifications.
|
| Why so difficult?
| tomwojcik wrote:
| Did you mean plane.so instead of plane.io?
| imron wrote:
| I assumed they meant https://github.com/makeplane/plane
| _boffin_ wrote:
| Correct. My bad
| imron wrote:
| This was my take.
|
| They've made an issue tracker out of JSON files and a text
| file.
|
| Why not hook an MCP up to an actual issue tracker?
| daxfohl wrote:
| IME a dedicated testing / QA agent sounds nice but doesn't
| work, for the same reasons as AI / human interaction: the more
| you try to steer away from the original dev agent's approach,
| the less chance there is that the dev agent will get to where
| you want it to be. Far more frequently it'll get stuck in a
| loop between two options, neither of which is what you want.
|
| So adding a QA agent, while it sounds logical, just ends up
| producing even more of this. Rather than converging on a
| solution, the agents just get all out of whack. Until that is
| solved, it's far better to have your dev agent be smart about
| doing its own QA.
|
| The only way I could see the QA agent idea working now is if it
| had the power to roll back the entire change, reset the dev
| agent, update the task with some hints of things not to overlook,
| and trigger the dev process from scratch. But that seems pretty
| inefficient, and IDK if it would work any better.
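|
| Concretely, something like this sketch (run_dev_agent and
| run_qa_agent are stand-ins for whatever harness you use):
|
|     import subprocess
|
|     def qa_gated_loop(task, run_dev_agent, run_qa_agent,
|                       max_attempts=3):
|         # QA can only accept or reject the whole change; on a
|         # reject, wipe the worktree and rerun dev from scratch
|         # with QA's hints appended to the task.
|         hints = []
|         for _ in range(max_attempts):
|             full_task = task + "".join(
|                 "\n- Don't overlook: " + h for h in hints)
|             run_dev_agent(full_task)
|             verdict = run_qa_agent(task)  # {"ok": bool, "hint": str}
|             if verdict["ok"]:
|                 return True
|             hints.append(verdict["hint"])
|             subprocess.run(["git", "reset", "--hard"], check=True)
|             subprocess.run(["git", "clean", "-fd"], check=True)
|         return False
|
| Every rejection costs a full dev run, which is where the
| inefficiency comes in.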
| awayto wrote:
| > Run pwd to see the directory you're working in. You'll only be
| able to edit files in this directory.
|
| If you're using the agent to produce any kind of code that has
| access to manipulate the filesystem, you may as well have it
| understand its own abilities as covering the entirety of CRUD,
| not just updates. I could easily see the agent talking itself
| into working around "only be able to edit" with its other
| knowledge that it can just write a script to do whatever it
| wants. This also reinforces to devs that they basically
| shouldn't trust the agent when it comes to the filesystem.
|
| As for pwd: for existing projects, I start each session by
| running tree, scoped to the part of the project filesystem I
| want worked on.
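|
| Something like this, where the depth and the path are just
| examples:
|
|     tree -L 3 src/billing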
| rancar2 wrote:
| Having done this several months ago for Gemini CLI, to get a
| well-behaved non-coding LLM CLI at no cost, I can attest that
| these tips work well across CLIs.
___________________________________________________________________
(page generated 2025-11-28 23:00 UTC)