[HN Gopher] The unreasonable effectiveness of an LLM agent loop ...
___________________________________________________________________
The unreasonable effectiveness of an LLM agent loop with tool use
Author : crawshaw
Score : 135 points
Date : 2025-05-15 19:33 UTC (3 hours ago)
(HTM) web link (sketch.dev)
(TXT) w3m dump (sketch.dev)
| _bin_ wrote:
| I've found sonnet-3.7 to be incredibly inconsistent. It can do
| very well but has a strong tendency to get off-track and run off
| and do weird things.
|
| 3.5 is better for this, ime. I hooked claude desktop up to an MCP
| server to fake claude-code less the extortionate pricing and it
| works decently. I've been trying to apply it to Rust work; it's
| not great yet (still doesn't really seem to "understand" Rust's
| concepts) but can do some stuff if you make it run `cargo check`
| after each change and stop it if the check doesn't pass.
|
| I expect something like o3-high is the best out there (aider
| leaderboards support this) either alone or in combination with
| 4.1, but tbh that's out of my price range. And frankly, I can't
| mentally get past paying a very high price for an LLM response
| that may or may not be useful; it leaves me incredibly resentful
| as a customer that your model can fail the task, requiring
| multiple "re-rolls", and you're passing that marginal cost to me.
| agilebyte wrote:
| I am avoiding the cost of API access by using the chat UI
| instead, in my case Google Gemini 2.5 Pro with its large
| context window. Repomix a whole repo, paste it in with a
| standard prompt saying "return full source" (it tends to stop
| following this instruction after a few back-and-forths), then
| apply the result back on top of the repo (I vibe-coded
| https://github.com/radekstepan/apply-llm-changes to help me
| with that). Otherwise, yeah: $5 spent on Cline with Claude 3.7,
| and instead of fixing my tests, I end up with if/else statements
| in the source code to make the tests pass.
| harvey9 wrote:
| Guess it was trained by scraping thedailywtf.com
| actsasbuffoon wrote:
| I decided to experiment with Claude Code this month. The
| other day it decided the best way to fix the spec was to add
| a conditional to the test that causes it to return true
| before getting to the thing that was actually supposed to be
| tested.
|
| I'm finding it useful for really tedious stuff like doing
| complex, multi step terminal operations. For the coding...
| it's not been great.
| nico wrote:
| I've had this happen in different ways many times. Like, instead
| of resolving the underlying issue behind an exception, it just
| suggests catching the exception and carrying on
|
| It also depends a lot on the mix of model, type of code and
| libraries involved. Even on different days the models seem
| to be more or less capable (I'm assuming they get throttled
| internally - this is very noticeable sometimes in how they
| try to save on output tokens and summarize the code
| responses as much as possible, at least in the chat/non-API
| interfaces)
| christophilus wrote:
| Well, that's proof that it used my GitHub projects in its
| training data.
| nico wrote:
| Cool tool. What format does it expect from the model?
|
| I've been looking for something that can take "bare diffs"
| (unified diffs without line numbers) from the clipboard and
| then apply them directly to a buffer (an open file in vscode)
|
| None of the paste-diff extensions for vscode work, as they
| expect a full unified diff/patch
|
| I also tried a google-developed patch tool, but it also wasn't
| very good at taking in bare diffs, and definitely couldn't
| read from the clipboard
| agilebyte wrote:
| Markdown format with a comment saying what the file path
| is. So:
|
| This is src/components/Foo.tsx
|
|     ```tsx
|     // code goes here
|     ```
|
| OR
|
|     ```tsx
|     // src/components/Foo.tsx
|     // code goes here
|     ```
|
| These seem to work the best.
|
| I tried diff syntax, but Gemini 2.5 just produced way too
| many bugs.
|
| I also tried using regex and creating an AST of the markdown
| doc and going from there, but ultimately settled on calling
| gpt-4.1-mini-2025-04-14 with the opening of each code block
| (```) plus the 3 lines before and the 3 lines after it. It's
| fast/cheap enough to work.
|
| Though I still have to make edits sometimes. WIP.
| layoric wrote:
| I've been using Mistral Medium 3 for the last couple of days,
| and I'm honestly surprised at how good it is. Highly recommend
| giving it a try if you haven't, especially if you are trying to
| reduce costs. I've basically switched from Claude to Mistral and
| honestly prefer it even if costs were equal.
| nico wrote:
| How are you running the model? Mistral's api or some local
| version through ollama, or something else?
| kyleee wrote:
| Is mistral on open router?
| nico wrote:
| Yup https://openrouter.ai/provider/mistral
|
| I guess it can't really be run locally:
| https://www.reddit.com/r/LocalLLaMA/comments/1kgyfif/introdu...
| johnsmith1840 wrote:
| I seem to be alone in this but the only methods truly good at
| coding are slow heavy test time compute models.
|
| o1-pro and o1-preview are the only models I've ever used that
| can reliably update and work with 1000 LOC without error.
|
| I don't let o3 write any code unless it's very small. Any
| "cheap" model will hallucinate or fail massively when pushed.
|
| One good tip I've adopted lately: remove all comments from your
| code before passing it to an LLM, and don't let LLM-generated
| comments persist under any circumstances.
| _bin_ wrote:
| Interesting. I've never tested o1-pro because it's insanely
| expensive but preview seemed to do okay.
|
| I wouldn't be shocked if huge, expensive-to-run models
| performed better and if all the "optimized" versions were
| actually labs trying to ram cheaper bullshit down everyone's
| throat. Basically chinesium for LLMs; you can afford them but
| it's not worth it. I remember someone saying o1 was, what,
| 200B dense? I might be misremembering.
| johnsmith1840 wrote:
| I'm positive they are pushing users to cheaper models due
| to cost. o1-pro is now in a sub-menu for pro users and
| labeled legacy. The big inference methods must be stupidly
| expensive.
|
| o1-preview was, and possibly still is, the most powerful
| model they ever released. I only switched to pro for coding
| after months of them improving it and my API bill getting a
| bit crazy (like $0.50 per question).
|
| I don't think parameter count matters anymore. I think the
| only thing that matters is how much compute a vendor will
| give you per question.
| jbellis wrote:
| Yes!
|
| Han Xiao at Jina wrote a great article that goes into a lot more
| detail on how to turn this into a production-quality agentic
| search:
| https://jina.ai/news/a-practical-guide-to-implementing-deeps...
|
| This is the same principle that we use at Brokk for Search and
| for Architect. (https://brokk.ai/)
|
| The biggest caveat: some models just suck at tool calling, even
| "smart" models like o3. I only really recommend Gemini Pro 2.5
| for Architect (smart + good tool calls); Search doesn't require
| as high a degree of intelligence and lots of models work (Sonnet
| 3.7, gpt-4.1, Grok 3 are all fine).
| crawshaw wrote:
| I'm curious about your experiences with Gemini Pro 2.5 tool
| calling. I have tried using it in agent loops (in fact, sketch
| has some rudimentary support I added), and compared with the
| Anthropic models I have had to actively reprompt Gemini
| regularly to make tool calls. Do you consider it equivalent to
| Sonnet 3.7? Has it required some prompt engineering?
| jbellis wrote:
| Confession time: litellm still doesn't support parallel tool
| calls with Gemini models
| [https://github.com/BerriAI/litellm/issues/9686] so we wrote
| our own "parallel tool calls" on top of Structured Output. It
| did take a few iterations on the prompt design but all of it
| was "yeah I can see why that was ambiguous" kinds of things,
| no real complaints.
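|
| Roughly, the workaround asks the model for one JSON object
| holding a list of calls, then dispatches them in a loop - a
| minimal sketch (illustrative, not Brokk's actual code):
|
|     import json
|
|     # The schema you'd register as the structured-output format:
|     # one object holding any number of "parallel" tool calls.
|     SCHEMA = {
|         "type": "object",
|         "properties": {
|             "calls": {
|                 "type": "array",
|                 "items": {
|                     "type": "object",
|                     "properties": {
|                         "tool": {"type": "string"},
|                         "args": {"type": "object"},
|                     },
|                     "required": ["tool", "args"],
|                 },
|             },
|         },
|         "required": ["calls"],
|     }
|
|     def dispatch(reply_json: str, tools: dict) -> list:
|         """Run every call in the model's reply; tools maps a
|         tool name to a Python callable."""
|         calls = json.loads(reply_json)["calls"]
|         return [tools[c["tool"]](**c["args"]) for c in calls]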
|
| GP2.5 does have a different flavor than S3.7 but it's hard to
| say that one is better or worse than the other [edit: at tool
| calling -- GP2.5 is definitely smarter in general]. GP2.5 is
| I would say a bit more aggressive at doing "speculative" tool
| execution in parallel with the architect, e.g. spawning
| multiple search agent calls at the same time, which for Brokk
| is generally a good thing but I could see use cases where
| you'd want to dial that back.
| kgeist wrote:
| Today I tried "vibe-coding" for the first time using GPT-4o and
| 4.1. I did it manually - just feeding compilation errors,
| warnings, and suggestions in a loop via the canvas interface. The
| file was small, around 150 lines.
|
| It didn't go well. I started with 4o:
|
| - It used a deprecated package.
|
| - After I pointed that out, it didn't update all usages - so I
| had to fix them manually.
|
| - When I suggested a small logic change, it completely broke the
| syntax (we're talking "foo() } return )))" kind of broken) and
| never recovered. I gave it the raw compilation errors over and
| over again, but it didn't even register the syntax was off - just
| rewrote random parts of the code instead.
|
| - Then I thought, "maybe 4.1 will be better at coding" (as
| advertised). But 4.1 refused to use the canvas at all. It just
| explained what I could change - as in, you go make the edits.
|
| - After some pushing, I got it to use the canvas and return the
| full code. Except it didn't - it gave me a truncated version of
| the code with comments like "// omitted for brevity".
|
| That's when I gave up.
|
| Do agents somehow fix this? Because as it stands, the experience
| feels completely broken. I can't imagine giving this access to
| bash, sounds way too dangerous.
| vFunct wrote:
| Use Claude Sonnet with an IDE.
| 85392_school wrote:
| Agents definitely fix this. When you can run commands and edit
| files, the agent can test its code by itself and fix any
| issues.
| visarga wrote:
| You should try Cursor or Windsurf, with a Claude or Gemini
| model. Create a documentation file first. Generate tests for
| everything. The more the better. Then let it cycle 100 times
| until the tests pass.
|
| Normal programming is like walking, deliberate and sure. Vibe
| coding is like surfing, you can't control everything, just hit
| yes on auto. Trust the process, let it make mistakes and
| recover on its own.
| prisenco wrote:
| Given that analogy, surely you could understand why someone
| would much rather walk than surf to their destination?
| Especially people who are experienced marathon runners.
| fragmede wrote:
| If I tried standing up on the waves without a surfboard
| and complained about how it's not working, would you blame
| the water or surfing for the issue, or the person trying to
| defy physics? It doesn't matter how much I want to run, or
| whether I'm Kelvin Kiptum; I'm gonna have a bad time.
| prisenco wrote:
| That only makes sense when surfing is the only way to get
| to the destination and that's not the case.
| tqwhite wrote:
| I find that writing a thorough design spec is really worth
| it. Also, asking for its reaction - "What's missing?", "Should
| I do X or Y?" - does good things for its thought process, like
| engaging a younger programmer in the process.
|
| Definitely, I ask for a plan and then, even if it's obvious,
| I ask questions and discuss it. I also point it at samples of
| code that I like with instructions for what is good about it.
|
| Once we have settled on a plan, I ask it to break it into
| phases that can be tested (I am not one for unit testing)
| to lock in progress. Claude LOVES that. It organizes a new
| plan and, at the end of each phase tells me how to test
| (curl, command line, whatever is appropriate) and what I
| should see that represents success.
|
| The most important thing I have figured out is that Claude is
| a collaborator, not a minion. I agree with visarga: it's much
| more like surfing than walking. Also, Trust... but Verify.
|
| This is a great time to be a programmer.
| abiraja wrote:
| GPT4o and 4.1 are definitely not the best models to use here.
| Use Claude 3.5/3.7, Gemini Pro 2.5 or o3. All of them work
| really well for small files.
| theropost wrote:
| 150 lines? I find it can quickly scale to around 1500 lines,
| and then I start getting more precise about the classes and
| functions I am looking to modify
| jokethrowaway wrote:
| It's completely broken for me over 400 lines (Claude 3.7,
| paid Cursor)
|
| The worst is when I ask for something complex: the model
| generates 300 lines of good code and then times out or
| crashes. If I ask it to continue, it will mess up the code
| for good, e.g. it starts generating duplicated code or
| functions that don't match the rest of the code.
| johnsmith1840 wrote:
| It's a new skill that takes time to learn. When I started
| on gpt3.5 it took me easily 6 months of daily use before I
| was making real progress with it.
|
| I regularly generate and run code in the 600-1000 LOC range.
|
| Not sure you would call it "vibe coding" though as the
| details and info you provide it and how you provide it is
| not simple.
|
| I'd say realistically it speeds me up 10x on fresh
| greenfield projects and maybe 2x on mature systems.
|
| You should be reading the code coming out. The real way to
| prevent errors is to read the reasoning and logic. The moment
| you see a misstep, go back and try the prompt again. If that
| fails, try a new session entirely.
|
| Test time compute models like o1-pro or the older
| o1-preview are massively better at not putting errors in
| your code.
|
| Not sure about the new Claude method, but true slow test-time
| compute models are MASSIVELY better at coding.
| fragmede wrote:
| what language?
| koakuma-chan wrote:
| Sounds like a Cursor issue
| tqwhite wrote:
| Definitely a new skill to learn. Everyone I know that is
| having problems is just telling it what to do, not coaching
| it. It is not an automaton... instructions in, code out.
| Treat it like a team member that will do the work if you
| teach it right and you will have much more success.
|
| But it is definitely a learning process for you.
| nico wrote:
| 4o and 4.1 are not very good at coding
|
| My best results are usually with o4-mini-high; o3 is sometimes
| pretty good
|
| I personally don't like the canvas. I prefer the output on the
| chat
|
| And a lot of times I say: provide full code for this file, or
| provide drop-in replacement (when I don't want to deal with all
| the diffs). But usually at around 300-400 lines of code, it
| starts getting bad and then I need to refactor to break stuff
| up into multiple files (unless I can focus on just one method
| inside a file)
| manmal wrote:
| o3 is shockingly good actually. I can't use it often due to
| rate limiting, so I save it for the odd occasion. Today I
| asked it how I could integrate a tree of Swift binary
| packages within an SDK, and detect internal version clashes,
| and it gave a very well researched and sensible overview. And
| gave me a new idea that I'll try.
| kenjackson wrote:
| I use o3 for anything math or coding related. 4o is good
| for things like, "my knee hurts when I do this and that --
| what might it be?"
| hnhn34 wrote:
| Just in case you didn't know, they raised the rate limit
| from ~50/week to ~50/day a while ago
| johnsmith1840 wrote:
| Drop-in replacement files per update should be done with the
| heavy test-time compute methods.
|
| o1-pro and o1-preview can generate updated full-file responses
| into the 1k LOC range.
|
| It's something about their internal verification methods that
| makes it an actually viable development method.
| nico wrote:
| True. Also, the APIs don't care too much about restricting
| output length, they might actually be more verbose to
| charge more
|
| It's interesting how the same model being served through
| different interfaces (chat vs api), can behave differently
| based on the economic incentives of the providers
| smcleod wrote:
| GPT 4o and 4.1 are both pretty terrible for coding to be
| honest, try Sonnet 3.7 in Cline (VSCode extension).
|
| LLMs don't have up-to-date knowledge of packages by
| themselves; that's a bit like buying a book and expecting it
| to have up-to-date world knowledge. You need to supplement /
| connect it to a data source (e.g. web search, documentation
| and package version search, etc.).
| hollownobody wrote:
| Try o3 please. Via UI.
| koakuma-chan wrote:
| You gotta use a reasoning model.
| Jarwain wrote:
| Aider's benchmarks show 4.1 (and 4o) work better in its
| architect mode, for planning the changes, and o3 for making the
| actual edits
| thorum wrote:
| GPT 4.1 and 4o score very low on the Aider coding benchmark.
| You only start to get acceptable results with models that score
| 70%+ in my experience. Even then, don't expect it to do
| anything complex without a lot of hand-holding. You start to
| get a sense for what works and what doesn't.
|
| https://aider.chat/docs/leaderboards/
| ebiester wrote:
| I get that it's frustrating to be told "skill issue," but using
| an LLM is absolutely a skill and there's a combination of
| understanding the strengths of various tools, experimenting
| with them to understand the techniques, and just pure practice.
|
| I think if I were giving access to bash, though, it would
| definitely be in a docker container for me as well.
| fsndz wrote:
| It can be frustrating at times. But my experience is that the
| more you try, the better you become at knowing what to ask and
| what to expect. I guess you understand now why some people say
| vibe coding is a bit overrated:
| https://www.lycee.ai/blog/why-vibe-coding-is-overrated
| the_af wrote:
| "Overrated" is one way to call it.
|
| Giving sharp knives to monkeys would be another.
| simonw wrote:
| "It used a deprecated package"
|
| That's because models have training cut-off dates. It's
| important to take those into account when working with them:
| https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...
|
| I've switched to o4-mini-high via ChatGPT as my default model
| for a lot of code because it can use its search function to
| lookup the latest documentation.
|
| You can tell it "look up the most recent version of library X
| and use that" and it will often work!
|
| I even used it for a frustrating upgrade recently - I pasted in
| some previous code and prompted this:
|
| _This code needs to be upgraded to the new recommended
| JavaScript library from Google. Figure out what that is and
| then look up enough documentation to port this code to it._
|
| It did exactly what I asked:
| https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...
| kgeist wrote:
| >That's because models have training cut-off dates
|
| When I pointed out that it used a deprecated package, it
| agreed and even cited the correct version after which it was
| deprecated (way back in 2021). So it knows it's deprecated,
| but the next-token prediction (without reasoning or tools)
| still can't connect the dots when much of the training data
| (before 2021) uses that package as if it's still acceptable.
|
| >I've switched to o4-mini-high via ChatGPT as my default
| model for a lot of code because it can use its search
| function to lookup the latest documentation.
|
| Thanks for the tip!
| fragmede wrote:
| There's still skill involved with using the LLM in coding.
| In this case, o4-mini-high might do the trick, but the
| easier answer that works with other models is to include
| the high-level library documentation yourself as context,
| and it'll use that API.
| jmcpheron wrote:
| >I've switched to o4-mini-high via ChatGPT as my default
| model for a lot of code because it can use its search
| function to lookup the latest documentation.
|
| That is such a useful distinction. I like to think I'm
| keeping up with this stuff, but the '4o' versus 'o4' still
| throws me.
| LewisVerstappen wrote:
| skill issue.
|
| The fact that you're using 4o and 4.1 rather than claude is
| already a huge mistake in itself.
|
| > Because as it stands, the experience feels completely broken
|
| Broken for you. Not for everyone else.
| codethief wrote:
| The other day I used the Cline plugin for VSCode with Claude to
| create an Android app prototype from "scratch", i.e. starting
| from the usual template given to you by Android Studio. It
| produced several thousand lines of code, there was not a single
| compilation error, and the app ended up doing exactly what I
| wanted - modulo a bug or two, which were caused not by the
| LLM's stupidity but by weird undocumented behavior of the
| rather arcane Android API in question. (Which is exactly why I
| wanted a quick prototype.)
|
| After pointing out the bugs to the LLM, it successfully
| debugged them (with my help/feedback, i.e. I provided the
| output of the debug messages it had added to the code) and
| ultimately fixed them. The only downside was that I wasn't
| quite happy with the quality of the fixes - they were more like
| dirty hacks - but oh well, after another round or two of
| feedback we got there, too. I'm sure one could solve that more
| generally, by putting the agent writing the code in a loop with
| some other code reviewing agent.
| fragmede wrote:
| In this case, sorry to say but it sounds like there's a tooling
| issue, possibly also a skill issue. Of course you can just use
| the raw ChatGPT web interface but unless you seriously tune its
| system/user prompt, it's not going to match what good tooling
| (which sets custom prompts) will get you. Which is kind of
| counter-intuitive. A paragraph or three fed in as the system
| prompt is enough to influence behavior/performance so
| significantly? It turns out with LLMs the answer is yes.
| danbmil99 wrote:
| As others have noted, you sound about 3 months behind the
| leading edge. What you describe is like my experience from
| February.
|
| Switch to Claude (IMSHO, I think Gemini is considered on par).
| Use a proper coding tool, cutting & pasting from the chat
| window is so last week.
| libraryofbabel wrote:
| Strongly recommend this blog post too which is a much more
| detailed and persuasive version of the same point. The author
| actually goes and builds a coding agent from zero:
| https://ampcode.com/how-to-build-an-agent
|
| It is indeed astonishing how well a loop with an LLM that can
| call tools works for all kinds of tasks now. Yes, sometimes they
| go off the rails, there is the problem of getting that last 10%
| of reliability, etc. etc., but if you're not at least a little
| bit amazed then I urge you to go and hack together something
| like this yourself, which will take you about 30 minutes. It's
| possible to have a sense of wonder about these things without
| giving up your healthy skepticism of whether AI is actually going
| to be effective for this or that use case.
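|
| To make the shape concrete: a minimal sketch of such a loop,
| assuming the Anthropic Python SDK and a single illustrative
| "bash" tool (not the article's exact code):
|
|     import subprocess
|     import anthropic
|
|     client = anthropic.Anthropic()
|     tools = [{
|         "name": "bash",
|         "description": "Run a bash command, return its output.",
|         "input_schema": {
|             "type": "object",
|             "properties": {"command": {"type": "string"}},
|             "required": ["command"],
|         },
|     }]
|     messages = [{"role": "user",
|                  "content": "Fix the failing test in this repo."}]
|     while True:  # the entire agent loop
|         resp = client.messages.create(
|             model="claude-3-7-sonnet-latest", max_tokens=4096,
|             tools=tools, messages=messages)
|         messages.append({"role": "assistant",
|                          "content": resp.content})
|         if resp.stop_reason != "tool_use":
|             break  # no more tool calls; the model is done
|         results = []
|         for block in resp.content:
|             if block.type == "tool_use":
|                 out = subprocess.run(
|                     block.input["command"], shell=True,
|                     capture_output=True, text=True)
|                 results.append({
|                     "type": "tool_result",
|                     "tool_use_id": block.id,
|                     "content": out.stdout + out.stderr,
|                 })
|         messages.append({"role": "user", "content": results})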
|
| This "unreasonable effectiveness" of putting the LLM in a loop
| also accounts for the enormous proliferation of coding agents out
| there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider,
| Codex... and a ton of also-rans; as one HN poster put it the
| other day, it seems like everyone and their mother is writing
| one. The reason is that there is no secret sauce and 95% of the
| magic is in the LLM itself and how it's been fine-tuned to do
| tool calls. One of the lead developers of Claude Code candidly
| admits this in a recent interview.[0] Of course, a ton of work
| goes into making these tools work well, but ultimately they all
| have the same simple core.
|
| [0] https://www.youtube.com/watch?v=zDmW5hJPsvQ
| sesm wrote:
| Should we change the link above to use
| `?utm_source=hn&utm_medium=browser` before opening it?
| libraryofbabel wrote:
| fixed :)
| kcorbitt wrote:
| For "that last 10% of reliability" RL is actually working
| pretty well right now too! https://openpipe.ai/blog/art-e-mail-
| agent
| meander_water wrote:
| There's also this one which uses pocketflow, a graph
| abstraction library to create something similar [0]. I've been
| using it myself and love the simplicity of it.
|
| [0] https://github.com/The-Pocket/PocketFlow-Tutorial-Cursor/blo...
| wepple wrote:
| Ah, it's Thorsten Ball!
|
| I thoroughly enjoyed his "writing an interpreter". I guess I'm
| going to build an agent now.
| bhouston wrote:
| I found this out too - it is quite easy and effective:
|
| https://benhouston3d.com/blog/building-an-agentic-code-from-...
| simonw wrote:
| I'm _very_ excited about tool use for LLMs at the moment.
|
| The trick isn't new - I first encountered it with the ReAct
| paper two years ago -
| https://til.simonwillison.net/llms/python-react-pattern - and
| it's since been used for ChatGPT plugins, and recently for MCP,
| and all of the models have been trained with tool use /
| function calls in mind.
|
| What's interesting today is how GOOD the models have got at it.
| o3/o4-mini's amazing search performance is all down to tool
| calling. Even Qwen3 4B (2.6GB from Ollama, runs happily on my
| Mac) can do tool calling reasonably well now.
|
| I gave a workshop at PyCon US yesterday about building software
| on top of LLMs -
| https://simonwillison.net/2025/May/15/building-on-llms/ - and
| used that as an excuse to finally add tool usage to an alpha
| version of my LLM command-line tool. Here's the section of the
| workshop that covered that:
|
| https://building-with-llms-pycon-2025.readthedocs.io/en/late...
|
| My LLM package can now reliably count the Rs in strawberry as a
| shell one-liner:
|
|     llm --functions '
|     def count_char_in_string(char: str, string: str) -> int:
|         """Count the number of times a character appears in a string."""
|         return string.lower().count(char.lower())
|     ' 'Count the number of Rs in the word strawberry' --td
| andrewmcwatters wrote:
| I love the odd combination of silliness and power in this.
| DarmokJalad1701 wrote:
| Was the workshop recorded?
| mukesh610 wrote:
| I built this very same thing today! The only difference is that
| I pushed the tool call outputs into the conversation history
| and resent it to the LLM for it to summarize, or perform
| further tool calls, if necessary, automagically.
|
| I used ollama to build this and ollama supports tool calling
| natively, by passing a `tools=[...]` in the Python SDK. The tools
| can be regular Python functions with docstrings that describe the
| tool use. The SDK handles converting the docstrings into a format
| the LLM can recognize, so my tool's code documentation becomes
| the model's source of truth. I can also include usage examples
| right in the docstring to guide the LLM to work closely with all
| my available tools. No system prompt needed!
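|
| A stripped-down sketch of that loop (assuming the `ollama`
| Python package; the model tag and the `list_files` tool are
| made up for illustration):
|
|     import os
|     import ollama
|
|     def list_files(directory: str) -> str:
|         """List the files in a directory.
|
|         Example: list_files('/tmp')
|         """
|         return "\n".join(os.listdir(directory))
|
|     TOOLS = {"list_files": list_files}
|     messages = [{"role": "user",
|                  "content": "What files are in /tmp?"}]
|     while True:
|         resp = ollama.chat(model="qwen2.5:32b",
|                            messages=messages,
|                            tools=list(TOOLS.values()))
|         messages.append(resp.message)
|         if not resp.message.tool_calls:
|             print(resp.message.content)
|             break
|         for call in resp.message.tool_calls:
|             result = TOOLS[call.function.name](
|                 **call.function.arguments)
|             messages.append({"role": "tool",
|                              "name": call.function.name,
|                              "content": result})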
|
| Moreover, I wrote all my tools in a separate module, and just
| use `inspect.getmembers` to construct the `tools` list that I
| pass to Ollama. So when I need to write a new tool, I just
| write another function in the tools module and it Just Works(tm)
|
| Paired with qwen 32b running locally, I was fairly satisfied
| with the output.
| degamad wrote:
| > The only difference is that i pushed the tool call outputs
| into the conversation history and resent it back to the LLM for
| it to summarize, or perform further tool calls, if necessary,
| automagically.
|
| It looks like this one does that too:
|
|     msg = [handle_tool_call(tc) for tc in tool_calls]
| mukesh610 wrote:
| Ah, failed to notice that.
|
| I was so excited because this was exactly what I coded up
| today, I jumped straight to the comments.
| jawns wrote:
| Not only can this be an effective strategy for coding tasks, but
| it can also be used for data querying. Picture a text-to-SQL
| agent that can query database schemas, construct queries, run
| explain plans, inspect the error outputs, and then loop several
| times to refine. That's the basic architecture behind a tool I
| built, and I have been amazed at how well it works. There have
| been multiple times when I've thought, "Surely it couldn't handle
| THIS prompt," but it does!
|
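| The core of the loop is tiny - a sketch of the refinement step
| (illustrative, not my tool's actual code; ask_llm() is a
| placeholder for whatever model call you use):
|
|     import sqlite3
|
|     db = sqlite3.connect("app.db")
|
|     def ask_llm(prompt: str) -> str:
|         raise NotImplementedError("wire up your model here")
|
|     def answer(question: str, max_tries: int = 5) -> str:
|         schema = "\n".join(r[0] for r in db.execute(
|             "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
|         prompt = (f"Schema:\n{schema}\n\n"
|                   f"Write one SQLite query: {question}")
|         for _ in range(max_tries):
|             query = ask_llm(prompt)
|             try:
|                 return str(db.execute(query).fetchall())
|             except sqlite3.Error as err:
|                 # Feed the error back; let the model refine.
|                 prompt += (f"\n\nYour query:\n{query}\n"
|                            f"failed with: {err}\nTry again.")
|         return "Gave up after several attempts."
|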
| Here's an AWS post that goes into detail about this approach:
| https://aws.amazon.com/blogs/machine-learning/build-a-robust...
| amelius wrote:
| Huh, isn't this already built-in, in most chat UIs?
| BrandiATMuhkuh wrote:
| That's really cool. One week ago I implemented an SQL tool and it
| works really well. But sometimes it still just makes up
| table/column names. Luckily it can read the error and correct
| itself.
|
| But today I went to the next level. I gave the LLM two tools. One
| web search tool and one REST tool.
|
| I told it at what URL it can find API docs. Then I asked it to
| perform some tasks for me.
|
| It was really cool to watch an AI read docs, make API calls and
| try again (REPL-style) until it worked
| rbren wrote:
| If you're interested in hacking on agent loops, come join us in
| the OpenHands community!
|
| Here's our (slightly more complicated) agent loop:
| https://github.com/All-Hands-AI/OpenHands/blob/f7cb2d0f64666...
| tqwhite wrote:
| I've been using Claude Code, i.e. a terminal interface to
| Sonnet 3.7, since the day it came out in mid-March. I have done
| substantial CLI apps, full-stack web systems and a ton of
| utility crap. I am much more ambitious because of it, much as I
| was in the past when I was running a programming team.
|
| I'm sure it is much the same as this under the hood though
| Anthropic has added many insanely useful features.
|
| Nothing is perfect. Producing good code requires about the same
| effort as it did when I was running said team. It is possible to
| get complicated things working and find oneself in a mess where
| adding the next feature is really problematic. As I have learned
| to drive it, I have to do much less remediation and refactoring.
| That will never go away.
|
| I cannot imagine what happened to poor kgeist. I have had Claude
| make choices I wouldn't and do some stupid stuff, never enough
| that I would even think about giving up on it. Almost always, it
| does a decent job and, for most stuff, the amount of work it
| takes off of my brain is IMMENSE.
|
| And, for good measure, it does a wonderful job of refactoring.
| Periodically, I have a session where I look at the code, decide
| how it could be better and instruct Claude. Huge amounts of
| complexity, done. "Change this data structure", done. It's
| amazingly cool.
|
| And, just for fun, I opened it in a non-code archive directory.
| It was a junk drawer that I've been filling for thirty years.
| "What's in this directory?" "Read the old resumes and write a new
| one." "What are my children's names?" Also amazing.
|
| And this is still early days. I am so happy.
| kuahyeow wrote:
| What protection do people use when enabling an LLM to run `bash`
| on your machine ? Do you run it in a Docker container / LXC
| boundary ? `chroot` ?
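|
| For concreteness, the kind of guard I mean - each tool call
| wrapped in a throwaway container (a hypothetical sketch, not
| from the article):
|
|     import os
|     import subprocess
|
|     def sandboxed_bash(command: str) -> str:
|         # Run the agent's command in a network-less, throwaway
|         # container that can only see the project directory.
|         out = subprocess.run(
|             ["docker", "run", "--rm", "--network=none",
|              "-v", f"{os.getcwd()}:/work", "-w", "/work",
|              "ubuntu:24.04", "bash", "-lc", command],
|             capture_output=True, text=True, timeout=120)
|         return out.stdout + out.stderr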
| outworlder wrote:
| > If you don't have some tool installed, it'll install it.
|
| Terrifying. LLMs are very 'accommodating' and all they need is
| someone asking them to do something. This is like SQL injection,
| but worse.
___________________________________________________________________
(page generated 2025-05-15 23:00 UTC)