[HN Gopher] The unreasonable effectiveness of an LLM agent loop ...
       ___________________________________________________________________
        
       The unreasonable effectiveness of an LLM agent loop with tool use
        
       Author : crawshaw
       Score  : 135 points
       Date   : 2025-05-15 19:33 UTC (3 hours ago)
        
 (HTM) web link (sketch.dev)
 (TXT) w3m dump (sketch.dev)
        
       | _bin_ wrote:
       | I've found sonnet-3.7 to be incredibly inconsistent. It can do
       | very well but has a strong tendency to get off-track and run off
       | and do weird things.
       | 
       | 3.5 is better for this, ime. I hooked claude desktop up to an MCP
       | server to fake claude-code less the extortionate pricing and it
       | works decently. I've been trying to apply it for rust work; it's
       | not great yet (still doesn't really seem to "understand" rust's
        | concepts) but can do some stuff if you make it run `cargo
        | check` after each change and stop it when the check fails.
       | 
       | I expect something like o3-high is the best out there (aider
       | leaderboards support this) either alone or in combination with
       | 4.1, but tbh that's out of my price range. And frankly, I can't
       | mentally get past paying a very high price for an LLM response
       | that may or may not be useful; it leaves me incredibly resentful
       | as a customer that your model can fail the task, requiring
       | multiple "re-rolls", and you're passing that marginal cost to me.
        
         | agilebyte wrote:
          | I am avoiding the cost of API access by using the chat UI
          | instead, in my case Google Gemini 2.5 Pro with its large
          | context window. Repomix a whole repo. Paste it in with a standard
         | prompt saying "return full source" (it tends to not follow this
         | instruction after a few back and forths) and then apply the
         | result back on top of the repo (vibe coded
         | https://github.com/radekstepan/apply-llm-changes to help me
         | with that). Else yeah, $5 spent on Cline with Claude 3.7 and
         | instead of fixing my tests, I end up with if/else statements in
         | the source code to make the tests pass.
        
           | harvey9 wrote:
           | Guess it was trained by scraping thedailywtf.com
        
           | actsasbuffoon wrote:
           | I decided to experiment with Claude Code this month. The
           | other day it decided the best way to fix the spec was to add
           | a conditional to the test that causes it to return true
           | before getting to the thing that was actually supposed to be
           | tested.
           | 
           | I'm finding it useful for really tedious stuff like doing
           | complex, multi step terminal operations. For the coding...
           | it's not been great.
        
             | nico wrote:
             | I've had this in different ways many times. Like instead of
             | resolving the underlying issue for an exception, it just
              | suggests catching the exception and carrying on
             | 
             | It also depends a lot on the mix of model and type of code
              | and libraries involved. Even on different days the models
             | seem to be more or less capable (I'm assuming they get
             | throttled internally - this is very noticeable sometimes in
             | how they try to save on output tokens and summarize the
             | code responses as much as possible, at least in the
             | chat/non-api interfaces)
        
             | christophilus wrote:
             | Well, that's proof that it used my GitHub projects in its
             | training data.
        
           | nico wrote:
           | Cool tool. What format does it expect from the model?
           | 
           | I've been looking for something that can take "bare diffs"
           | (unified diffs without line numbers), from the clipboard and
           | then apply them directly on a buffer (an open file in vscode)
           | 
            | None of the paste-diff extensions for vscode work, as they
           | expect a full unified diff/patch
           | 
            | I also tried a google-developed patch tool, but it also
            | wasn't very good at taking in bare diffs, and it definitely
            | couldn't work from the clipboard
        
             | agilebyte wrote:
             | Markdown format with a comment saying what the file path
             | is. So:
             | 
             | This is src/components/Foo.tsx
             | 
             | ```tsx // code goes here ```
             | 
             | OR
             | 
             | ```tsx // src/components/Foo.tsx // code goes here ```
             | 
             | These seem to work the best.
             | 
             | I tried diff syntax, but Gemini 2.5 just produced way too
             | many bugs.
             | 
             | I also tried using regex and creating an AST of the
             | markdown doc and going from there, but ultimately settled
             | on calling gpt-4.1-mini-2025-04-14 with the beginning of
             | the code block (```) and 3 lines before and 3 lines after
             | the beginning of the code block. It's fast/cheap enough to
             | work.
             | 
             | Though I still have to make edits sometimes. WIP.
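              | 
              | The apply step itself is not much code. A rough sketch of
              | the idea (not the actual apply-llm-changes source; it
              | skips the gpt-4.1-mini fallback for ambiguous paths):
              | 
              |     import pathlib
              |     import re
              | 
              |     # Fenced blocks, and a leading "// path" or "# path" comment.
              |     BLOCK_RE = re.compile(r"```[\w+-]*\n(.*?)```", re.DOTALL)
              |     PATH_RE = re.compile(r"^\s*(?://|#)\s*(\S+\.\w+)")
              | 
              |     def apply_llm_output(markdown: str, repo_root: str = ".") -> None:
              |         for block in BLOCK_RE.findall(markdown):
              |             lines = block.splitlines()
              |             m = PATH_RE.match(lines[0]) if lines else None
              |             if not m:
              |                 continue  # no file-path comment, skip this block
              |             target = pathlib.Path(repo_root) / m.group(1)
              |             target.parent.mkdir(parents=True, exist_ok=True)
              |             target.write_text("\n".join(lines[1:]) + "\n")
              |             print(f"wrote {target}")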
        
         | layoric wrote:
         | I've been using Mistral Medium 3 last couple of days, and I'm
         | honestly surprised at how good it is. Highly recommend giving
         | it a try if you haven't, especially if you are trying to reduce
         | costs. I've basically switched from Claude to Mistral and
         | honestly prefer it even if costs were equal.
        
           | nico wrote:
           | How are you running the model? Mistral's api or some local
           | version through ollama, or something else?
        
             | kyleee wrote:
             | Is mistral on open router?
        
               | nico wrote:
               | Yup https://openrouter.ai/provider/mistral
               | 
                | I guess it can't really be run locally:
                | https://www.reddit.com/r/LocalLLaMA/comments/1kgyfif/introdu...
        
         | johnsmith1840 wrote:
          | I seem to be alone in this, but the only models truly good at
          | coding are the slow, heavy test-time-compute ones.
         | 
         | o1-pro and o1-preview are the only models I've ever used that
         | can reliably update and work with 1000 LOC without error.
         | 
         | I don't let o3 write any code unless it's very small. Any
         | "cheap" model will hallucinate or fail massively when pushed.
         | 
          | One good tip I've adopted lately: remove all comments in your
          | code before passing it to an LLM, and don't let LLM-generated
          | comments persist under any circumstance.
        
           | _bin_ wrote:
           | Interesting. I've never tested o1-pro because it's insanely
           | expensive but preview seemed to do okay.
           | 
           | I wouldn't be shocked if huge, expensive-to-run models
           | performed better and if all the "optimized" versions were
           | actually labs trying to ram cheaper bullshit down everyone's
           | throat. Basically chinesium for LLMs; you can afford them but
           | it's not worth it. I remember someone saying o1 was, what,
           | 200B dense? I might be misremembering.
        
             | johnsmith1840 wrote:
             | I'm positive they are pushing users to cheaper models due
             | to cost. o1-pro is now in a sub menu for pro users and
              | labeled legacy. The big inference methods must be stupidly
             | expensive.
             | 
             | o1-preview was and possibly still is the most powerful
             | model they ever released. I only switched to pro for coding
             | after months of them improving it and my api bill getting a
              | bit crazy (like $0.50 per question).
             | 
              | I don't think parameter count matters anymore. I think the
             | only thing that matters is how much compute a vendor will
             | give you per question.
        
       | jbellis wrote:
       | Yes!
       | 
       | Han Xiao at Jina wrote a great article that goes into a lot more
       | detail on how to turn this into a production quality agentic
        | search: https://jina.ai/news/a-practical-guide-to-implementing-deeps...
       | 
       | This is the same principle that we use at Brokk for Search and
       | for Architect. (https://brokk.ai/)
       | 
       | The biggest caveat: some models just suck at tool calling, even
       | "smart" models like o3. I only really recommend Gemini Pro 2.5
       | for Architect (smart + good tool calls); Search doesn't require
       | as high a degree of intelligence and lots of models work (Sonnet
       | 3.7, gpt-4.1, Grok 3 are all fine).
        
         | crawshaw wrote:
         | I'm curious about your experiences with Gemini Pro 2.5 tool
         | calling. I have tried using it in agent loops (in fact, sketch
         | has some rudimentary support I added), and compared with the
         | Anthropic models I have had to actively reprompt Gemini
         | regularly to make tool calls. Do you consider it equivalent to
         | Sonnet 3.7? Has it required some prompt engineering?
        
           | jbellis wrote:
           | Confession time: litellm still doesn't support parallel tool
           | calls with Gemini models
           | [https://github.com/BerriAI/litellm/issues/9686] so we wrote
           | our own "parallel tool calls" on top of Structured Output. It
           | did take a few iterations on the prompt design but all of it
           | was "yeah I can see why that was ambiguous" kinds of things,
           | no real complaints.
           | 
           | GP2.5 does have a different flavor than S3.7 but it's hard to
           | say that one is better or worse than the other [edit: at tool
           | calling -- GP2.5 is definitely smarter in general]. GP2.5 is
           | I would say a bit more aggressive at doing "speculative" tool
           | execution in parallel with the architect, e.g. spawning
           | multiple search agent calls at the same time, which for Brokk
           | is generally a good thing but I could see use cases where
           | you'd want to dial that back.
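            | 
            | The workaround is less exotic than it sounds. Illustrative
            | only, not Brokk's actual code: the structured-output schema
            | is just a list of {tool, args} objects, and you dispatch
            | each entry yourself.
            | 
            |     import json
            | 
            |     def search_code(query: str) -> str:   # stand-in tool
            |         return f"results for {query!r}"
            | 
            |     def read_file(path: str) -> str:      # stand-in tool
            |         return open(path).read()
            | 
            |     TOOLS = {"search_code": search_code, "read_file": read_file}
            | 
            |     # Schema handed to the structured-output API: a list of
            |     # calls instead of a single native function call.
            |     SCHEMA = {
            |         "type": "object",
            |         "properties": {"calls": {"type": "array", "items": {
            |             "type": "object",
            |             "properties": {"tool": {"type": "string"},
            |                            "args": {"type": "object"}},
            |             "required": ["tool", "args"]}}},
            |         "required": ["calls"],
            |     }
            | 
            |     def dispatch(llm_json: str) -> list:
            |         # llm_json: model output conforming to SCHEMA
            |         calls = json.loads(llm_json)["calls"]
            |         return [TOOLS[c["tool"]](**c["args"]) for c in calls]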
        
       | kgeist wrote:
       | Today I tried "vibe-coding" for the first time using GPT-4o and
       | 4.1. I did it manually - just feeding compilation errors,
       | warnings, and suggestions in a loop via the canvas interface. The
       | file was small, around 150 lines.
       | 
       | It didn't go well. I started with 4o:
       | 
       | - It used a deprecated package.
       | 
       | - After I pointed that out, it didn't update all usages - so I
       | had to fix them manually.
       | 
       | - When I suggested a small logic change, it completely broke the
       | syntax (we're talking "foo() } return )))" kind of broken) and
       | never recovered. I gave it the raw compilation errors over and
       | over again, but it didn't even register the syntax was off - just
       | rewrote random parts of the code instead.
       | 
       | - Then I thought, "maybe 4.1 will be better at coding" (as
        | advertised). But 4.1 refused to use the canvas at all. It just
       | explained what I could change - as in, you go make the edits.
       | 
       | - After some pushing, I got it to use the canvas and return the
       | full code. Except it didn't - it gave me a truncated version of
       | the code with comments like "// omitted for brevity".
       | 
       | That's when I gave up.
       | 
       | Do agents somehow fix this? Because as it stands, the experience
       | feels completely broken. I can't imagine giving this access to
       | bash, sounds way too dangerous.
        
         | vFunct wrote:
         | Use Claude Sonnet with an IDE.
        
         | 85392_school wrote:
         | Agents definitely fix this. When you can run commands and edit
         | files, the agent can test its code by itself and fix any
         | issues.
        
         | visarga wrote:
         | You should try Cursor or Windsurf, with Claude or Gemini model.
         | Create a documentation file first. Generate tests for
         | everything. The more the better. Then let it cycle 100 times
         | until tests pass.
         | 
         | Normal programming is like walking, deliberate and sure. Vibe
         | coding is like surfing, you can't control everything, just hit
         | yes on auto. Trust the process, let it make mistakes and
         | recover on its own.
        
           | prisenco wrote:
           | Given that analogy, surely you could understand why someone
           | would much rather walk than surf to their destination?
           | Especially people who are experienced marathon runners.
        
             | fragmede wrote:
              | If I tried standing up on the waves without a surfboard
              | and complained about how it's not working, would you
              | blame the water or surfing for the issue, or the person
              | trying to defy physics? It doesn't matter how much I want
              | to run or if I'm Kelvin Kiptum, I'm gonna have a bad
              | time.
        
               | prisenco wrote:
               | That only makes sense when surfing is the only way to get
               | to the destination and that's not the case.
        
           | tqwhite wrote:
           | I find that writing a thorough design spec is really worth
           | it. Also, asking for its reaction. "What's missing?" "Should
           | I do X or Y" does good things for its thought process, like
           | engaging a younger programmer in the process.
           | 
           | Definitely, I ask for a plan and then, even if it's obvious,
            | I ask questions and discuss it. I also point it at samples of
            | code that I like, with notes on what is good about them.
           | 
           | Once we have settled on a plan, I ask it to break it into
            | phases that can be tested (I am not one for unit testing)
           | to lock in progress. Claude LOVES that. It organizes a new
           | plan and, at the end of each phase tells me how to test
           | (curl, command line, whatever is appropriate) and what I
           | should see that represents success.
           | 
           | The most important thing I have figured out is that Claude is
           | a collaborator, not a minion. I agree with visarga, it's much
            | more like surfing than walking. Also, Trust... but Verify.
           | 
           | This is a great time to be a programmer.
        
         | abiraja wrote:
         | GPT4o and 4.1 are definitely not the best models to use here.
         | Use Claude 3.5/3.7, Gemini Pro 2.5 or o3. All of them work
         | really well for small files.
        
         | theropost wrote:
          | 150 lines? I find it can quickly scale to around 1500 lines,
          | and then I get more precise about the classes and functions I
          | am looking to modify
        
           | jokethrowaway wrote:
           | It's completely broken for me over 400 lines (Claude 3.7,
           | paid Cursor)
           | 
           | The worst is when I ask something complex, the model
            | generates 300 lines of good code and then times out or
           | crashes. If I ask to continue it will mess up the code for
           | good, eg. starts generating duplicated code or functions
           | which don't match the rest of the code.
        
             | johnsmith1840 wrote:
             | It's a new skill that takes time to learn. When I started
             | on gpt3.5 it took me easily 6 months of daily use before I
             | was making real progress with it.
             | 
             | I regularly generate and run in the 600-1000LOC range.
             | 
             | Not sure you would call it "vibe coding" though as the
             | details and info you provide it and how you provide it is
             | not simple.
             | 
             | I'd say realistically it speeds me up 10x on fresh
             | greenfield projects and maybe 2x on mature systems.
             | 
             | You should be reading the code coming out. The real way to
              | prevent errors is to read the reasoning and logic. The moment
              | you see a misstep, go back and try the prompt again. If that
             | fails try a new session entirely.
             | 
             | Test time compute models like o1-pro or the older
             | o1-preview are massively better at not putting errors in
             | your code.
             | 
             | Not sure about the new claude method but true, slow test
             | time models are MASSIVELY better at coding.
        
             | fragmede wrote:
             | what language?
        
             | koakuma-chan wrote:
             | Sounds like a Cursor issue
        
             | tqwhite wrote:
             | Definitely a new skill to learn. Everyone I know that is
             | having problems is just telling it what to do, not coaching
              | it. It is not an automaton... instructions in, code out.
             | Treat it like a team member that will do the work if you
             | teach it right and you will have much more success.
             | 
              | But it is definitely a learning process for you.
        
         | nico wrote:
         | 4o and 4.1 are not very good at coding
         | 
          | My best results are usually with o4-mini-high; o3 is sometimes
         | pretty good
         | 
         | I personally don't like the canvas. I prefer the output on the
         | chat
         | 
         | And a lot of times I say: provide full code for this file, or
         | provide drop-in replacement (when I don't want to deal with all
         | the diffs). But usually at around 300-400 lines of code, it
         | starts getting bad and then I need to refactor to break stuff
         | up into multiple files (unless I can focus on just one method
         | inside a file)
        
           | manmal wrote:
           | o3 is shockingly good actually. I can't use it often due to
           | rate limiting, so I save it for the odd occasion. Today I
           | asked it how I could integrate a tree of Swift binary
           | packages within an SDK, and detect internal version clashes,
           | and it gave a very well researched and sensible overview. And
           | gave me a new idea that I'll try.
        
             | kenjackson wrote:
             | I use o3 for anything math or coding related. 4o is good
             | for things like, "my knee hurts when I do this and that --
             | what might it be?"
        
             | hnhn34 wrote:
             | Just in case you didn't know, they raised the rate limit
             | from ~50/week to ~50/day a while ago
        
           | johnsmith1840 wrote:
            | Drop-in replacement files per update should be done with the
            | heavy test-time-compute models.
           | 
           | o1-pro, o1-preview can generate updated full file responses
           | into the 1k LOC range.
           | 
           | It's something about their internal verification methods that
           | make it an actual viable development method.
        
             | nico wrote:
             | True. Also, the APIs don't care too much about restricting
              | output length; they might actually be more verbose to
             | charge more
             | 
             | It's interesting how the same model being served through
             | different interfaces (chat vs api), can behave differently
             | based on the economic incentives of the providers
        
         | smcleod wrote:
         | GPT 4o and 4.1 are both pretty terrible for coding to be
         | honest, try Sonnet 3.7 in Cline (VSCode extension).
         | 
          | LLMs don't have up to date knowledge of packages by themselves
          | - that's a bit like buying a book and expecting it to have up
          | to date world knowledge. You need to supplement / connect them
          | to a data source (e.g. web search, documentation and package
          | version search etc.).
        
         | hollownobody wrote:
         | Try o3 please. Via UI.
        
         | koakuma-chan wrote:
         | You gotta use a reasoning model.
        
         | Jarwain wrote:
         | Aider's benchmarks show 4.1 (and 4o) work better in its
         | architect mode, for planning the changes, and o3 for making the
         | actual edits
        
         | thorum wrote:
         | GPT 4.1 and 4o score very low on the Aider coding benchmark.
         | You only start to get acceptable results with models that score
         | 70%+ in my experience. Even then, don't expect it to do
         | anything complex without a lot of hand-holding. You start to
         | get a sense for what works and what doesn't.
         | 
         | https://aider.chat/docs/leaderboards/
        
         | ebiester wrote:
         | I get that it's frustrating to be told "skill issue," but using
         | an LLM is absolutely a skill and there's a combination of
         | understanding the strengths of various tools, experimenting
         | with them to understand the techniques, and just pure practice.
         | 
         | I think if I were giving access to bash, though, it would
         | definitely be in a docker container for me as well.
        
         | fsndz wrote:
          | It can be frustrating at times, but my experience is that the
          | more you try, the better you become at knowing what to ask and
          | what to expect. But I guess you understand now why some people
          | say vibe coding is a bit overrated:
          | https://www.lycee.ai/blog/why-vibe-coding-is-overrated
        
           | the_af wrote:
           | "Overrated" is one way to call it.
           | 
           | Giving sharp knives to monkeys would be another.
        
         | simonw wrote:
         | "It used a deprecated package"
         | 
         | That's because models have training cut-off dates. It's
         | important to take those into account when working with them:
         | https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...
         | 
         | I've switched to o4-mini-high via ChatGPT as my default model
         | for a lot of code because it can use its search function to
         | lookup the latest documentation.
         | 
         | You can tell it "look up the most recent version of library X
         | and use that" and it will often work!
         | 
         | I even used it for a frustrating upgrade recently - I pasted in
         | some previous code and prompted this:
         | 
         |  _This code needs to be upgraded to the new recommended
         | JavaScript library from Google. Figure out what that is and
         | then look up enough documentation to port this code to it._
         | 
         | It did exactly what I asked:
         | https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...
        
           | kgeist wrote:
           | >That's because models have training cut-off dates
           | 
           | When I pointed out that it used a deprecated package, it
           | agreed and even cited the correct version after which it was
           | deprecated (way back in 2021). So it knows it's deprecated,
           | but the next-token prediction (without reasoning or tools)
           | still can't connect the dots when much of the training data
           | (before 2021) uses that package as if it's still acceptable.
           | 
           | >I've switched to o4-mini-high via ChatGPT as my default
           | model for a lot of code because it can use its search
           | function to lookup the latest documentation.
           | 
           | Thanks for the tip!
        
             | fragmede wrote:
             | There's still skill involved with using the LLM in coding.
             | In this case, o4-mini-high might do the trick, but the
              | easier answer that works with other models is to include
              | the high-level library documentation yourself as context
             | and it'll use that API.
        
             | jmcpheron wrote:
             | >I've switched to o4-mini-high via ChatGPT as my default
             | model for a lot of code because it can use its search
             | function to lookup the latest documentation.
             | 
             | That is such a useful distinction. I like to think I'm
             | keeping up with this stuff, but the '4o' versus 'o4' still
             | throws me.
        
         | LewisVerstappen wrote:
         | skill issue.
         | 
         | The fact that you're using 4o and 4.1 rather than claude is
         | already a huge mistake in itself.
         | 
         | > Because as it stands, the experience feels completely broken
         | 
         | Broken for you. Not for everyone else.
        
         | codethief wrote:
         | The other day I used the Cline plugin for VSCode with Claude to
         | create an Android app prototype from "scratch", i.e. starting
         | from the usual template given to you by Android Studio. It
         | produced several thousand lines of code, there was not a single
         | compilation error, and the app ended up doing exactly what I
         | wanted - modulo a bug or two, which were caused not by the
         | LLM's stupidity but by weird undocumented behavior of the
         | rather arcane Android API in question. (Which is exactly why I
         | wanted a quick prototype.)
         | 
         | After pointing out the bugs to the LLM, it successfully
         | debugged them (with my help/feedback, i.e. I provided the
         | output of the debug messages it had added to the code) and
         | ultimately fixed them. The only downside was that I wasn't
         | quite happy with the quality of the fixes - they were more like
          | dirty hacks - but oh well, after another round or two of
         | feedback we got there, too. I'm sure one could solve that more
         | generally, by putting the agent writing the code in a loop with
         | some other code reviewing agent.
        
         | fragmede wrote:
         | In this case, sorry to say but it sounds like there's a tooling
         | issue, possibly also a skill issue. Of course you can just use
         | the raw ChatGPT web interface but unless you seriously tune its
         | system/user prompt, it's not going to match what good tooling
         | (which sets custom prompts) will get you. Which is kind of
         | counter-intuitive. A paragraph or three fed in as the system
         | prompt is enough to influence behavior/performance so
         | significantly? It turns out with LLMs the answer is yes.
        
         | danbmil99 wrote:
         | As others have noted, you sound about 3 months behind the
         | leading edge. What you describe is like my experience from
         | February.
         | 
         | Switch to Claude (IMSHO, I think Gemini is considered on par).
        | Use a proper coding tool; cutting & pasting from the chat
         | window is so last week.
        
       | libraryofbabel wrote:
       | Strongly recommend this blog post too which is a much more
       | detailed and persuasive version of the same point. The author
       | actually goes and builds a coding agent from zero:
       | https://ampcode.com/how-to-build-an-agent
       | 
       | It is indeed astonishing how well a loop with an LLM that can
       | call tools works for all kinds of tasks now. Yes, sometimes they
       | go off the rails, there is the problem of getting that last 10%
       | of reliability, etc. etc., but if you're not at least a little
        | bit amazed then I urge you to go and hack together something like
       | this yourself, which will take you about 30 minutes. It's
       | possible to have a sense of wonder about these things without
       | giving up your healthy skepticism of whether AI is actually going
       | to be effective for this or that use case.
       | 
       | This "unreasonable effectiveness" of putting the LLM in a loop
       | also accounts for the enormous proliferation of coding agents out
       | there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider,
       | Codex... and a ton of also-rans; as one HN poster put it the
       | other day, it seems like everyone and their mother is writing
       | one. The reason is that there is no secret sauce and 95% of the
       | magic is in the LLM itself and how it's been fine-tuned to do
       | tool calls. One of the lead developers of Claude Code candidly
       | admits this in a recent interview.[0] Of course, a ton of work
       | goes into making these tools work well, but ultimately they all
       | have the same simple core.
       | 
       | [0] https://www.youtube.com/watch?v=zDmW5hJPsvQ
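        | 
        | If you want a concrete sense of how small that core is, the
        | whole loop is roughly this (a sketch assuming the Anthropic
        | Python SDK's tool-use interface; any provider with function
        | calling looks much the same):
        | 
        |     import subprocess
        |     import anthropic
        | 
        |     client = anthropic.Anthropic()
        |     TOOLS = [{
        |         "name": "bash",
        |         "description": "Run a shell command and return its output.",
        |         "input_schema": {"type": "object",
        |                          "properties": {"command": {"type": "string"}},
        |                          "required": ["command"]},
        |     }]
        | 
        |     def run_bash(command: str) -> str:
        |         p = subprocess.run(command, shell=True, capture_output=True,
        |                            text=True, timeout=120)
        |         return (p.stdout + p.stderr)[-10000:]  # keep the tail
        | 
        |     def agent(task: str):
        |         messages = [{"role": "user", "content": task}]
        |         while True:
        |             resp = client.messages.create(
        |                 model="claude-3-7-sonnet-latest", max_tokens=4096,
        |                 messages=messages, tools=TOOLS)
        |             messages.append({"role": "assistant", "content": resp.content})
        |             if resp.stop_reason != "tool_use":
        |                 return resp  # model is done
        |             results = []
        |             for block in resp.content:
        |                 if block.type == "tool_use":
        |                     out = run_bash(block.input["command"])
        |                     results.append({"type": "tool_result",
        |                                     "tool_use_id": block.id,
        |                                     "content": out})
        |             messages.append({"role": "user", "content": results})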
        
         | sesm wrote:
         | Should we change the link above to use
         | `?utm_source=hn&utm_medium=browser` before opening it?
        
           | libraryofbabel wrote:
           | fixed :)
        
         | kcorbitt wrote:
         | For "that last 10% of reliability" RL is actually working
          | pretty well right now too!
          | https://openpipe.ai/blog/art-e-mail-agent
        
         | meander_water wrote:
         | There's also this one which uses pocketflow, a graph
         | abstraction library to create something similar [0]. I've been
         | using it myself and love the simplicity of it.
         | 
          | [0] https://github.com/The-Pocket/PocketFlow-Tutorial-Cursor/blo...
        
         | wepple wrote:
         | Ah, it's Thorsten Ball!
         | 
         | I thoroughly enjoyed his "writing an interpreter". I guess I'm
         | going to build an agent now.
        
       | bhouston wrote:
       | I found this out too - it is quite easy and effective:
       | 
       | https://benhouston3d.com/blog/building-an-agentic-code-from-...
        
       | simonw wrote:
       | I'm _very_ excited about tool use for LLMs at the moment.
       | 
        | The trick isn't new - I first encountered it with the ReAct paper
        | two years ago - https://til.simonwillison.net/llms/python-react-pattern
        | - and it's since been used for ChatGPT plugins, and
       | recently for MCP, and all of the models have been trained with
       | tool use / function calls in mind.
       | 
       | What's interesting today is how GOOD the models have got at it.
       | o3/o4-mini's amazing search performance is all down to tool
       | calling. Even Qwen3 4B (2.6GB from Ollama, runs happily on my
       | Mac) can do tool calling reasonably well now.
       | 
       | I gave a workshop at PyCon US yesterday about building software
        | on top of LLMs - https://simonwillison.net/2025/May/15/building-on-llms/
        | - and used that as an excuse to finally add tool usage
       | to an alpha version of my LLM command-line tool. Here's the
       | section of the workshop that covered that:
       | 
       | https://building-with-llms-pycon-2025.readthedocs.io/en/late...
       | 
       | My LLM package can now reliably count the Rs in strawberry as a
        | shell one-liner:
        | 
        |     llm --functions '
        |     def count_char_in_string(char: str, string: str) -> int:
        |         """Count the number of times a character appears in a string."""
        |         return string.lower().count(char.lower())
        |     ' 'Count the number of Rs in the word strawberry' --td
        
         | andrewmcwatters wrote:
         | I love the odd combination of silliness and power in this.
        
         | DarmokJalad1701 wrote:
         | Was the workshop recorded?
        
       | mukesh610 wrote:
        | I built this very same thing today! The only difference is that I
        | pushed the tool call outputs into the conversation history and
        | resent them to the LLM for it to summarize, or perform further
        | tool calls if necessary, automagically.
       | 
       | I used ollama to build this and ollama supports tool calling
       | natively, by passing a `tools=[...]` in the Python SDK. The tools
       | can be regular Python functions with docstrings that describe the
       | tool use. The SDK handles converting the docstrings into a format
       | the LLM can recognize, so my tool's code documentation becomes
       | the model's source of truth. I can also include usage examples
       | right in the docstring to guide the LLM to work closely with all
       | my available tools. No system prompt needed!
       | 
       | Moreover, I wrote all my tools in a separate module, and just use
        | `inspect.getmembers` to construct the `tools` list that I pass to
       | Ollama. So when I need to write a new tool, I just write another
       | function in the tools module and it Just Works(tm)
       | 
       | Paired with qwen 32b running locally, i was fairly satisfied with
       | the output.
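        | 
        | Roughly like this (not my exact code; `my_tools` here is a
        | stand-in for the separate tools module):
        | 
        |     import inspect
        |     import ollama
        |     import my_tools  # module holding the tool functions
        | 
        |     TOOL_FUNCS = dict(inspect.getmembers(my_tools, inspect.isfunction))
        | 
        |     def chat(prompt: str, model: str = "qwen2.5:32b") -> str:
        |         messages = [{"role": "user", "content": prompt}]
        |         while True:
        |             resp = ollama.chat(model=model, messages=messages,
        |                                tools=list(TOOL_FUNCS.values()))
        |             messages.append(resp.message)  # keep the assistant turn
        |             if not resp.message.tool_calls:
        |                 return resp.message.content
        |             for call in resp.message.tool_calls:
        |                 result = TOOL_FUNCS[call.function.name](
        |                     **call.function.arguments)
        |                 messages.append({"role": "tool",
        |                                  "name": call.function.name,
        |                                  "content": str(result)})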
        
         | degamad wrote:
         | > The only difference is that i pushed the tool call outputs
         | into the conversation history and resent it back to the LLM for
         | it to summarize, or perform further tool calls, if necessary,
         | automagically.
         | 
          | It looks like this one does that too.
          | 
          |     msg = [ handle_tool_call(tc) for tc in tool_calls ]
        
           | mukesh610 wrote:
           | Ah, failed to notice that.
           | 
           | I was so excited because this was exactly what I coded up
           | today, I jumped straight to the comments.
        
       | jawns wrote:
       | Not only can this be an effective strategy for coding tasks, but
       | it can also be used for data querying. Picture a text-to-SQL
       | agent that can query database schemas, construct queries, run
       | explain plans, inspect the error outputs, and then loop several
       | times to refine. That's the basic architecture behind a tool I
       | built, and I have been amazed at how well it works. There have
       | been multiple times when I've thought, "Surely it couldn't handle
       | THIS prompt," but it does!
       | 
       | Here's an AWS post that goes into detail about this approach:
       | https://aws.amazon.com/blogs/machine-learning/build-a-robust...
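        | 
        | Heavily simplified, the loop looks something like this (not the
        | actual tool; llm() stands in for whatever chat-completion call
        | you use, and the example uses SQLite for brevity):
        | 
        |     import sqlite3
        | 
        |     def llm(prompt: str) -> str:
        |         raise NotImplementedError("plug in your model call here")
        | 
        |     def answer(question: str, db_path: str, max_tries: int = 5):
        |         conn = sqlite3.connect(db_path)
        |         schema = "\n".join(row[0] for row in conn.execute(
        |             "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
        |         feedback = ""
        |         for _ in range(max_tries):
        |             sql = llm(f"Schema:\n{schema}\n\nQuestion: {question}\n"
        |                       f"{feedback}\nReturn a single SQL query only.")
        |             try:
        |                 conn.execute("EXPLAIN " + sql)       # cheap sanity check
        |                 return conn.execute(sql).fetchall()  # run it for real
        |             except sqlite3.Error as e:
        |                 feedback = (f"Previous attempt:\n{sql}\n"
        |                             f"Error: {e}\nFix it and try again.")
        |         raise RuntimeError("gave up after retries")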
        
       | amelius wrote:
       | Huh, isn't this already built-in, in most chat UIs?
        
       | BrandiATMuhkuh wrote:
       | That's really cool. One week ago I implemented an SQL tool and it
       | works really well. But sometimes it still just makes up
       | table/column names. Luckily it can read the error and correct
       | itself.
       | 
       | But today I went to the next level. I gave the LLM two tools. One
       | web search tool and one REST tool.
       | 
       | I told it at what URL it can find API docs. Then I asked it to
       | perform some tasks for me.
       | 
       | It was really cool to watch an AI read docs, make api calls and
       | try again (REPL) until it worked
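        | 
        | The REST tool itself doesn't need to be much more than this (a
        | sketch, not my exact code): the model picks the method, URL and
        | body, and gets back the status plus body text so it can read
        | errors and retry.
        | 
        |     import requests
        | 
        |     def rest_call(method: str, url: str, json_body: dict | None = None,
        |                   headers: dict | None = None) -> str:
        |         resp = requests.request(method, url, json=json_body,
        |                                 headers=headers, timeout=30)
        |         return f"HTTP {resp.status_code}\n{resp.text[:5000]}"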
        
       | rbren wrote:
       | If you're interested in hacking on agent loops, come join us in
       | the OpenHands community!
       | 
       | Here's our (slightly more complicated) agent loop:
       | https://github.com/All-Hands-AI/OpenHands/blob/f7cb2d0f64666...
        
       | tqwhite wrote:
        | I've been using Claude Code, i.e. a terminal interface to Sonnet
       | 3.7 since the day it came out in mid March. I have done
       | substantial CLI apps, full stack web systems and a ton of utility
       | crap. I am much more ambitious because of it, much as I was in
       | the past when I was running a programming team.
       | 
       | I'm sure it is much the same as this under the hood though
       | Anthropic has added many insanely useful features.
       | 
       | Nothing is perfect. Producing good code requires about the same
       | effort as it did when I was running said team. It is possible to
       | get complicated things working and find oneself in a mess where
       | adding the next feature is really problematic. As I have learned
       | to drive it, I have to do much less remediation and refactoring.
       | That will never go away.
       | 
       | I cannot imagine what happened to poor kgeist. I have had Claude
       | make choices I wouldn't and do some stupid stuff, never enough
       | that I would even think about giving up on it. Almost always, it
        | does a decent job and, for most stuff, the amount of work it
       | takes off of my brain is IMMENSE.
       | 
       | And, for good measure, it does a wonderful job of refactoring.
       | Periodically, I have a session where I look at the code, decide
       | how it could be better and instruct Claude. Huge amounts of
       | complexity, done. "Change this data structure", done. It's
       | amazingly cool.
       | 
       | And, just for fun, I opened it in a non-code archive directory.
       | It was a junk drawer that I've been filling for thirty years.
       | "What's in this directory?" "Read the old resumes and write a new
       | one." "What are my children's names?" Also amazing.
       | 
       | And this is still early days. I am so happy.
        
       | kuahyeow wrote:
        | What protection do people use when enabling an LLM to run `bash`
        | on their machine? Do you run it in a Docker container / LXC
        | boundary? `chroot`?
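        | 
        | E.g. is wrapping the bash tool in a throwaway container
        | considered enough? Something like this (illustrative; the image
        | name and limits are arbitrary):
        | 
        |     import subprocess
        | 
        |     def run_bash_sandboxed(command: str, workdir: str) -> str:
        |         # Each command runs in a fresh container with only the
        |         # project directory mounted and no network access.
        |         p = subprocess.run(
        |             ["docker", "run", "--rm", "--network=none",
        |              "--memory=1g", "--cpus=1",
        |              "-v", f"{workdir}:/workspace", "-w", "/workspace",
        |              "ubuntu:24.04", "bash", "-lc", command],
        |             capture_output=True, text=True, timeout=300)
        |         return p.stdout + p.stderr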
        
       | outworlder wrote:
       | > If you don't have some tool installed, it'll install it.
       | 
       | Terrifying. LLMs are very 'accommodating' and all they need is
       | someone asking them to do something. This is like SQL injection,
       | but worse.
        
       ___________________________________________________________________
       (page generated 2025-05-15 23:00 UTC)