[HN Gopher] LLM function calls don't scale; code orchestration i...
       ___________________________________________________________________
        
       LLM function calls don't scale; code orchestration is simpler, more
       effective
        
       Author : jngiam1
       Score  : 132 points
       Date   : 2025-05-21 17:18 UTC (5 hours ago)
        
 (HTM) web link (jngiam.bearblog.dev)
 (TXT) w3m dump (jngiam.bearblog.dev)
        
       | avereveard wrote:
        | That's kind of the entire premise of Hugging Face's smolagents,
        | and while it does work really well when it works, it also
        | increases the challenge of rolling back failed actions.
       | 
        | I guess one could in principle wrap the entire execution block
        | in a distributed transaction, but LLMs try to make code that is
        | robust, which works against this pattern because it makes
        | failures hard to understand.
        
         | jngiam1 wrote:
          | Agreed, the smolagents premise is good; but the hard part is
          | handling execution, errors, etc.
         | 
         | For example, when the code execution fails mid-way, we really
         | want the model to be able to pick up from where it failed (with
         | the states of the variables at the time of failure) and be able
         | to continue from there.
         | 
          | We've found that the LLM is able to generate correct code that
          | picks up gracefully. The hard part now is building the runtime
          | that makes that possible; we have something that works pretty
          | well in many cases now in production at Lutra.
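          | 
          | To make the idea concrete, a rough sketch (illustrative only,
          | not the actual Lutra runtime): run each generated step with
          | exec() against a persistent namespace, so on failure the
          | variables survive and the model can be asked to continue from
          | them.
          | 
          |     state = {}
          | 
          |     def run_step(code: str):
          |         """Run LLM-generated code; on failure, keep state."""
          |         try:
          |             exec(code, state)  # assigned names stay in `state`
          |             return None
          |         except Exception as e:
          |             return e           # error goes back to the model
          | 
          |     err = run_step(
          |         "items = [3, 1, 2]\nordered = sorted(items)\nx = 1/0"
          |     )
          |     # `items` and `ordered` survive in `state`; show repr(err)
          |     # plus the live variable names to the model so it can
          |     # generate a follow-up step that resumes from here.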
        
           | hooverd wrote:
           | Could you implement an actual state machine and have your
           | agent work with that?
        
             | avereveard wrote:
              | That's the LangGraph idea: each LangGraph node can then be
              | a smolagent.
              | 
              | Latency, though, would be unbearable for real time.
        
           | avereveard wrote:
            | I think in principle you can make the entire API exposed to
            | the LLM idempotent, so that it becomes irrelevant to the
            | backend whether the LLM replays the whole action or just the
            | failed steps.
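            | 
            | A rough sketch of the idempotency-key pattern (hypothetical
            | names; real storage would be durable, not a dict): derive a
            | key from the call, so a replay returns the stored result
            | instead of repeating the side effect.
            | 
            |     import hashlib, json
            | 
            |     _done = {}
            | 
            |     def idempotent(fn):
            |         def wrapper(*args, **kwargs):
            |             key = hashlib.sha256(json.dumps(
            |                 [fn.__name__, args, kwargs],
            |                 sort_keys=True, default=str,
            |             ).encode()).hexdigest()
            |             if key not in _done:
            |                 _done[key] = fn(*args, **kwargs)  # runs once
            |             return _done[key]
            |         return wrapper
            | 
            |     @idempotent
            |     def create_document(title):
            |         print("creating", title)  # imagine a real write here
            |         return "doc::" + title
            | 
            |     create_document("Q2 review")  # performs the write
            |     create_document("Q2 review")  # replay: cached, no write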
        
             | jngiam1 wrote:
             | That'd work well for read-only APIs, but we also want the
             | LLMs to be able to update data, create documents, etc.
             | Feels a bit harder when there are side-effects.
        
       | abelanger wrote:
       | > Most execution environments are stateful (e.g., they may rely
       | on running Jupyter kernels for each user session). This is hard
       | to manage and expensive if users expect to be able to come back
       | to AI task sessions later. A stateless-but-persistent execution
       | environment is paramount for long running (multi-day) task
       | sessions.
       | 
       | It's interesting how architectural patterns built at large tech
       | companies (for completely different use-cases than AI) have
       | become so relevant to the AI execution space.
       | 
        | You see a lot of AI startups learning the hard way the value of
        | event sourcing and (eventually) durable execution, but these
        | patterns aren't commonly adopted on Day 1. I blame the AI
        | frameworks.
       | 
       | (disclaimer - currently working on a durable execution platform)
        
         | th0ma5 wrote:
         | I see all of this as a constant negotiation of what is and
         | isn't needed out of traditional computing. Eventually they find
         | that what they want from any of it is determinism,
         | unfortunately for LLMs.
        
       | visarga wrote:
       | Maybe we just need models that can reference spans by start:end
       | range. Then they can pass arguments by reference instead of
       | explicit quotation. We can use these spans as answers in
       | extractive QA tasks, or as arguments for a code block, or to
       | construct a graph from pointers, and do graph computation. If we
       | define a "hide span" operation the LLM could dynamically open and
       | close its context, which could lead to context size reduction.
        | Basically: add explicit indexing to context memory and make it
        | powerful, and the LLM can act like a CPU.
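        | 
        | A tiny sketch of how a runtime might resolve such references
        | (the span format is made up): the model emits start:end ranges
        | and the runtime substitutes the underlying text.
        | 
        |     context = "ticket #4521: login times out under load"
        | 
        |     def resolve(ref, ctx=context):
        |         start, end = ref["span"]
        |         return ctx[start:end]
        | 
        |     # model output: a reference instead of a quotation
        |     args = {"summary": {"span": [0, 12]}}
        |     resolved = {k: resolve(v) for k, v in args.items()}
        |     print(resolved)  # {'summary': 'ticket #4521'}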
        
       | hintymad wrote:
        | I feel that the optimal solution is hybrid, not polarized. That
        | is, we use deterministic approaches as much as we can, but
        | leverage LLMs to handle the remaining complex parts that are
        | hard to spec out or describe deterministically.
        
         | jngiam1 wrote:
          | Yes - in particular, I think one interesting angle is using
          | the LLM to generate deterministic approaches (code). Then, if
          | the code works, save it for future use and it becomes
          | deterministic moving forward.
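          | 
          | A minimal sketch of that loop (generate_code() stands in for
          | the LLM call; persistence here is just a dict):
          | 
          |     saved = {}
          | 
          |     def generate_code(task):
          |         # placeholder for an LLM call that writes code
          |         return ("def run(rows):\n"
          |                 "    return sorted(rows,"
          |                 " key=lambda r: r['priority'])")
          | 
          |     def get_tool(task):
          |         if task not in saved:
          |             saved[task] = generate_code(task)  # happens once
          |         ns = {}
          |         exec(saved[task], ns)  # deterministic from now on
          |         return ns["run"]
          | 
          |     sort_tickets = get_tool("sort tickets by priority")
          |     print(sort_tickets([{"priority": 2}, {"priority": 1}]))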
        
           | hintymad wrote:
           | Yes, and the other way around: use the deterministic methods
           | to generate the best possible input to LLM.
        
             | seunosewa wrote:
             | Can you give an example so we can visualise this?
        
               | hintymad wrote:
                | For instance, in an AIOps project we still run a
                | number of time series algorithms and then feed the
                | results, along with the original time series data, to
                | the LLM. The LLM produces much more relevant and
                | in-depth analysis than it would with the raw data
                | alone as input.
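                | 
                | A small sketch of the shape of that (toy numbers, no
                | real LLM call): compute the findings first, then put
                | both the findings and the raw series in the prompt.
                | 
                |     import statistics
                | 
                |     series = [12, 13, 12, 14, 13, 55, 12, 13]
                |     mean = statistics.mean(series)
                |     sd = statistics.stdev(series)
                |     spikes = [(i, v) for i, v in enumerate(series)
                |               if abs(v - mean) > 2 * sd]
                | 
                |     prompt = (
                |         f"Raw series: {series}\n"
                |         f"Mean={mean:.1f}, stdev={sd:.1f}, "
                |         f"anomalies={spikes}\n"
                |         "Explain likely causes and what to check next."
                |     )
                |     # send `prompt` to the LLM, not the raw data alone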
        
         | nowittyusername wrote:
         | I agree. You want to use as little LLM as possible in your
         | workflows.
        
           | mort96 wrote:
           | I've been developing software for decades without LLMs, turns
           | out you can get away with very little!
        
       | obiefernandez wrote:
       | My team at Shopify just open sourced Roast [1] recently. It lets
       | us embed non-deterministic LLM jobs within orchestrated
       | workflows. Essential when trying to automate work on codebases
       | with millions of lines of code.
       | 
       | [1] https://github.com/shopify/roast
        
         | drewda wrote:
         | Nice to see Ruby continuing to exist and deliver... even in the
         | age of "AI"
        
         | TheTaytay wrote:
         | Wow - Roast looks fantastic. You architected and put names and
         | constraints on some things that I've been wrestling with for a
         | while. I really like how you are blending the determinism and
         | non-determinism. (One thing that is not obvious to me after
         | reading the README a couple of times (quickly), is whether/how
         | the LLM can orchestrate multiple tool calls if necessary and
         | make decisions about which tools to call in which order. It
         | seems like it does when you tell it to refactor, but I couldn't
         | tell if this would be suitable for the task of "improve, then
         | run tests. Repeat until done.")
        
       | codyb wrote:
        | I'm slightly confused as to why you'd use an LLM to sort
        | structured data in the first place?
        
         | jngiam1 wrote:
         | The goal is to do more complex data processing, like build
         | dashboards, agentically figure out which tickets are stalled,
         | do a quarterly review of things done, etc. Sorting is a tiny
         | task in the bigger ones, but hopefully more easily exemplifies
         | the problem.
        
           | kikimora wrote:
            | I don't understand how this can work. Given the probabilistic
            | nature of LLMs, the more steps you have, the more chances
            | something goes off. What good is a dashboard if you cannot be
            | sure it was not partially hallucinated?
        
             | staunton wrote:
              | > What good is a dashboard if you cannot be sure it was
              | not partially hallucinated?
              | 
              | A lot of the time the dashboard contents don't actually
              | matter anyway; they just need to look pretty...
             | 
             | On a serious note, the systems being built now will
             | eventually be "correct enough most of the time" and that
             | will be good enough (read: cheaper than doing it any other
             | way).
        
             | orbital-decay wrote:
              | Probabilistic nature means nothing on its own. An LLM that
              | can solve your deterministic task will easily assign 100%
              | to the correct answer (or 99%; the noise floor can be
              | truncated with a sampler). If it doesn't do that and your
              | reply is unstable, it cannot solve the task confidently.
              | That happens to all LLMs on a sufficiently complex task,
              | but it's not related to their probabilistic nature.
              | 
              | Of course, that still doesn't mean you should do this. If
              | you want to maximize the model's performance, offload as
              | much distracting stuff as possible to the code.
        
       | koakuma-chan wrote:
       | > TL;DR: Giving LLMs the full output of tool calls is costly and
       | slow.
       | 
       | Is this true for all tool calls? Even if the tool returns little
       | data?
        
         | fullstackchris wrote:
          | From my experience it's about the speed of a very competent
          | human - one of my favorite custom tools I've written is just
          | access to a series of bash commands - I haven't tested with
          | other models, but Claude very quickly browses through files,
          | reads them, and so on to do whatever it was you prompted. But
          | even then it is all contextual - for example, I had to remove
          | 'find' because, as one would expect, running 'find' against a
          | huge directory set is very slow!
        
       | fullstackchris wrote:
        | This is exactly what I've encountered, at least with Claude: it
        | writes out huge artifacts (static ones retrieved from the file
        | system or wherever) character for character. What I'm going to
        | try this weekend is integrating a Redis cache or SQLite into
        | the MCP tool calls, so Claude doesn't have to write everything
        | out character by character... no idea if it will work as
        | expected...
        | 
        | I'm also looking into "fire and forget" tools, to see if that
        | is even possible.
        
         | mehdibl wrote:
          | You don't have to do full writes.
          | 
          | Use grep and line edits, and sequences instead of full files.
          | 
          | This way you can edit files with 50k lines of code without
          | issue, while Claude will blow up if you ever try to write out
          | such a file in full.
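          | 
          | Something like this as the write-side tool (a sketch, not any
          | particular MCP server's API): the model sends a line range
          | and the replacement, never the whole file.
          | 
          |     def edit_lines(path, start, end, replacement):
          |         """Replace lines start..end (1-indexed, inclusive)."""
          |         if replacement and not replacement.endswith("\n"):
          |             replacement += "\n"
          |         with open(path, encoding="utf-8") as f:
          |             lines = f.readlines()
          |         lines[start - 1:end] = replacement.splitlines(
          |             keepends=True)
          |         with open(path, "w", encoding="utf-8") as f:
          |             f.writelines(lines)
          | 
          |     # model output stays tiny, e.g.:
          |     # {"path": "app.py", "start": 120, "end": 124,
          |     #  "replacement": "def login(...):\n    ..."}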
        
       | mehdibl wrote:
        | The issue is not in function calls but in HOW the MCPs here got
        | designed and how you are using them.
        | 
        | Most MCPs are replicating APIs, returning blobs of data.
        | 
        | 1. This burns a lot of input context on JSON formatting and on
        | escaping JSON inside what is already JSON. 2. It contains a lot
        | of irrelevant information that you could save on.
        | 
        | So the issue is the MCP tool. It should instead flatten the
        | data as much as possible, since it goes back through JSON
        | encoding again, and if needed drop some fields.
        | 
        | So the MCP SaaS offerings here are mainly API gateways.
        | 
        | That brings this noise! And most of all, they are not
        | optimizing the MCPs.
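        | 
        | For example (a sketch with made-up Jira-ish field names), a
        | gateway can keep only what the model needs and flatten the rest
        | into one short line per record:
        | 
        |     KEEP = ("key", "fields.summary", "fields.status.name")
        | 
        |     def dig(obj, dotted):
        |         for part in dotted.split("."):
        |             if not isinstance(obj, dict):
        |                 return None
        |             obj = obj.get(part)
        |         return obj
        | 
        |     def flatten_issue(raw):
        |         return " | ".join(f"{k}={dig(raw, k)}" for k in KEEP)
        | 
        |     raw = {"key": "ENG-42",
        |            "fields": {"summary": "Fix login timeout",
        |                       "status": {"name": "In Progress"},
        |                       "custom_10451": {"blob": "..."}}}
        |     print(flatten_issue(raw))
        |     # key=ENG-42 | fields.summary=Fix login timeout | ...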
        
       | CSMastermind wrote:
       | LLMs clearly struggle when presented with JSON, especially large
       | amounts of it.
       | 
       | There's nothing stopping your endpoints from returning data in
       | some other format. LLMs actually seem to excel with XML for
       | instance. But you could just use a template to define some
       | narrative text.
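        | 
        | For instance (a sketch; tag names are illustrative), the same
        | records can be rendered as XML where the tags double as labels:
        | 
        |     import xml.etree.ElementTree as ET
        | 
        |     def to_xml(tickets):
        |         root = ET.Element("tickets")
        |         for t in tickets:
        |             node = ET.SubElement(root, "ticket",
        |                                  id=str(t["id"]))
        |             ET.SubElement(node, "title").text = t["title"]
        |             ET.SubElement(node, "status").text = t["status"]
        |         return ET.tostring(root, encoding="unicode")
        | 
        |     print(to_xml([{"id": 1, "title": "Login timeout",
        |                    "status": "open"}]))
        |     # <tickets><ticket id="1"><title>Login timeout</title>...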
        
         | ryoshu wrote:
         | I'm consistently surprised that people don't use XML for LLMs
         | as the default given XML comes with built-in semantic context.
         | Convert the XML to JSON output deterministically when you need
         | to feed it to other pipelines.
        
         | iJohnDoe wrote:
         | Any reason for this for my own learning? Was XML more prevalent
         | during training? Something better about XML that makes it
         | easier for the LLM to work with?
         | 
         | XML seems more text heavy, more tokens. However, maybe more
         | context helps?
        
       | bguberfain wrote:
        | I think there may be another solution to this: have the LLM
        | write valid code that calls the MCPs as functions. Think of it
        | like a Python script where each MCP is mapped to a function. A
        | simple example:
        | 
        |     def process(param1, param2):
        |         my_data = mcp_get_data(param1)
        |         sorted_data = mcp_sort(my_data, by=param2)
        |         return sorted_data
        
         | jngiam1 wrote:
         | Yes! If you want to see how this can work in practice, check
         | out https://lutra.ai ; we've been using a similar pattern
         | there. The challenge is making the code runtime work well for
         | it.
        
       | padjo wrote:
       | Sorry I've been out of the industry for the last year or so, is
       | this madness really what people are doing now?
        
         | _se wrote:
         | No, not most people. But some people are experimenting.
         | 
         | No one has found anything revolutionary yet, but there are some
         | useful applications to be sure.
        
       | norcalkc wrote:
       | > Allowing an execution environment to also access MCPs, tools,
       | and user data requires careful design to where API keys are
       | stored, and how tools are exposed.
       | 
        | If your tools are calling APIs on behalf of users, it's better
        | to use OAuth flows so that users of the app give explicit
        | consent to the APIs/scopes they want the tools to access. That
        | way, tools make calls with scoped tokens instead of
        | hard-to-manage, hard-to-maintain API keys (or even client
        | credentials).
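        | 
        | Roughly the difference (generic sketch; the endpoint is made
        | up): the tool receives a per-user, scope-limited access token
        | at call time instead of reading a shared long-lived key.
        | 
        |     import requests
        | 
        |     def list_files(user_access_token):
        |         # token came from an OAuth consent flow scoped to
        |         # something like "files.read" for this user only
        |         resp = requests.get(
        |             "https://api.example.com/v1/files",
        |             headers={
        |                 "Authorization": f"Bearer {user_access_token}"
        |             },
        |             timeout=10,
        |         )
        |         resp.raise_for_status()
        |         return resp.json()["files"]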
        
         | iandanforth wrote:
          | Do you know of any examples which use MCP and OAuth cleanly?
        
       | darkteflon wrote:
       | We've been using smolagents, which takes this approach, and are
       | impressed.
       | 
       | Slight tangent, but as a long term user of OpenAI models, I was
       | surprised at how well Claude Sonnet 3.7 through the desktop app
       | handles multi-hop problem solving using tools (over MCP). As long
       | as tool descriptions are good, it's quite capable of chaining and
       | "lateral thinking" without any customisation of the system or
       | user prompts.
       | 
       | For those of you using Sonnet over API: is this behaviour similar
       | there out of the box? If not, does simply pasting the recently
       | exfiltrated[1] "agentic" prompt into the API system prompt get
       | you (most of the way) there?
       | 
       | [1] https://news.ycombinator.com/item?id=43909409
        
         | 3abiton wrote:
         | How does it compare to MCP servers?
        
       | arjunchint wrote:
        | I am kind of confused: why can't you just create a new MCP tool
        | that encapsulates parsing and the other required steps together
        | in a code block?
        | 
        | Wouldn't this be more reliable than expecting the LLM to
        | generate working code 100% of the time?
        
         | Centigonal wrote:
         | You should for sure do this for common post processing tasks.
         | However, you're usually not going to know all the types of
         | post-processing users will want to do with tool call output at
         | design-time.
        
       | darkteflon wrote:
       | What are the current best options for sandboxed execution
       | environments? HuggingFace seems to have a tie-up with E2B,
       | although by default smolagents runs something ephemeral in-
       | process. I feel like there must be a good Docker container
       | solution to this that doesn't require signing up to yet another
       | SaaS. Any recommendations?
        
         | colonCapitalDee wrote:
         | Try gVisor
        
       | iLoveOncall wrote:
       | That's MCP for you.
       | 
       | MCP is literally just a wrapper around an API call, but because
       | it has some LLM buzz sprinkled on top, people expect it to do
       | some magic, when they wouldn't expect the same magic from the
       | underlying API.
        
       | stavros wrote:
       | I would really like to see output-aware LLM inference engines.
       | For example, imagine if the LLM output some tokens that meant
       | "I'm going to do a tool call now", and the inference engine (e.g.
       | llama.cpp) changed the grammar on the fly so the next token could
       | only be valid for the available tools.
       | 
       | Or, if I gave the LLM a list of my users and asked it to filter
       | based on some criteria, the grammar would change to only output
       | user IDs that existed in my list.
       | 
       | I don't know how useful this would be in practice, but at least
       | it would make it impossible for the LLM to hallucinate for these
       | cases.
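        | 
        | llama.cpp's GBNF grammars can already express the second case
        | if you rebuild the grammar per request (a sketch; the
        | surrounding sampling plumbing is omitted):
        | 
        |     user_ids = ["u_1042", "u_2210", "u_3377"]
        | 
        |     # each alternative is a quoted JSON string literal in GBNF
        |     alts = " | ".join('"\\"%s\\""' % uid for uid in user_ids)
        |     grammar = (
        |         'root ::= "[" id ("," id)* "]"\n'
        |         "id   ::= " + alts + "\n"
        |     )
        |     print(grammar)
        |     # root ::= "[" id ("," id)* "]"
        |     # id   ::= "\"u_1042\"" | "\"u_2210\"" | "\"u_3377\""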
        
       ___________________________________________________________________
       (page generated 2025-05-21 23:00 UTC)