[HN Gopher] LLM function calls don't scale; code orchestration i...
___________________________________________________________________
LLM function calls don't scale; code orchestration is simpler, more
effective
Author : jngiam1
Score : 132 points
Date : 2025-05-21 17:18 UTC (5 hours ago)
(HTM) web link (jngiam.bearblog.dev)
(TXT) w3m dump (jngiam.bearblog.dev)
| avereveard wrote:
| That's kind of the entire premise of huggingface smolagents, and
| while it works really well when it works, it also increases the
| challenges in rolling back failed actions.
|
| I guess one could in principle wrap the entire execution block
| into a distributed transaction, but LLMs try to make code that is
| robust, which works against this pattern because it makes
| failures harder to understand.
| jngiam1 wrote:
| Agree, the smolagent premise is good; but the hard part is
| handling execution, errors, etc.
|
| For example, when the code execution fails mid-way, we really
| want the model to be able to pick up from where it failed (with
| the states of the variables at the time of failure) and be able
| to continue from there.
|
| We've found that the LLM is able to generate correct code that
| picks up gracefully. The hard part now is building the runtime
| that makes that possible; we have something that works pretty
| well in many cases now in production at Lutra.
| hooverd wrote:
| Could you implement an actual state machine and have your
| agent work with that?
| avereveard wrote:
| That's the LangGraph idea. Each LangGraph node can then be a
| smolagent.
|
| Latency, though, would be unbearable for real time.
| avereveard wrote:
| I think in principle you can make the entire API exposed to
| the LLM idempotent, so that it becomes irrelevant to the
| backend whether the LLM replays the whole action or just the
| failed steps.
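| Something like this rough sketch of an idempotency-key wrapper
| (the names and the in-memory dedup store are made up for
| illustration; in practice this would live in the backend):
|
|     import hashlib
|     import json
|
|     seen = {}   # idempotency key -> cached result
|
|     def idempotent_call(step_id, tool_name, args, do_call):
|         # Same step + tool + args -> same key, so replaying the
|         # whole plan after a failure does not repeat side effects.
|         payload = json.dumps([step_id, tool_name, args],
|                              sort_keys=True)
|         key = hashlib.sha256(payload.encode()).hexdigest()
|         if key not in seen:
|             seen[key] = do_call(tool_name, args)
|         return seen[key]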
| jngiam1 wrote:
| That'd work well for read-only APIs, but we also want the
| LLMs to be able to update data, create documents, etc.
| Feels a bit harder when there are side-effects.
| abelanger wrote:
| > Most execution environments are stateful (e.g., they may rely
| on running Jupyter kernels for each user session). This is hard
| to manage and expensive if users expect to be able to come back
| to AI task sessions later. A stateless-but-persistent execution
| environment is paramount for long running (multi-day) task
| sessions.
|
| It's interesting how architectural patterns built at large tech
| companies (for completely different use-cases than AI) have
| become so relevant to the AI execution space.
|
| You see a lot of AI startups learning the hard way the value of
| event sourcing and (eventually) durable execution, but these
| patterns aren't commonly adopted on Day 1. I blame the AI
| frameworks.
|
| (disclaimer - currently working on a durable execution platform)
| th0ma5 wrote:
| I see all of this as a constant negotiation of what is and
| isn't needed out of traditional computing. Eventually they find
| that what they want from any of it is determinism,
| unfortunately for LLMs.
| visarga wrote:
| Maybe we just need models that can reference spans by start:end
| range. Then they can pass arguments by reference instead of
| explicit quotation. We can use these spans as answers in
| extractive QA tasks, or as arguments for a code block, or to
| construct a graph from pointers, and do graph computation. If we
| define a "hide span" operation the LLM could dynamically open and
| close its context, which could lead to context size reduction.
| Basically - add explicit indexing to context memory and make it
| powerful, and the LLM can act like a CPU.
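| As a rough host-side illustration (the bookkeeping below is
| made up; the real change would be in the model itself):
|
|     context = "...long document held in the model's context..."
|
|     def resolve(span):
|         # A span is just (start, end) character offsets.
|         start, end = span
|         return context[start:end]
|
|     def call_tool(tool, arg_spans):
|         # Arguments are passed by reference and dereferenced only
|         # at execution time, never quoted back out by the model.
|         return tool(*[resolve(s) for s in arg_spans])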
| hintymad wrote:
| I feel that the optimal solution is hybrid, not polarized. That
| is, we use a deterministic approach as much as we can, but
| leverage LLMs to handle the remaining complex parts that are
| hard to spec out or describe deterministically.
| jngiam1 wrote:
| Yes - in particular, I think one interesting angle is to use the
| LLM to generate deterministic approaches (code). And then, if
| the code works, save it for future use, and it becomes
| deterministic moving forward.
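| A rough sketch of that promotion step (the cache and helper
| names are made up; keying on the exact task text is
| deliberately naive):
|
|     import hashlib
|
|     code_cache = {}   # task signature -> source that ran cleanly
|
|     def run_task(task, generate_code, execute):
|         key = hashlib.sha256(task.encode()).hexdigest()
|         source = code_cache.get(key)
|         if source is None:
|             source = generate_code(task)   # ask the LLM on a miss
|         result = execute(source)           # sandboxed execution
|         code_cache[key] = source           # keep it once it works
|         return result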
| hintymad wrote:
| Yes, and the other way around: use the deterministic methods
| to generate the best possible input to LLM.
| seunosewa wrote:
| Can you give an example so we can visualise this?
| hintymad wrote:
| For instance, in an AIOps project we still run a number of
| time series algorithms and then feed the results, along with
| the original time series data, to the LLM. The LLM produces
| much more relevant and in-depth analysis than it does with the
| raw data alone as input.
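| A minimal sketch of that pattern (the `llm_complete` callable is
| a placeholder for whatever model client you use):
|
|     import statistics
|
|     def detect_anomalies(series, z_threshold=3.0):
|         mean = statistics.fmean(series)
|         stdev = statistics.pstdev(series) or 1.0
|         return [i for i, x in enumerate(series)
|                 if abs(x - mean) / stdev > z_threshold]
|
|     def analyze(series, llm_complete):
|         # Deterministic analysis first, then both the findings
|         # and the raw data go into the prompt.
|         anomalies = detect_anomalies(series)
|         prompt = (f"Anomalous indices (z > 3): {anomalies}\n"
|                   f"Raw series: {series}\n"
|                   "Explain the likely cause and severity.")
|         return llm_complete(prompt)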
| nowittyusername wrote:
| I agree. You want to use as little LLM as possible in your
| workflows.
| mort96 wrote:
| I've been developing software for decades without LLMs, turns
| out you can get away with very little!
| obiefernandez wrote:
| My team at Shopify just open sourced Roast [1] recently. It lets
| us embed non-deterministic LLM jobs within orchestrated
| workflows. Essential when trying to automate work on codebases
| with millions of lines of code.
|
| [1] https://github.com/shopify/roast
| drewda wrote:
| Nice to see Ruby continuing to exist and deliver... even in the
| age of "AI"
| TheTaytay wrote:
| Wow - Roast looks fantastic. You architected and put names and
| constraints on some things that I've been wrestling with for a
| while. I really like how you are blending the determinism and
| non-determinism. (One thing that is not obvious to me after
| reading the README a couple of times (quickly), is whether/how
| the LLM can orchestrate multiple tool calls if necessary and
| make decisions about which tools to call in which order. It
| seems like it does when you tell it to refactor, but I couldn't
| tell if this would be suitable for the task of "improve, then
| run tests. Repeat until done.")
| codyb wrote:
| I'm slightly confused as to why you'd use an LLM to sort
| structured data in the first place?
| jngiam1 wrote:
| The goal is to do more complex data processing, like build
| dashboards, agentically figure out which tickets are stalled,
| do a quarterly review of things done, etc. Sorting is a tiny
| task within the bigger ones, but hopefully it exemplifies the
| problem more easily.
| kikimora wrote:
| I don't understand how this can work. Given the probabilistic
| nature of LLMs, the more steps you have, the more chances
| something goes off. What good is the dashboard if you cannot be
| sure it was not partially hallucinated?
| staunton wrote:
| > What good is the dashboard if you cannot be sure it was not
| partially hallucinated?
|
| A lot of the time the dashboard contents don't actually matter
| anyway; they just need to look pretty...
|
| On a serious note, the systems being built now will
| eventually be "correct enough most of the time" and that
| will be good enough (read: cheaper than doing it any other
| way).
| orbital-decay wrote:
| The probabilistic nature means nothing on its own. An LLM that
| can solve your deterministic task will easily assign 100% to
| the correct answer (or 99%; the noise floor can be truncated
| with a sampler). If it doesn't do that and its reply is
| unstable, it cannot solve the task confidently. That happens to
| all LLMs on a sufficiently complex task, but it's not related
| to their probabilistic nature.
|
| Of course, that still doesn't mean that you should do that.
| If you want to maximize the model's performance, offload as
| much distracting stuff as possible to the code.
| koakuma-chan wrote:
| > TL;DR: Giving LLMs the full output of tool calls is costly and
| slow.
|
| Is this true for all tool calls? Even if the tool returns little
| data?
| fullstackchris wrote:
| From my experience it's about the speed of a very competent
| human - one of my favorite custom tools I've written is just
| access to a series of bash commands - haven't tested with other
| models, but Claude very quickly browses through files, reads
| them, and so on to do whatever it was you prompted. But even
| then it is all contextual - for example, I had to remove 'find'
| because, as one would expect, running 'find' against a huge
| directory set is very slow!
| fullstackchris wrote:
| This is exactly what I've encountered, at least with Claude: it
| writes out huge artifacts (static ones retrieved from the file
| system or wherever) character for character. What I'm going to
| try this weekend is integrating a Redis cache or SQLite into the
| MCP tool calls, so Claude doesn't have to write everything out
| character by character... no idea if it will work as expected...
|
| Also looking into "fire and forget" tools, to see if that is
| even possible.
| mehdibl wrote:
| You don't have to use full writes.
|
| Use grep and line-range edits, and sequences instead of full
| files.
|
| This way you can edit files with 50k LOC without issue, while
| Claude will blow up if you ever try to write out such a file.
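| A rough sketch of what a line-range edit tool can look like
| (the function is made up, not any specific MCP server's API):
|
|     def edit_lines(path, start, end, new_lines):
|         """Replace lines start..end (1-indexed, inclusive)."""
|         with open(path) as f:
|             lines = f.readlines()
|         lines[start - 1:end] = [l if l.endswith("\n") else l + "\n"
|                                 for l in new_lines]
|         with open(path, "w") as f:
|             f.writelines(lines)
|
| The model then only emits the replacement lines, not the whole
| file.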
| mehdibl wrote:
| The issue is not in the function calls, but in HOW the MCP got
| designed here and how you are using it.
|
| Most MCPs just replicate an API, returning blobs of data.
|
| 1. This uses a lot of input context, formatting everything as
| JSON and escaping JSON inside what is already JSON.
|
| 2. This contains a lot of irrelevant information that you could
| save on.
|
| So the issue is the MCP tool. It should instead flatten the data
| as much as possible, since it goes back through JSON encoding
| again, and remove some fields if needed.
|
| So the MCP SaaS offerings here are mainly API gateways.
|
| That brings this noise! And most of all, they are not optimizing
| the MCPs.
| CSMastermind wrote:
| LLMs clearly struggle when presented with JSON, especially large
| amounts of it.
|
| There's nothing stopping your endpoints from returning data in
| some other format. LLMs actually seem to excel with XML for
| instance. But you could just use a template to define some
| narrative text.
| ryoshu wrote:
| I'm consistently surprised that people don't use XML for LLMs
| as the default given XML comes with built-in semantic context.
| Convert the XML to JSON output deterministically when you need
| to feed it to other pipelines.
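| A rough sketch of that round trip (xmltodict is just one
| option; any XML library would do):
|
|     import json
|     import xmltodict
|
|     record = {"ticket": {"id": "42", "status": "stalled"}}
|
|     # Hand the model XML, with tags carrying the semantics...
|     xml_for_llm = xmltodict.unparse(record, pretty=True)
|
|     # ...and convert back deterministically for pipelines.
|     back_to_json = json.dumps(xmltodict.parse(xml_for_llm))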
| iJohnDoe wrote:
| Any reason for this for my own learning? Was XML more prevalent
| during training? Something better about XML that makes it
| easier for the LLM to work with?
|
| XML seems more text heavy, more tokens. However, maybe more
| context helps?
| bguberfain wrote:
| I think there may be another solution for this: have the LLM
| write valid code that calls the MCPs as functions. Think of it
| like a Python script, where each MCP is mapped to a function. A
| simple example:
|
|     def process(param1, param2):
|         my_data = mcp_get_data(param1)
|         sorted_data = mcp_sort(my_data, by=param2)
|         return sorted_data
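| The mapping itself could be a thin shim along these lines
| (assuming some `mcp_client` handle; the names are made up):
|
|     def make_tool_fn(client, tool_name):
|         def fn(*args, **kwargs):
|             return client.call_tool(tool_name, *args, **kwargs)
|         return fn
|
|     mcp_get_data = make_tool_fn(mcp_client, "get_data")
|     mcp_sort = make_tool_fn(mcp_client, "sort")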
| jngiam1 wrote:
| Yes! If you want to see how this can work in practice, check
| out https://lutra.ai ; we've been using a similar pattern
| there. The challenge is making the code runtime work well for
| it.
| padjo wrote:
| Sorry I've been out of the industry for the last year or so, is
| this madness really what people are doing now?
| _se wrote:
| No, not most people. But some people are experimenting.
|
| No one has found anything revolutionary yet, but there are some
| useful applications to be sure.
| norcalkc wrote:
| > Allowing an execution environment to also access MCPs, tools,
| and user data requires careful design to where API keys are
| stored, and how tools are exposed.
|
| If your tools are calling APIs on behalf of users, it's better
| to use OAuth flows so users of the app can give explicit
| consent to the APIs/scopes they want the tools to access. That
| way, tools make calls with scoped tokens instead of
| hard-to-manage, hard-to-maintain API keys (or even client
| credentials).
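| A rough sketch of what the tool side can look like (the
| endpoint and scope names here are made up; the flow is a
| standard refresh-token grant):
|
|     import requests
|
|     def get_scoped_token(auth_server, client_id, client_secret,
|                          refresh_token, scope):
|         # Exchange the user's grant for a short-lived token
|         # limited to the scope this tool actually needs.
|         resp = requests.post(f"{auth_server}/token", data={
|             "grant_type": "refresh_token",
|             "refresh_token": refresh_token,
|             "client_id": client_id,
|             "client_secret": client_secret,
|             "scope": scope,              # e.g. "tickets.read"
|         })
|         resp.raise_for_status()
|         return resp.json()["access_token"]
|
|     def call_api(base_url, token, path):
|         headers = {"Authorization": f"Bearer {token}"}
|         return requests.get(f"{base_url}{path}",
|                             headers=headers).json()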
| iandanforth wrote:
| Do you know of any examples which use MCP and oauth cleanly?
| darkteflon wrote:
| We've been using smolagents, which takes this approach, and are
| impressed.
|
| Slight tangent, but as a long term user of OpenAI models, I was
| surprised at how well Claude Sonnet 3.7 through the desktop app
| handles multi-hop problem solving using tools (over MCP). As long
| as tool descriptions are good, it's quite capable of chaining and
| "lateral thinking" without any customisation of the system or
| user prompts.
|
| For those of you using Sonnet over API: is this behaviour similar
| there out of the box? If not, does simply pasting the recently
| exfiltrated[1] "agentic" prompt into the API system prompt get
| you (most of the way) there?
|
| [1] https://news.ycombinator.com/item?id=43909409
| 3abiton wrote:
| How does it compare to MCP servers?
| arjunchint wrote:
| I am kind of confused: why can't you just create a new MCP tool
| that encapsulates parsing and the other required steps together
| in a code block?
|
| Wouldn't that be more reliable than expecting the LLM to
| generate working code 100% of the time?
| Centigonal wrote:
| You should for sure do this for common post processing tasks.
| However, you're usually not going to know all the types of
| post-processing users will want to do with tool call output at
| design-time.
| darkteflon wrote:
| What are the current best options for sandboxed execution
| environments? HuggingFace seems to have a tie-up with E2B,
| although by default smolagents runs something ephemeral in-
| process. I feel like there must be a good Docker container
| solution to this that doesn't require signing up to yet another
| SaaS. Any recommendations?
| colonCapitalDee wrote:
| Try gVisor
| iLoveOncall wrote:
| That's MCP for you.
|
| MCP is literally just a wrapper around an API call, but because
| it has some LLM buzz sprinkled on top, people expect it to do
| some magic, when they wouldn't expect the same magic from the
| underlying API.
| stavros wrote:
| I would really like to see output-aware LLM inference engines.
| For example, imagine if the LLM output some tokens that meant
| "I'm going to do a tool call now", and the inference engine (e.g.
| llama.cpp) changed the grammar on the fly so the next token could
| only be valid for the available tools.
|
| Or, if I gave the LLM a list of my users and asked it to filter
| based on some criteria, the grammar would change to only output
| user IDs that existed in my list.
|
| I don't know how useful this would be in practice, but at least
| it would make it impossible for the LLM to hallucinate for these
| cases.
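| A rough sketch of the second idea using llama-cpp-python's GBNF
| grammar support (the model path, prompt, and IDs are
| placeholders):
|
|     from llama_cpp import Llama, LlamaGrammar
|
|     user_ids = ["u_1842", "u_2071", "u_3308"]
|
|     # Grammar whose only valid outputs are known user IDs,
|     # comma-separated, so hallucinated IDs cannot be emitted.
|     alternatives = " | ".join(f'"{uid}"' for uid in user_ids)
|     grammar = LlamaGrammar.from_string(
|         f'root ::= id ("," id)*\nid ::= {alternatives}\n')
|
|     llm = Llama(model_path="model.gguf")
|     out = llm("Which of these users are inactive? "
|               + ", ".join(user_ids),
|               grammar=grammar, max_tokens=64)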
___________________________________________________________________
(page generated 2025-05-21 23:00 UTC)