[HN Gopher] Show HN: Muscle-Mem, a behavior cache for AI agents
___________________________________________________________________
Show HN: Muscle-Mem, a behavior cache for AI agents
Hi HN! Erik here from Pig.dev, and today I'd like to share a new
project we've just open sourced:

Muscle Mem is an SDK that records your agent's tool-calling
patterns as it solves tasks, and will deterministically replay
those learned trajectories whenever the task is encountered again,
falling back to agent mode if edge cases are detected. Like a JIT
compiler, for behaviors.

At Pig, we built computer-use agents for automating legacy Windows
applications (healthcare, lending, manufacturing, etc). A recurring
theme we ran into was that businesses _already_ had RPA
(pure-software scripts), and it worked for them in most cases. The
pull to agents as an RPA alternative was _not_ to have infinitely
flexible "AI Employees" as tech Twitter/X may want you to think,
but simply because their RPA breaks under occasional edge cases and
agents can gracefully handle those cases.

Using a pure-agent approach proved to be highly wasteful. Windows'
accessibility APIs are poor, so you're generally stuck using
pure-vision agents, which can run around $40/hr in token costs and
take 5x longer than a human to perform a workflow. At this point,
you're better off hiring a human.

The goal of Muscle-Mem is to get LLMs out of the hot path of
repetitive automations, intelligently swapping between script-based
execution for repeat cases and agent-based automation for discovery
and self-healing.

While inspired by computer-use environments, Muscle Mem is designed
to generalize to any automation performing discrete tasks in
dynamic environments. It took a great deal of thought to figure out
an API that generalizes, which I cover more deeply in this blog:
https://erikdunteman.com/blog/muscle-mem/

Check out the repo, consider giving it a star, or dive deeper into
the above blog. I look forward to your feedback!
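
For a sense of the API's shape, here is a rough sketch built around
the Engine and Check concepts described above. Treat it as
illustrative; exact names and signatures live in the repo.

    # Illustrative sketch of the Engine/Check idea; see the repo
    # for the SDK's actual API.
    from muscle_mem import Check, Engine

    engine = Engine()

    # A Check pairs a "capture" callback (snapshot the environment)
    # with a "compare" callback (is the cached step still valid?).
    def capture(window_title: str) -> dict:
        return {"title": window_title}

    def compare(current: dict, cached: dict) -> bool:
        return current["title"] == cached["title"]

    @engine.tool(pre_check=Check(capture, compare))
    def click_submit(window_title: str):
        ...  # drive the UI here

    def agent(task: str):
        click_submit("Billing")  # tool calls recorded as trajectory

    engine.register(agent)
    engine("submit billing form")  # miss -> agent; hit -> replay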
Author : edunteman
Score : 108 points
Date : 2025-05-14 19:38 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dbish wrote:
| Do you see these trajectories being used to fine-tune a model
| automatically in some way, rather than just replayed, so that
| similar workflows might be improved too?
| edunteman wrote:
| I believe explicit trajectories for learned behavior are
| significantly easier for humans to grok and debug, in contrast
| to reinforcement learning methods like deep Q-learning, so
| avoiding the use of models is ideal, but I imagine they'll have
| their place.
|
| For what that may look like, I'll reuse a brainstorm on this
| topic that a friend gave me recently:
|
| "Instead of relying on an LLM to understand where to click, the
| click area itself is the token. And the click is a token and
| the objective is a token and the output is whatever. Such that,
| click paths aren't "stored", they're embedded within the
| training of the LAM/LLM"
|
| Whatever it ends up looking like, as long as it gets the job
| done and remains debuggable and extensible enough to not
| immediately eject a user once they hit any level of complexity,
| I'd be happy for it to be a part of Muscle Mem.
| web-cowboy wrote:
| This seems like a much more powerful version of what I wanted MCP
| "prompts" to be - and I'm curious to know if I have that right.
|
| For me, I want to reduce the friction on repeatable tasks. For
| example, I often need to create a new GraphQL query, which also
| requires updating the example query collection, creating a basic
| new integration test, etc. If I had a MCP-accessible prompt, I
| hoped the agent would realize I have a set of instructions on how
| to handle this request when I make it.
| edunteman wrote:
| In a way, a Muscle Mem trajectory is just a new "meta tool"
| that combines sequential use of other tools, with parameters
| that flow through it all.
|
| One form factor I toyed with was the idea of a dynamically
| evolving list of tool specs given to a model or presented from
| an MCP server, but I wasn't thrilled by:
|
| - that'd still require a model in the loop to choose the tool,
| albeit just once for the whole trajectory vs every step
|
| - it misses the sneaky challenge of Muscle Memory systems,
| which is continuous cache validation. An environment can change
| unexpectedly mid-trajectory and a system would need to adapt,
| so no matter what, it needs something that looks like Muscle
| Mem's Check abstraction for pre/post-step cache validation.
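|
| As a loose sketch of that continuous validation during replay
| (illustrative pseudologic, not Muscle Mem's actual internals):
|
|     def replay(trajectory, env, agent):
|         for step in trajectory:
|             snapshot = step.check.capture(env)
|             if not step.check.compare(snapshot, step.cached_snapshot):
|                 # environment drifted mid-trajectory: eject to agent
|                 return agent.resume(env, remaining=trajectory)
|             step.execute(env)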
| ivanovm wrote:
| would just caching llm responses work here?
| huevosabio wrote:
| I love the idea!
|
| I think the problem will be defining whether there is a cache
| hit or not, since "agents" are loosely defined and the tasks can
| include basically anything.
| edunteman wrote:
| I agree, Cache Validation is the singular concern of Muscle
| Mem.
|
| If you boil it down, for a generic enough task and environment,
| the engine is just a database of previous environments and a
| user-provided filter function for cache validation
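|
| Boiled down to code, that lookup might look like the following
| (a paraphrase of the idea, not the engine's real implementation):
|
|     def find_cached_trajectory(task, current_env, store, is_valid):
|         for record in store.lookup(task):
|             if all(is_valid(current_env, snap)
|                    for snap in record.snapshots):
|                 return record.trajectory
|         return None  # cache miss: fall back to the agent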
| joshstrange wrote:
| This is a neat idea and is similar to something I've been turning
| over in my head. LLMs are very powerful for taking a bunch of
| disparate tools/information/etc and generating good results but
| the speed is a big issue as well as reproducibility.
|
| I keep imagining an Agent that writes a bunch of custom tools
| when it needs them and "saves" them for later use, creating
| pipelines in code/config that it can reuse instead of solving
| from zero each time.
|
| Essentially, I want to use LLMs for what they are good for (edge
| cases, fuzzy instructions/data) and have them turn around and
| write reusable tools, so the next time we don't have to run the
| full LLM; a tiny LLM router up front can determine whether a
| tool to do this already exists. I'm not talking about MCP
| (though that is cool); this would use MCP tools but could make
| new ones from the existing ones.
|
| Here is an example.
|
| Imagine I have an Agent with MCP tools to read/write to my email,
| calendar, ticketing system, and slack. I can ask the LLM to slack
| me every morning with an overview of my events for the day and
| anything outstanding I need to address. Maybe the first pass uses
| a frontier model to determine which tools to use and it
| accomplishes this task. Once I'm happy with the output then the
| Agent feeds the conversation/tool calls into another LLM to
| distill it to a Python/Node/Bash/whatever script. That script
| would call the MCP tools to do the same thing and use small LLMs
| to glue the results together and then it creates a cron (or
| similar) entry to have that run every morning.
|
| I feel like this would remove a ton of the uncertainty when it
| comes to which tools an LLM uses without requiring humans to
| write custom flows with limited tools available for each task.
|
| So the first pass would be:
|
|     User: Please check my email, calendar, and slack for what I
|     need to focus on today.
|     LLM: Tool Call: Read unread email
|     LLM: Tool Call: Read last 7 days of emails the user replied to
|     LLM: Tool Call: Read this week's events from calendar
|     LLM: Tool Call: Read unread slack messages
|     LLM: Tool Call: Read tickets in this sprint
|     LLM: Tool Call: Read unread comments on tickets assigned to me
|     LLM: Tool Call: Read slack conversations from yesterday
|     LLM: Please use the following data to determine what the user
|     needs to focus on today: <Inject context from tool calls>
|     LLM: It looks like you have 3 meetings today at.....
|
| Then a fresh LLM reviews that and writes a script to do all the
| tool calls and jump to the last "Please use the following data"
| prompt which can be reused (cron'd or just called when it makes
| sense).
|
| I might be way off-base and I don't work in the space (I just
| play around the edges) but this feels like a way to let agents
| "learn" and grow. I've just found that in practice you don't get
| good results from throwing all your tools at one big LLM with
| your prompt; you're better off limiting the tools and even
| creating compound tools for certain jobs you do over and over.
| I've found that lots of little tool calls add up and take a long
| time, so a way for the agent to dynamically create tools by
| combining other tools seems like a huge win.
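|
| A sketch of what that distilled script could look like, where
| every client, tool name, and model below is a hypothetical
| stand-in:
|
|     from my_mcp_client import call_tool  # hypothetical MCP client
|     from my_llm import complete          # hypothetical LLM wrapper
|
|     def morning_overview():
|         context = {
|             "unread_email": call_tool("email.read_unread"),
|             "replied": call_tool("email.read_replied", days=7),
|             "calendar": call_tool("calendar.read_week"),
|             "slack": call_tool("slack.read_unread"),
|             "tickets": call_tool("tickets.read_sprint"),
|         }
|         summary = complete(
|             model="small-cheap-model",
|             prompt="Use the following data to determine what the "
|                    f"user needs to focus on today: {context}",
|         )
|         call_tool("slack.post_dm", text=summary)
|
|     if __name__ == "__main__":
|         morning_overview()  # wired up to cron each morning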
| DrNosferatu wrote:
| Wrote something similar into my rules - obtained mixed results.
|
| Curious how effective this is.
| hackgician wrote:
| accessibility (a11y) trees are super helpful for LLMs; we use
| them extensively in stagehand! the context is nice for browsers,
| since you have existing frameworks like
| selenium/playwright/puppeteer for actually acting on nodes in the
| a11y tree.
|
| what does that analog look like in more traditional computer use?
| ctoth wrote:
| There are a variety of accessibility frameworks, from MSAA
| (old, Windows-only), IA2, and JAB, to UIA (newer). NVDA from NV
| Access has an abstraction over these APIs to standardize
| gathering roles and other information across the matrix of a11y
| providers, though note the GPL license, depending on how you
| want to use it.
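|
| For a concrete taste, pywinauto's UIA backend can walk a
| window's control tree, roughly analogous to a browser a11y tree
| dump (a sketch; the window title is illustrative, and this only
| runs on Windows):
|
|     from pywinauto import Desktop
|
|     # Enumerate controls of a running app via UI Automation
|     window = Desktop(backend="uia").window(title_re=".*Notepad.*")
|     for ctrl in window.descendants():
|         info = ctrl.element_info
|         print(info.control_type, repr(info.name))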
| lherron wrote:
| Feels kinda like JIT compiling your agent prompts into code.
| Awesome concept, hope it pans out.
| dmos62 wrote:
| I love the minimal approach and general-use focus.
|
| If I understand correctly, the engine caches trajectories in the
| simplest way possible, so if you have a cached trajectory a-b-c,
| and you encounter c-b-d, there's no way to get a "partial" cache
| hit, right? As I'm wrapping my head around this, I'm thinking
| that the engine would have to be a great deal more complicated to
| be able to judge when it's a safe partial hit.
|
| Basically, I'm trying to imagine how applicable this approach
| could be to a significantly noisier environment.
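|
| One speculative shape for a partial hit (not something the
| engine does today, as far as I can tell): replay the longest
| validated prefix, then hand off to the agent.
|
|     def longest_safe_prefix(cached_steps, current_env, is_valid):
|         prefix = []
|         for step in cached_steps:
|             if not is_valid(current_env, step):
|                 break
|             prefix.append(step)
|         return prefix  # replay these; agent takes over afterward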
| nico wrote:
| Very cool concept!
|
| I wish v0, lovable, bolt et al did something like this with their
| suggested prompts
|
| It's such a poor user experience to pick a template, wait 3-5min
| to generate and then be dumped randomly either on an incredibly
| refined prototype, or a ton of code that just errors out. And in
| either case, having no clue what to do next
| deepdarkforest wrote:
| Not sure if this can work. We played around with something
| similar for computer use too, but comparing embeddings to
| cache-validate the starting position is a super gray area, with
| no clear threshold.
| For example, the datetime on the bottom right changes. Or if it's
| an app with a database etc, it can change the embeddings
| arbitrarily. Also, you must do this in every step, because as you
| said, things might break at any point. I just don't see how you
| can reliably validate. If anything, if models are cheap, you
| could use another cheaper llm call to compare screenshots, or
| adjust the playwright/api script on the fly. We ended up writing
| up a quite different approach that worked surprisingly well.
|
| There are definitely a lot of potential solutions, I'm curious
| where this goes. IMO an embeddings approach won't be enough. I'm
| more than happy to discuss what we did internally to achieve a
| decent success rate, though; the space is super promising for
| sure.
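|
| For reference, the brittle check being described is roughly the
| following, with embed() standing in for whatever image-embedding
| model you use and the threshold hand-tuned (note it blanks the
| input arrays in place):
|
|     import numpy as np
|
|     def screenshots_match(img_a, img_b, embed, dynamic_regions,
|                           threshold=0.98):
|         # blank out regions known to change (e.g. the clock)
|         for (x1, y1, x2, y2) in dynamic_regions:
|             img_a[y1:y2, x1:x2] = 0
|             img_b[y1:y2, x1:x2] = 0
|         a, b = embed(img_a), embed(img_b)
|         sim = float(np.dot(a, b) /
|                     (np.linalg.norm(a) * np.linalg.norm(b)))
|         return sim >= threshold  # "no clear threshold" lives here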
| arathis wrote:
| Hey, working on a personal project. Would love to dig into how
| you approached this.
| allmathl wrote:
| > At Pig, we built computer-use agents for automating legacy
| Windows applications (healthcare, lending, manufacturing, etc).
|
| How do you justify this vs fixing the software to enable
| scripting? That seems both cheaper and easier to achieve and with
| far higher yields. Assume market rate servicing of course.
|
| Plus, how do you force an "agent" to correct its behavior?
| nawgz wrote:
| Sorry, am I missing something? They obviously do not control
| source for these applications, but are able to gain access to
| whatever benefit the software originally had - reporting,
| tracking, API, whatever - by automating data entry tasks with
| AI.
|
| Legacy software is frequently useful but difficult to install
| and access.
| allmathl wrote:
| Deleted previous comment; made a new one.
|
| Oh, you're correct. This business is built on sand.
| primax wrote:
| I think if you got some experience in the industries this
| serves then you'd reconsider your opinion
| allmathl wrote:
| I don't see how improving the source could be a detriment to
| anyone. The rest is just relationship details. The decent
| part about relationship details is imprisonment.
| Centigonal wrote:
| As nawgz said, the applications they are automating are
| often closed-source binary blobs. They _can't_ enable
| scripting without some kind of RPA or AI agent.
___________________________________________________________________
(page generated 2025-05-14 23:00 UTC)