[HN Gopher] Show HN: Muscle-Mem, a behavior cache for AI agents
       ___________________________________________________________________
        
       Show HN: Muscle-Mem, a behavior cache for AI agents
        
       Hi HN! Erik here from Pig.dev, and today I'd like to share a new
       project we've just open sourced: Muscle Mem is an SDK that records
       your agent's tool-calling patterns as it solves tasks, and will
       deterministically replay those learned trajectories whenever the
       task is encountered again, falling back to agent mode if edge cases
       are detected. Like a JIT compiler, for behaviors.

       At Pig, we built computer-use agents for automating legacy Windows
       applications (healthcare, lending, manufacturing, etc.). A
       recurring theme we ran into was that businesses _already_ had RPA
       (pure-software scripts), and it worked for them in most cases. The
       pull toward agents as an RPA alternative was _not_ to get the
       infinitely flexible "AI Employee" that tech Twitter/X would have
       you believe in, but simply because their RPA breaks on occasional
       edge cases, and agents can gracefully handle those cases.

       A pure-agent approach proved highly wasteful. Windows'
       accessibility APIs are poor, so you're generally stuck using
       pure-vision agents, which can run around $40/hr in token costs and
       take 5x longer than a human to perform a workflow. At that point,
       you're better off hiring a human.

       The goal of Muscle-Mem is to get LLMs out of the hot path of
       repetitive automations, intelligently swapping between script-based
       execution for repeat cases and agent-based automation for discovery
       and self-healing.

       While inspired by computer-use environments, Muscle Mem is designed
       to generalize to any automation performing discrete tasks in
       dynamic environments. It took a great deal of thought to figure out
       an API that generalizes, which I cover more deeply in this blog
       post: https://erikdunteman.com/blog/muscle-mem/

       Check out the repo, consider giving it a star, or dive deeper into
       the blog post above. I look forward to your feedback!
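
       For a concrete feel, here is a minimal sketch of the record/replay
       loop described above. All names are assumptions for illustration,
       not the actual Muscle-Mem API:

           # Illustrative behavior cache: replay a recorded trajectory,
           # falling back to the agent when a validation check fails.
           from dataclasses import dataclass
           from typing import Any, Callable

           @dataclass
           class Step:
               tool: Callable[..., Any]  # the tool to call
               args: tuple               # the arguments it was recorded with

           class BehaviorCache:
               def __init__(self) -> None:
                   self.trajectories: dict[str, list[Step]] = {}

               def run(self, task: str,
                       agent: Callable[[str], list[Step]],
                       env_ok: Callable[[], bool]) -> None:
                   for step in self.trajectories.get(task, []):
                       if not env_ok():  # edge case detected: bail to agent
                           break
                       step.tool(*step.args)
                   else:
                       if task in self.trajectories:
                           return  # full deterministic replay, no LLM used
                   # cache miss or drift: run the agent, record its calls
                   self.trajectories[task] = agent(task)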
        
       Author : edunteman
       Score  : 108 points
       Date   : 2025-05-14 19:38 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dbish wrote:
       | Do you see these trajectories being used to fine-tune a model
       | automatically in some way, rather than just replayed, so that
       | similar workflows might be improved too?
        
         | edunteman wrote:
         | I believe explicit trajectories for learned behavior are
         | significantly easier for humans to grok and debug than
         | reinforcement learning methods like deep Q-learning, so
         | avoiding the use of models is ideal, but I imagine they'll
         | have their place.
         | 
         | For what that may look like, I'll reuse a brainstorm on this
         | topic that a friend gave me recently:
         | 
         | "Instead of relying on an LLM to understand where to click, the
         | click area itself is the token. And the click is a token and
         | the objective is a token and the output is whatever. Such that,
         | click paths aren't "stored", they're embedded within the
         | training of the LAM/LLM"
         | 
         | Whatever it ends up looking like, as long as it gets the job
         | done and remains debuggable and extensible enough to not
         | immediately eject a user once they hit any level of complexity,
         | I'd be happy for it to be a part of Muscle Mem.
        
       | web-cowboy wrote:
       | This seems like a much more powerful version of what I wanted MCP
       | "prompts" to be - and I'm curious to know if I have that right.
       | 
       | For me, I want to reduce the friction on repeatable tasks. For
       | example, I often need to create a new GraphQL query, which also
       | requires updating the example query collection, creating a basic
       | new integration test, etc. My hope was that, with an
       | MCP-accessible prompt, the agent would realize I have a set of
       | instructions on how to handle this request when I make it.
        
         | edunteman wrote:
         | In a way, a Muscle Mem trajectory is just a new "meta tool"
         | that combines sequential use of other tools, with parameters
         | that flow through it all.
         | 
         | One form factor I toyed with was the idea of a dynamically
         | evolving list of tool specs given to a model or presented from
         | an MCP server, but I wasn't thrilled by:
         | 
         | - that'd still require a model in the loop to choose the tool,
         | albeit just once for the whole trajectory vs every step
         | 
         | - it misses the sneaky challenge of muscle memory systems,
         | which is continuous cache validation. An environment can
         | change unexpectedly mid-trajectory and a system would need to
         | adapt, so no matter what, it needs something like Muscle Mem's
         | Check abstraction for pre/post-step cache validation.
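         | 
         | A rough sketch of that capture/compare shape (hypothetical
         | names and signature, not necessarily the real one):
         | 
         |     from dataclasses import dataclass
         |     from typing import Any, Callable
         | 
         |     class CacheMiss(Exception):
         |         pass
         | 
         |     @dataclass
         |     class Check:
         |         capture: Callable[[], Any]           # snapshot the environment
         |         compare: Callable[[Any, Any], bool]  # recorded vs. current
         | 
         |     def replay_step(step: Callable[[], None], check: Check,
         |                     recorded: Any) -> None:
         |         # pre-step validation: is the world still as recorded?
         |         if not check.compare(recorded, check.capture()):
         |             raise CacheMiss("environment drifted; fall back to agent")
         |         step()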
        
       | ivanovm wrote:
       | Would just caching LLM responses work here?
        
       | huevosabio wrote:
       | I love the idea!
       | 
       | I think the problem will be defining whether there is a cache
       | hit or not, since "agents" are loosely defined and the tasks can
       | include basically anything.
        
         | edunteman wrote:
         | I agree, Cache Validation is the singular concern of Muscle
         | Mem.
         | 
         | If you boil it down, for a generic enough task and environment,
         | the engine is just a database of previous environments and a
         | user-provided filter function for cache validation.
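         | 
         | Boiled all the way down, something like this (illustrative,
         | not the real engine):
         | 
         |     def lookup(db, current_env, is_match):
         |         # db: recorded (environment, trajectory) pairs
         |         # is_match: the user-provided filter function
         |         for env, trajectory in db:
         |             if is_match(env, current_env):
         |                 return trajectory  # cache hit: replay this
         |         return None                # cache miss: run the agent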
        
       | joshstrange wrote:
       | This is a neat idea and is similar to something I've been turning
       | over in my head. LLMs are very powerful for taking a bunch of
       | disparate tools/information/etc. and generating good results,
       | but speed is a big issue, as is reproducibility.
       | 
       | I keep imagining an agent that writes a bunch of custom tools
       | when it needs them and "saves" them for later use: creating
       | pipelines in code/config that it can reuse instead of solving
       | from 0 each time.
       | 
       | Essentially, I want to use LLMs for what they are good for (edge
       | cases, fuzzy instructions/data) and have them turn around and
       | write reusable tools, so that the next time, instead of running
       | the full LLM, a tiny LLM router up front can determine whether a
       | tool to do this already exists. I'm not talking about MCP
       | (though that is cool); this would use MCP tools but could make
       | new ones from the existing ones.
       | 
       | Here is an example.
       | 
       | Imagine I have an agent with MCP tools to read/write my email,
       | calendar, ticketing system, and Slack. I can ask the LLM to
       | Slack me every morning with an overview of my events for the day
       | and anything outstanding I need to address. Maybe the first pass
       | uses a frontier model to determine which tools to use, and it
       | accomplishes the task. Once I'm happy with the output, the agent
       | feeds the conversation/tool calls into another LLM to distill
       | them into a Python/Node/Bash/whatever script. That script would
       | call the MCP tools to do the same thing, use small LLMs to glue
       | the results together, and then create a cron (or similar) entry
       | to have it run every morning.
       | 
       | I feel like this would remove a ton of the uncertainty when it
       | comes to which tools an LLM uses without requiring humans to
       | write custom flows with limited tools available for each task.
       | 
       | So the first pass would be:
       | 
       |     User: Please check my email, calendar, and slack for what I
       |           need to focus on today.
       |     LLM: Tool Call: Read unread email
       |     LLM: Tool Call: Read last 7 days of emails the user replied to
       |     LLM: Tool Call: Read this week's events from calendar
       |     LLM: Tool Call: Read unread slack messages
       |     LLM: Tool Call: Read tickets in this sprint
       |     LLM: Tool Call: Read unread comments on tickets assigned to me
       |     LLM: Tool Call: Read slack conversations from yesterday
       |     LLM: Please use the following data to determine what the user
       |          needs to focus on today: <Inject context from tool calls>
       |     LLM: It looks like you have 3 meetings today at.....
       | 
       | Then a fresh LLM reviews that and writes a script to do all the
       | tool calls and jump straight to the last "Please use the
       | following data" prompt, which can be reused (cron'd or just
       | called when it makes sense).
       | 
       | I might be way off base, and I don't work in the space (I just
       | play around the edges), but this feels like a way to let agents
       | "learn" and grow. I've found that in practice you don't get good
       | results from throwing all your tools at one big LLM with your
       | prompt; you're better off limiting the tools and even creating
       | compound tools for certain jobs you do over and over. Lots of
       | little tool calls add up and take a long time, so a way for the
       | agent to dynamically create tools by combining other tools seems
       | like a huge win.
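       | 
       | A minimal sketch of that tiny-router-plus-saved-scripts idea
       | (all names hypothetical):
       | 
       |     from typing import Callable
       | 
       |     def handle(request: str,
       |                scripts: dict[str, Callable[[], str]],
       |                tiny_llm: Callable[[str], str],
       |                agent: Callable[[str], tuple[str, Callable[[], str]]],
       |                ) -> str:
       |         # tiny_llm only classifies the request against saved scripts
       |         choice = tiny_llm(
       |             f"Which of {sorted(scripts)} handles: {request!r}? "
       |             "Answer with a name, or NONE."
       |         )
       |         if choice in scripts:
       |             return scripts[choice]()  # cheap, reproducible path
       |         # expensive discovery path: the agent solves it and
       |         # distills a reusable script for next time
       |         answer, new_script = agent(request)
       |         scripts[f"script_{len(scripts)}"] = new_script
       |         return answer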
        
       | DrNosferatu wrote:
       | Wrote something similar into my rules - obtained mixed results.
       | 
       | Curious how effective this is.
        
       | hackgician wrote:
       | Accessibility (a11y) trees are super helpful for LLMs; we use
       | them extensively in Stagehand! The context is nice for browsers,
       | since you have existing frameworks like
       | selenium/playwright/puppeteer for actually acting on nodes in
       | the a11y tree.
       | 
       | What does the analog look like in more traditional computer use?
        
         | ctoth wrote:
         | There are a variety of accessibility frameworks: MSAA (old,
         | Windows-only), IA2, JAB, and UIA (newer). NVDA from NV Access
         | has an abstraction over these APIs that standardizes gathering
         | roles and other information from the matrix of a11y providers,
         | though note its GPL license, depending on how you want to use
         | it.
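         | 
         | For a taste of the UIA side, the (real) pywinauto library can
         | walk the tree; whether it surfaces anything useful for a given
         | legacy app is exactly the catch discussed upthread:
         | 
         |     # pip install pywinauto  (Windows only)
         |     from pywinauto import Desktop
         | 
         |     # Enumerate top-level windows via the UIA backend
         |     for win in Desktop(backend="uia").windows():
         |         print(win.window_text(), "-", win.element_info.control_type)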
        
       | lherron wrote:
       | Feels kinda like JIT compiling your agent prompts into code.
       | Awesome concept, hope it pans out.
        
       | dmos62 wrote:
       | I love the minimal approach and general-use focus.
       | 
       | If I understand correctly, the engine caches trajectories in the
       | simplest way possible, so if you have a cached trajectory a-b-c,
       | and you encounter c-b-d, there's no way to get a "partial" cache
       | hit, right? As I'm wrapping my head around this, I'm thinking
       | that the engine would have to be a great deal more complicated to
       | be able to judge when it's a safe partial hit.
       | 
       | Basically, I'm trying to imagine how applicable this approach
       | could be to a significantly noisier environment.
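       | 
       | One way to picture a partial hit (purely illustrative; nothing
       | suggests the engine does this today): replay the longest
       | validated prefix, then hand the remainder to the agent.
       | 
       |     def replay_prefix(cached_steps, step_ok, agent):
       |         # Replay cached steps while each pre-step check passes,
       |         # then let the agent finish from the failure point.
       |         for i, step in enumerate(cached_steps):
       |             if not step_ok(step):
       |                 return agent(resume_from=i)  # partial hit
       |             step()
       |         return None  # full hit: nothing left for the agent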
        
       | nico wrote:
       | Very cool concept!
       | 
       | I wish v0, Lovable, Bolt, et al. did something like this with
       | their suggested prompts.
       | 
       | It's such a poor user experience to pick a template, wait 3-5
       | minutes for generation, and then be dumped randomly onto either
       | an incredibly refined prototype or a ton of code that just
       | errors out, and in either case have no clue what to do next.
        
       | deepdarkforest wrote:
       | Not sure if this can work. We played around with something
       | similar for computer use too, but comparing embeddings to
       | cache-validate the starting position is a super gray area, with
       | no clear threshold. For example, the datetime on the bottom
       | right changes. Or if it's an app with a database etc., that can
       | change the embeddings arbitrarily. Also, you must do this at
       | every step, because, as you said, things might break at any
       | point. I just don't see how you can validate reliably. If
       | anything, if models are cheap, you could use another, cheaper
       | LLM call to compare screenshots, or adjust the playwright/API
       | script on the fly. We ended up writing a quite different
       | approach that worked surprisingly well.
       | 
       | There are definitely a lot of potential solutions; I'm curious
       | where this goes. IMO an embeddings approach won't be enough. I'm
       | more than happy to discuss what we did internally to achieve a
       | decent rate, though; the space is super promising for sure.
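       | 
       | To make the threshold problem concrete, this is the kind of
       | check in question (illustrative sketch; the 0.98 cutoff is
       | arbitrary, which is exactly the problem):
       | 
       |     import numpy as np
       | 
       |     def looks_same(a: np.ndarray, b: np.ndarray,
       |                    threshold: float = 0.98) -> bool:
       |         # Cosine similarity between two screenshot embeddings
       |         sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
       |         # Benign changes (a clock, a new database row) can push
       |         # the score below any fixed cutoff, arbitrarily.
       |         return sim >= threshold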
        
         | arathis wrote:
         | Hey, working on a personal project. Would love to dig into how
         | you approached this.
        
       | allmathl wrote:
       | > At Pig, we built computer-use agents for automating legacy
       | Windows applications (healthcare, lending, manufacturing, etc).
       | 
       | How do you justify this vs. fixing the software to enable
       | scripting? That seems both cheaper and easier to achieve, with
       | far higher yields. Assume market-rate servicing, of course.
       | 
       | Plus, how do you force an "agent" to correct its behavior?
        
         | nawgz wrote:
         | Sorry, am I missing something? They obviously do not control
         | the source for these applications, but they are able to gain
         | access to whatever benefit the software originally had -
         | reporting, tracking, API, whatever - by automating data-entry
         | tasks with AI.
         | 
         | Legacy software is frequently useful but difficult to install
         | and access.
        
           | allmathl wrote:
           | Deleted previous comment; made a new one.
           | 
           | Oh, you're correct. This business is built on sand.
        
             | primax wrote:
             | I think if you got some experience in the industries this
             | serves then you'd reconsider your opinion
        
               | allmathl wrote:
               | I don't see how improving the source could harm anyone.
               | The rest is just relationship details. The decent part
               | about relationship details is imprisonment.
        
               | Centigonal wrote:
               | As nawgz said, the applications they are automating are
               | often closed-source binary blobs. They _can't_ enable
               | scripting without some kind of RPA or AI agent.
        
       ___________________________________________________________________
       (page generated 2025-05-14 23:00 UTC)