[HN Gopher] Show HN: Magnitude - open-source, AI-native test fra...
       ___________________________________________________________________
        
       Show HN: Magnitude - open-source, AI-native test framework for web
       apps
        
        Hey HN, Anders and Tom here - we've been building an end-to-end
        testing framework powered by visual LLM agents to replace
        traditional web testing.
         
        We know there's a lot of noise about different browser agents.
        If you've tried any of them, you know they're slow, expensive,
        and inconsistent. That's why we built an agent specifically for
        running test cases and optimized it just for that:
         
        - Pure vision instead of the error-prone "set-of-marks" system
          (the colorful boxes you see in browser-use, for example)
        - Use a tiny VLM (Moondream) instead of OpenAI/Anthropic computer
          use for dramatically faster and cheaper execution
        - Use two agents: one for planning and adapting test cases and
          one for executing them quickly and consistently
         
        The idea is the planner builds up a general plan which the
        executor runs. We can save this plan and re-run it with only the
        executor for quick, cheap, and consistent runs. When something
        goes wrong, it can kick back out to the planner agent and
        re-adjust the test.
         
        It's completely open source. Would love to have more people try
        it out and tell us how we can make it great.
         
        Repo: https://github.com/magnitudedev/magnitude
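         
        To give a feel for the workflow, here's a rough sketch of what a
        Magnitude test file could look like (illustrative only - see the
        repo for the exact API):
         
            import { test } from 'magnitude-test';
         
            // Each step is plain natural language; the planner expands
            // it into concrete web actions for the executor to carry out.
            test('user can create a todo')
                .step('Log in to the app')
                    .check('Dashboard is visible')
                .step('Add a todo called "Buy milk"')
                    .check('"Buy milk" appears in the todo list');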
        
       Author : anerli
       Score  : 104 points
       Date   : 2025-04-25 17:00 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | grbsh wrote:
       | I know moondream is cheap / fast and can run locally, but is it
       | good enough? In my experience testing things like Computer Use,
       | anything but the large LLMs has been so unreliable as to be
       | unworkable. But maybe you guys are doing something special to
       | make it work well in concert?
        
         | anerli wrote:
         | So it's key to still have a big model that is devising the
         | overall strategy for executing the test case. Moondream on its
         | own is pretty limited and can't handle complex queries. The
         | planner gives very specific instructions to Moondream, which is
         | just responsible for locating different targets on the screen.
         | It's basically just the layer between the big LLM doing the
         | actual "thinking" and grounding that to specific UI
         | interactions.
         | 
          | Where it gets interesting is that we can save the execution
         | plan that the big model comes up with and run with ONLY
         | Moondream if the plan is specific enough. Then switch back out
         | to the big model if some action path requires adjustment. This
         | means we can run repeated tests much more efficiently and
         | consistently.
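          | 
          | Roughly, the execution loop looks something like this
          | (pseudocode - names like moondream/planner/execute are
          | stand-ins, not our actual code):
          | 
          |   let i = 0;
          |   while (i < plan.length) {
          |     const action = plan[i];
          |     // Moondream only locates the described target and
          |     // returns pixel coordinates plus a confidence score.
          |     const loc = await moondream.locate(action.target);
          |     if (!loc.found || loc.confidence < THRESHOLD) {
          |       // Kick back out to the big planner LLM to adjust
          |       // the plan, then resume from the adjusted step.
          |       plan = await planner.adjust(plan, i);
          |       continue;
          |     }
          |     await execute(action, loc);
          |     i++;
          |   }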
        
           | grbsh wrote:
           | Ooh, I really like the idea about deciding whether to use the
           | big or small model based on task specificity.
        
             | tough wrote:
              | You might like https://pypi.org/project/llm-predictive-router/
        
               | anerli wrote:
               | Oh this is interesting. In our case we are being very
               | specific about which types of prompts go where, so the
               | planner essentially creates prompts that will be executed
               | by Moondream, instead of trying to route prompts
               | generally to the appropriate model. The types of requests
               | that our planner agent vs Moondream can handle are
               | fundamentally different for our use case.
        
               | tough wrote:
                | Interesting, will check out yours. I'm mostly
                | interested in these dynamic routers so I can mix
                | local and API-based models depending on needs - I
                | can't run some models locally, but most of the tasks
                | don't even require such power (I'm building AI
                | agentic systems).
                | 
                | There's also https://github.com/lm-sys/RouteLLM
                | 
                | and other similar ones.
                | 
                | I guess your system is not as oriented toward
                | open-ended tasks, so you can just build workflows
                | deciding which model to use at each step; these
                | routing mechanisms are more useful for open-ended
                | tasks that don't fit a workflow so well (maybe?)
        
       | badmonster wrote:
       | How does Magnitude differentiate between the planner and executor
       | LLM roles, and how customizable are these components for specific
       | test flows?
        
         | anerli wrote:
         | So the prompts that are sent to the planner vs executor are
         | completely distinct. We allow complete customization of the
         | planner LLM with all major providers (Anthropic, OpenAI, Google
         | AI Studio, Google Vertex AI, AWS Bedrock, OpenAI compatible).
         | The executor LLM on the other hand has to fit very specific
         | criteria, so we only support the Moondream model right now. For
          | a model to act as the executor, it needs to be able to output
          | specific pixel coordinates (only a few models support this, for
          | example OpenAI/Anthropic computer use, Molmo, Moondream, and
          | some others). We like Moondream because it's super tiny and fast
         | (2B). This means as long as we still have a "smart" planner LLM
         | we can have very fast/cheap execution and precise UI
         | interaction.
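          | 
          | Picking the planner is just configuration, roughly along
          | these lines (illustrative - the exact option names may
          | differ):
          | 
          |   // magnitude.config.ts
          |   export default {
          |     planner: {
          |       provider: 'google-ai',
          |       model: 'gemini-2.5-pro'
          |     },
          |     // the executor is currently always Moondream
          |     executor: { provider: 'moondream' }
          |   };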
        
           | badmonster wrote:
           | does Moondream handle multi-step UI tasks reliably (like
           | opening a menu, waiting for render, then clicking), or do you
           | have to scaffold that logic separately in the planner?
        
             | anerli wrote:
             | The planner can plan out multiple web actions at once,
             | which Moondream can then execute in sequence on its own. So
             | Moondream is never deciding how to execute more than one
             | web action in a single prompt.
             | 
             | What this really means for developers writing the tests is
             | you don't really have to worry about it. A "step" in
             | Magnitude can map to any number of web actions dynamically
             | based on the description, and the agents will figure out
             | how to do it repeatably.
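              | 
              | For example, a single step like "add the first product
              | to the cart" might expand (illustratively) into:
              | 
              |   [
              |     { variant: 'click', target: 'first product card' },
              |     { variant: 'click', target: '"Add to cart" button' },
              |     { variant: 'click', target: 'cart icon in header' }
              |   ]
              | 
              | and Moondream just locates and performs each of those
              | in order.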
        
       | NitpickLawyer wrote:
       | > The idea is the planner builds up a general plan which the
       | executor runs. We can save this plan and re-run it with only the
       | executor for quick, cheap, and consistent runs. When something
       | goes wrong, it can kick back out to the planner agent and re-
       | adjust the test.
       | 
       | I've been recently thinking about testing/qa w/ VLMs + LLMs, one
       | area that I haven't seen explored (but should 100% be feasible)
       | is to have the first run be LLM + VLM, and then have the LLM(s?)
       | write repeatable "cheap" tests w/ traditional libraries
       | (playwright, puppeteer, etc). On every run you do the "cheap"
       | traditional checks, if any fail go with the LLM + VLM again and
       | see what broke, only fail the test if both fail. Makes sense?
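        | 
        | Roughly (sketch - the names here are just placeholders):
        | 
        |   try {
        |     // cheap, deterministic check generated once by the LLM
        |     await cheapPlaywrightCheck(page);
        |   } catch {
        |     // only fail the test if the LLM + VLM agent also fails
        |     const ok = await llmVlmAgentCheck(page);
        |     if (!ok) throw new Error('test failed');
        |   }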
        
         | anerli wrote:
         | So this is a path that we definitely considered. However we
          | think it's a half-measure to generate actual Playwright code and
         | just run that. Because if you do that, you still have a brittle
         | test at the end of the day, and once it breaks you would need
         | to pull in some LLM to try and adapt it anyway.
         | 
         | Instead of caching actual code, we cache a "plan" of specific
         | web actions that are still described in natural language.
         | 
         | For example, a cached "typing" action might look like: {
         | variant: 'type'; target: string; content: string; }
         | 
         | The target is a natural language description. The content is
         | what to type. Moondream's job is simply to find the target, and
         | then we will click into that target and type whatever content.
         | This means it can be full vision and not rely on DOM at all,
         | while still being very consistent. Moondream is also trivially
         | cheap to run since it's only a 2B model. If it can't find the
          | target or its confidence changes significantly (using token
         | probabilities), it's an indication that the action/plan
         | requires adjustment, and we can dynamically swap in the planner
         | LLM to decide how to adjust the test from there.
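          | 
          | In TypeScript terms, a cached plan is roughly a list of these
          | (the 'type' shape above is real; the extra variant here is
          | just illustrative):
          | 
          |   type Action =
          |     | { variant: 'type'; target: string; content: string }
          |     | { variant: 'click'; target: string };
          | 
          |   type CachedPlan = Action[];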
        
           | ekzy wrote:
           | Did you consider also caching the coordinates returned by
           | moondream? I understand that it is cheap, but it could be
           | useful to detect if an element has changed position as it may
           | be a regression
        
             | anerli wrote:
             | So the problem is if we cache the coordinates and click
             | blindly at the saved positions, there's no way to tell if
             | the interface changes or if we are actually clicking the
              | wrong things (unless we try and do something hacky like
              | listening for events on the DOM). Detecting whether elements
              | have changed position, though, would definitely be feasible
              | when re-running a test with Moondream - we could compare
              | against the coordinates of the last run.
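              | 
              | e.g. something like (sketch - helper names are
              | placeholders):
              | 
              |   const prev = lastRun.coords[i];
              |   const curr = await moondream.locate(action.target);
              |   if (distance(prev, curr) > TOLERANCE) {
              |     warn('element moved - possible layout regression');
              |   }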
        
       | tobr wrote:
       | Interesting! My first concern is - isn't this the ultimate non-
       | deterministic test? In practice, does it seem flaky?
        
         | anerli wrote:
         | So the architecture is built with determinism in mind. The
         | plan-caching system is still a work in progress, but especially
         | once fully implemented it should be very consistent. As long as
         | your interface doesn't change (or changes in trivial ways),
         | Moondream alone can execute the same exact web actions as
         | previous test runs without relying on any DOM selectors. When
         | the interface does eventually change, that's where it becomes
         | non-deterministic again by necessity, since the planner will
         | need to generatively update the test and continue building the
         | new cache from there. However once it's been adapted, it can
         | once again be executed that way every time until the interface
         | changes again.
        
           | daxfohl wrote:
           | In a way, nondeterminism could be an advantage. Instead of
           | using these as unit tests, use them as usability tests.
           | Especially if you want your site to be accessible by AI
           | agents, it would be good to have a way of knowing what tweaks
           | increase the success rate.
           | 
           | Of course that would be even more valuable for testing your
           | MCP or A2A services, but could be useful for UI as well. Or
           | it could be useless. It would be interesting to see if the
           | same UI changes affect both human and AI success rate in the
           | same way.
           | 
           | And if not, could an AI be trained to correlate more closely
            | to human behavior? That could be a good selling point if
           | possible.
        
             | anerli wrote:
             | Originally we were actually thinking about doing exactly
             | this and building agents for usability testing. However, we
             | think that LLMs are much better suited for tackling well
             | defined tasks rather than trying to emulate human nuance,
             | so we pivoted to end-to-end testing and figuring out how to
             | make LLM browser agents act deterministically.
        
       | dimal wrote:
       | > Pure vision instead of error prone "set-of-marks" system (the
       | colorful boxes you see in browser-use for example)
       | 
        | One benefit of _not_ using pure vision is that it's a strong
        | signal to developers to make pages accessible. This would let
        | them off the hook.
       | 
       | Perhaps testing _both_ paths separately would be more
       | appropriate. I could imagine a different AI agent attempting to
       | navigate the page through accessibility landmarks. Or even
       | different agents that simulate different types of disabilities.
        
         | anerli wrote:
         | Yeah good criticism for sure. We definitely want to keep this
         | in mind as we continue to build. Some kind of accessibility
         | tests which run in parallel with each visual test that are only
         | allowed to use the accessibility tree could make it much easier
         | for developers to identify how to address different
         | accessibility concerns.
        
       | jcmontx wrote:
       | Does it only work for node projects? Can I run it against a
       | Staging environment without mixing it with my project?
        
         | anerli wrote:
         | You can run it against any URL, not just node projects! You'll
         | still need a skeleton node project for the actual Magnitude
         | tests, but you could configure some other public or staging URL
         | as the target site.
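          | 
          | e.g. something along these lines in the config
          | (illustrative):
          | 
          |   // magnitude.config.ts
          |   export default {
          |     url: 'https://staging.example.com'
          |   };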
        
       | pandemic_region wrote:
       | Bang me sideways, "AI-native" is a thing now? What does that even
       | mean?
        
         | Alifatisk wrote:
          | Had to look it up too!
          | https://www.studioglobal.ai/blog/what-is-ai-native
        
         | mcbuilder wrote:
          | It definitely means something - upon first hearing it, probably
          | an app designed around being interacted with by an LLM. Browser
          | interaction is one of those things that is a great killer app
          | for LLMs IMO.
         | 
         | For instance, I just discovered there are a ton of high quality
         | scans of film and slides available at the Library of Congress
         | website, but I don't really enjoy their interface. I could
          | build a scraping tool and get too much info, or suffer through
          | just clicking around their search UI. Or I could ask my
         | browser tool wielding LLM agent to automate the boring stuff
         | and provide a map of the subjects I would be interested in, and
         | give me a different way to discover things. I've just
         | discovered the entire browser automation thing, and I'm having
          | fun having my LLM go "research" for a few minutes while I go do
         | something else.
        
         | anerli wrote:
         | Well yeah it's kind of ambiguous, it's just our way of saying
         | that we're trying to use AI to make testing easier!
        
       | SparkyMcUnicorn wrote:
       | This is pretty much exactly what I was going to build. It's
       | missing a few things, so I'll either be contributing or forking
       | this in the future.
       | 
       | I'll need a way to extract data as part of the tests, like
       | screenshots and page content. This will allow supplementing the
       | tests with non-magnitude features, as well as add things that are
       | a bit more deterministic. Assert that the added todo item exactly
       | matches what was used as input data, screenshot diffs when the
       | planner fallback came into play, execution log data, etc.
       | 
       | This isn't currently possible from what I can see in the docs,
       | but maybe I'm wrong?
       | 
       | It'd also be ideal if it had an LLM-free executor mode to reduce
        | costs and increase speed (caching outputs, or maybe using the
        | accessibility tree instead of a VLM), and also a way to control
        | when the planner should not automatically kick in.
        
         | anerli wrote:
         | Hey, awesome to hear! We are definitely open to contributions
         | :)
         | 
         | We plan to (very soon) enable mixing standard Playwright or
         | other code in between Magnitude steps, which should enable
         | doing exact assertions or anything else you want to do.
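          | 
          | Rough idea of how that could look (API not final):
          | 
          |   test('todo persists after reload')
          |     .step('Add a todo called "Buy milk"')
          |     .playwright(async (page) => {
          |       // exact, deterministic assertion between AI steps
          |       await expect(page.getByText('Buy milk')).toBeVisible();
          |     })
          |     .step('Reload the page')
          |       .check('"Buy milk" is still in the list');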
         | 
         | Definitely understand the need to reduce costs / increase
         | speed, which mainly we think will be best enabled by our plan-
         | caching system that will get executed by Moondream (a 2B
         | model). Moondream is very fast and also has self-hosted
         | options. However there's no reason we couldn't potentially have
         | an option to generate pure Playwright for people who would
         | prefer to do that instead.
         | 
         | We have a discord as well if you'd like to easily stay in touch
         | about contributing: https://discord.gg/VcdpMh9tTy
        
       | sergiomattei wrote:
       | Hi, this looks great! Any plans to support Azure OpenAI as a
       | backend?
        
         | anerli wrote:
          | Hey! We can add this pretty easily! We find that Gemini 2.5 Pro
         | works the best as the planner model by a good margin, but we
         | definitely want to support a variety of providers. I'll keep
         | this in mind and implement soon!
         | 
         | edit: tracking here
         | https://github.com/magnitudedev/magnitude/issues/6
        
       ___________________________________________________________________
       (page generated 2025-04-25 23:00 UTC)