[HN Gopher] Metaflow: Build, Manage and Deploy AI/ML Systems
       ___________________________________________________________________
        
       Metaflow: Build, Manage and Deploy AI/ML Systems
        
       Author : plokker
       Score  : 112 points
       Date   : 2025-07-16 20:34 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | nxobject wrote:
       | As a fun historical sidebar and an illustration that there are no
       | new names in tech these days, Metaflow was also the name of the
       | company that first introduced out-of-order speculative execution
       | for CISC architectures using micro-ops. [1]
       | 
       | [1] https://en.wikipedia.org/wiki/Metaflow_Technologies
        
       | vtuulos wrote:
       | I don't know if it's a coincidence, but we just released a major
       | new feature in Metaflow a few days ago - composing flows with
       | custom decorators: https://docs.metaflow.org/metaflow/composing-
       | flows/introduct...
       | 
       | A big deal is that they get packaged automatically for remote
       | execution. And you can attach them on the command line without
       | touching code, which makes it easy to build pipelines with
       | pluggable functionality - think e.g. switching an LLM provider on
       | the fly.
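       | 
       | To make that concrete, here's a minimal sketch using the long-
       | standing --with mechanism (the flow, file, and step names below
       | are made up; custom decorators attach the same way):
       | 
       |   # hello_flow.py - a toy flow; nothing here is cloud-specific
       |   from metaflow import FlowSpec, step
       | 
       |   class HelloFlow(FlowSpec):
       |       @step
       |       def start(self):
       |           self.message = "hello"   # stored as an artifact
       |           self.next(self.end)
       | 
       |       @step
       |       def end(self):
       |           print(self.message)
       | 
       |   if __name__ == "__main__":
       |       HelloFlow()
       | 
       | Running python hello_flow.py run executes it locally, while
       | python hello_flow.py run --with retry --with kubernetes attaches
       | those decorators to every step at launch time without editing
       | the file.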
       | 
       | If you haven't looked into Metaflow recently, configuration
       | management is another big feature that was contributed by the
       | team at Netflix: https://netflixtechblog.com/introducing-
       | configurable-metaflo...
       | 
       | Many folks love the new native support for uv too:
       | https://docs.metaflow.org/scaling/dependencies/uv
       | 
       | I'm happy to answer any questions here
        
         | theOGognf wrote:
         | Is it common to see Metaflow used alongside MLflow if a team
         | wants to track experiment data?
        
           | vtuulos wrote:
           | Metaflow tracks all artifacts and allows you to build
           | dashboards with them, so there's no need to use MLflow per
           | se. There are Metaflow integrations for Weights & Biases,
           | Comet ML, etc., if you want pretty off-the-shelf dashboards.
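           | 
           | For example, artifacts from past runs can be pulled into a
           | notebook with the Client API (the flow and artifact names
           | below are hypothetical):
           | 
           |   from metaflow import Flow
           | 
           |   # latest successful run of a (hypothetical) training flow
           |   run = Flow("TrainFlow").latest_successful_run
           |   print(run.pathspec, run.finished_at)
           | 
           |   # anything assigned to self.* in a step is an artifact
           |   model = run.data.model
           |   metrics = run.data.metrics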
        
       | lazarus01 wrote:
       | I went to the GitHub page. The descriptions of the service seem
       | redundant with what cloud providers offer today. I looked at the
       | documentation, and it lacks concrete examples of implementation
       | flows.
       | 
       | Seems like something new to learn, an added layer on top of
       | existing workflows, with no obvious benefit.
        
         | manojlds wrote:
         | It's an old project from before the current AI buzz, and I
         | passed on it when I looked at it a few years back as well, for
         | similar reasons.
         | 
         | My opinion about Netflix OSS has been pretty low as well.
        
         | datadrivenangel wrote:
         | All the cloud providers have some hosted / custom version of an
         | AI/ML deployment and training system. Good enough to use, janky
         | enough to probably not meet all your needs if you're serious.
        
           | lazarus01 wrote:
           | I use Google Cloud for ML. AWS has a similar offering.
           | 
           | I find Google is purpose-built for ML and provides tons of
           | resources with excellent documentation.
           | 
           | AWS feels like driving a double-decker bus, very big and
           | clunky, compared to Google, which is a luxury sedan that
           | comfortably takes you where you're going.
        
         | vibecodemaster wrote:
         | > redundant to what cloud providers offer today
         | 
         | It may look redundant on the surface, but those cloud services
         | are infrastructure primitives (compute, storage,
         | orchestration). Metaflow sits one layer higher, giving you a
         | data/model centric API that orchestrates and versions the
         | entire workflow (code, data, environment, and lineage) while
         | delegating the low-level plumbing to whatever cloud provider
         | you choose. That higher-level abstraction is what lets the same
         | Python flow run untouched on a laptop today and a K8s GPU
         | cluster tomorrow.
         | 
         | > Adds an extra layer to learn
         | 
         | I would argue that it removes layers: you write plain Python
         | functions, tag them as steps, and Metaflow handles scheduling,
         | data movement, retry logic, versioning, and caching. You no
         | longer glue together five different SDKs (batch + orchestration
         | + storage + secrets + lineage).
         | 
         | > lacks concrete examples for implementation flows
         | 
         | There are examples in the tutorials:
         | https://docs.outerbounds.com/intro-tutorial-season-3-overvie...
         | 
         | > with no obvious benefit
         | 
         | There are benefits, but perhaps they're not immediately
         | obvious:
         | 
         | 1) Separation of what vs. how: declare the workflow once;
         | toggle @resources(cpu=4, gpu=1) to move from dev to a GPU
         | cluster--no YAML rewrites (see the sketch at the end of this
         | comment).
         | 
         | 2) Reproducibility & lineage: every run immutably stores code,
         | data hashes, and parameters, so you can reproduce any past
         | model or report with resume --origin-run-id.
         | 
         | 3) Built-in data artifacts: pass or version GB-scale objects
         | between steps without manually wiring S3 paths or serialization
         | logic.
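         | 
         | A minimal sketch of what that looks like in practice (the
         | flow, step, and artifact names are purely illustrative):
         | 
         |   from metaflow import FlowSpec, step, resources
         | 
         |   class TrainFlow(FlowSpec):
         |       @step
         |       def start(self):
         |           # anything assigned to self.* is versioned and
         |           # passed to later steps automatically
         |           self.rows = list(range(1000))
         |           self.next(self.train)
         | 
         |       # same code; this step just asks for more hardware
         |       @resources(cpu=4, gpu=1)
         |       @step
         |       def train(self):
         |           self.model = sum(self.rows)  # stand-in for training
         |           self.next(self.end)
         | 
         |       @step
         |       def end(self):
         |           print("model:", self.model)
         | 
         |   if __name__ == "__main__":
         |       TrainFlow()
         | 
         | python train_flow.py run executes this locally; python
         | train_flow.py run --with batch (or --with kubernetes) sends
         | the same steps to the cloud with no code changes.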
        
       | anentropic wrote:
       | I've been curious about this project for a while...
       | 
       | If you squint a bit it's sort of like an Airflow that can run on
       | AWS Step Functions.
       | 
       | Step Functions sort of gives you fully serverless orchestration,
       | which feels like a thing that should exist. But the process for
       | authoring them is very cumbersome - they are crying out for a
       | nice language-level library, i.e. for Python, something that
       | creates steps via decorator syntax.
       | 
       | And it looks like Metaflow basically provides that (as well as
       | for other backends).
       | 
       | The main thing holding me back is the lack of ecosystem. A big
       | chunk of what I want to run on an orchestrator is things like dbt
       | and dlt jobs, both of which have strong integrations with both
       | Airflow and Dagster. Metaflow, by contrast, feels like it's not
       | really on the radar, not widely used.
       | 
       | Possibly I have got the wrong end of the stick a bit, because
       | Metaflow also provides an Airflow backend - in which case I sort
       | of wonder why bother with Metaflow at all?
        
         | vtuulos wrote:
         | Metaflow was started to address the needs of ML/AI projects
         | whereas Airflow and Dagster started in data engineering.
         | 
         | Consequently, a major part of Metaflow focuses on facilitating
         | easy and efficient access to (large scale) compute - including
         | dependency management - and local experimentation, which is out
         | of scope for Airflow and Dagster.
         | 
         | Metaflow has basic support for dbt, and companies increasingly
         | use it to power data engineering as AI is eating the world,
         | but if you just need an orchestrator for ETL pipelines, Dagster
         | is a great choice.
         | 
         | If you are curious to hear how companies navigate the question
         | of Airflow vs Metaflow, see e.g. this recent talk by Flexport:
         | https://youtu.be/e92eXfvaxU0
        
         | kot-behemoth wrote:
         | A while ago I saw a promising Clojure project, stepwise [0],
         | which sounds pretty close to what you're describing. It not
         | only allows you to define steps in code, but also implements
         | cool stuff like the ability to write conditions, error
         | statuses, and resources in much less verbose EDN instead of
         | JSON. It also supports code reloading and offloading large
         | payloads to S3.
         | 
         | Here's a nice article with code examples implementing a simple
         | pipeline: https://www.quantisan.com/orchestrating-pizza-making-
         | a-tutor....
         | 
         | [0]: https://github.com/Motiva-AI/stepwise
        
           | spieden wrote:
           | Wow cool, a project I created got a mention on HN. :D
        
         | coredog64 wrote:
         | A few years back, the Step Functions team was soliciting input,
         | and the Python thing was something that came up as a
         | suggestion. It's hard, yes, but it should be possible to
         | "Starlark" this and tell users that if you stick to this
         | syntax, you can write Python and compile it down to native
         | Step Functions syntax.
         | 
         | Having said that, they have slightly improved Step Functions
         | by adopting JSONata syntax.
        
           | anentropic wrote:
           | I don't think it should need Starlark or a restricted syntax.
           | 
           | You just want some Python code that builds up a
           | representation of the state machine, e.g. via decorating
           | functions the same way that Celery, Dask, Airflow, Dagster et
           | al have done for years.
           | 
           | Then you have some other command to take that representation
           | and generate the actual Step Functions JSON from it (and then
           | deploy it etc).
           | 
           | But the missing piece is that those other tools also
           | explicitly give you a Python execution environment, so the
           | function you're decorating is usually the 'task' function you
           | want to run remotely.
           | 
           | Whereas Step Functions doesn't provide compute itself; it
           | mostly just gives you a way to execute AWS API calls. But the
           | non-control-flow tasks in my Step Functions end up mostly
           | being Lambda invoke steps that run my Python code.
           | 
           | I'm currently authoring Step Functions via CDK. It is clunky
           | AF.
           | 
           | What it needs is some moderately opinionated layer on top.
           | 
           | Someone at AWS did have a bit of an attempt here:
           | https://aws-step-functions-data-science-
           | sdk.readthedocs.io/e... but I'd really like to see something
           | that goes further and smooths away a lot of the finickety
           | JSON input arg/response wrangling. Also the local testing
           | story (for Step Functions generally) is pretty meh.
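           | 
           | To make the idea concrete, a toy sketch (entirely
           | hypothetical, not an existing library): decorated functions
           | register themselves as states, and a separate compile step
           | chains them into Amazon States Language JSON.
           | 
           |   import json
           | 
           |   _STATES = []  # ordered registry of (name, lambda_arn)
           | 
           |   def task(lambda_arn):
           |       """Register the function as a Lambda-invoke state."""
           |       def wrap(fn):
           |           _STATES.append((fn.__name__, lambda_arn))
           |           return fn
           |       return wrap
           | 
           |   @task("arn:aws:lambda:us-east-1:111122223333:function:extract")
           |   def extract(event):
           |       ...  # the body runs in Lambda, not locally
           | 
           |   @task("arn:aws:lambda:us-east-1:111122223333:function:train")
           |   def train(event):
           |       ...
           | 
           |   def compile_asl():
           |       """Chain the registered states and emit ASL JSON."""
           |       states = {}
           |       for i, (name, arn) in enumerate(_STATES):
           |           state = {"Type": "Task", "Resource": arn}
           |           if i + 1 < len(_STATES):
           |               state["Next"] = _STATES[i + 1][0]
           |           else:
           |               state["End"] = True
           |           states[name] = state
           |       return json.dumps(
           |           {"StartAt": _STATES[0][0], "States": states},
           |           indent=2,
           |       )
           | 
           |   if __name__ == "__main__":
           |       print(compile_asl())  # feed this to SFN / CDK / CFN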
        
             | vtuulos wrote:
             | If you are ok with executing your SFN steps on AWS Batch,
             | Metaflow should do the job well. It's pretty inhuman to
             | interact with SFN directly.
             | 
             | One feature on our roadmap is the ability to define the
             | DAG fully programmatically, maybe through configs, so you
             | will be able to go from a custom representation -> SFN
             | JSON, just using Metaflow as a compiler.
        
       | ShamblingMound wrote:
       | I've been looking for an orchestrator for AI workflows, including
       | agentic workflows, and this seemed to be the most promising (open
       | source, free, can self-host, and supports dynamic workflows).
       | 
       | But I have not seen anyone talk about it in that context. What do
       | people use for AI workflow orchestration (aside from LangChain)?
        
         | vtuulos wrote:
         | Stay tuned! We have some cool new features coming soon to
         | support agentic workloads (teaser:
         | https://github.com/Netflix/metaflow/pull/2473)
         | 
         | If you are curious, join the Metaflow Slack at
         | http://slack.outerbounds.co and start a thread on #ask-metaflow
        
       | awgl wrote:
       | I've used Metaflow for the past 4 years or so on different ML
       | teams. It's really great!
       | 
       | Straightforward for data/ML scientists to pick up, a familiar
       | Python class API for defining DAGs, and it simplifies scaling
       | out parallel jobs on AWS Batch (or k8s). The UI is pretty nice.
       | Been happy to see the active development on it too.
       | 
       | Currently using it at our small biotech startup to run thousands
       | of protein engineering computations (including models like
       | RFDiffusion, ProteinMPNN, boltz, AlphaFold, ESM, etc.).
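       | 
       | That kind of embarrassingly parallel work maps onto Metaflow's
       | foreach fan-out; a rough sketch (step names and resource numbers
       | here are made up):
       | 
       |   from metaflow import FlowSpec, step, batch
       | 
       |   class FoldFlow(FlowSpec):
       |       @step
       |       def start(self):
       |           self.sequences = ["seq_a", "seq_b", "seq_c"]
       |           # fan out: one fold task per sequence
       |           self.next(self.fold, foreach="sequences")
       | 
       |       @batch(cpu=8, memory=32000)  # one AWS Batch job per task
       |       @step
       |       def fold(self):
       |           # self.input is this task's slice of the fan-out
       |           self.result = "folded:" + self.input
       |           self.next(self.join)
       | 
       |       @step
       |       def join(self, inputs):
       |           self.results = [i.result for i in inputs]
       |           self.next(self.end)
       | 
       |       @step
       |       def end(self):
       |           print(self.results)
       | 
       |   if __name__ == "__main__":
       |       FoldFlow()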
       | 
       | Data engineering focused DAG tools like Airflow are awkward for
       | doing these kinds of ML computations, where we don't need the
       | complexity of schedules, etc. Metaflow, imho, is also a step up
       | from orchestration tools that were born out of bioinformatics
       | groups, like Snakemake or Nextflow.
       | 
       | Just a satisfied customer of Metaflow here. thx
        
         | Bukhmanizer wrote:
         | If you've tried, has it been clunky to run non-Python-based
         | workflows? I.e., if you want to run bedtools or diamond without
         | having to run a bunch of subprocess.run commands?
        
           | awgl wrote:
           | Right, for most of our workflows, we stay in python land,
           | which is great and seamless with Metaflow being in python.
           | But yes, there are occasions that we have to make a system
           | call to run an old R script or even a compiled C++ executable
           | :shrug: (Metaflow does have some native R support tho) I have
           | not had to use the specific tools you called out, bedtools or
           | diamond.
           | 
           | Most of the time this is not a blocking problem, since each
           | step in a flow is mapped to a Docker image and/or your choice
           | of EC2 instance (e.g. one step on a GPU, another on a memory-
           | optimized instance). You can have one step use an image with
           | all of your Python-based ML stuff, and another step use a
           | different image with compiled executables that are triggered
           | by a system call. If needed, outputs from such a system call
           | then get persisted in a database/S3 or read back into the
           | Python flow as artifacts. So, it is not as seamless as a flow
           | in all Python, but it can work "good enough".
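           | 
           | A rough sketch of that pattern (the image name is a
           | placeholder and echo stands in for the real tool): each step
           | requests its own Docker image, the non-Python tool runs via
           | a system call, and its output is stored as a normal artifact.
           | 
           |   import subprocess
           |   from metaflow import FlowSpec, step, batch
           | 
           |   class MixedToolFlow(FlowSpec):
           |       @step
           |       def start(self):
           |           self.next(self.run_tool)
           | 
           |       # hypothetical image containing the compiled tool
           |       @batch(image="my-registry/bio-tools:latest",
           |              memory=16000)
           |       @step
           |       def run_tool(self):
           |           # system call to a non-Python executable
           |           proc = subprocess.run(
           |               ["echo", "pretend tool output"],
           |               capture_output=True, text=True, check=True,
           |           )
           |           # persist the output as a Metaflow artifact
           |           self.tool_output = proc.stdout
           |           self.next(self.end)
           | 
           |       @step
           |       def end(self):
           |           print(self.tool_output)
           | 
           |   if __name__ == "__main__":
           |       MixedToolFlow()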
        
       | LaserToy wrote:
       | CloudKitchens uses it as well:
       | https://techblog.cloudkitchens.com/p/ml-infrastructure-doesn...
       | 
       | They call it a DREAM stack (Daft, Ray Engine or just Ray,
       | Poetry, Argo, and Metaflow)
        
         | vibecodemaster wrote:
         | There are actually a lot of companies using Metaflow, big and
         | small: https://outerbounds.com/stories
        
       | apwell23 wrote:
       | Netflix used to release so much good open-source software a
       | decade ago. Now it seems to have fallen out of developer
       | mindshare. It seems like the odd one out in FAANG in terms of
       | tech and AI.
        
       ___________________________________________________________________
       (page generated 2025-07-17 23:01 UTC)