[HN Gopher] Metaflow: Build, Manage and Deploy AI/ML Systems
___________________________________________________________________
Metaflow: Build, Manage and Deploy AI/ML Systems
Author : plokker
Score : 112 points
Date : 2025-07-16 20:34 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| nxobject wrote:
| As a fun historical sidebar and an illustration that there are no
| new names in tech these days, Metaflow was also the name of the
| company that first introduced out-of-order speculative execution
| of CISC architectures using micro-ops. [1]
|
| [1] https://en.wikipedia.org/wiki/Metaflow_Technologies
| vtuulos wrote:
| I don't know if it's a coincidence but we just released a major
| new feature in Metaflow a few days ago - composing flows with
| custom decorators: https://docs.metaflow.org/metaflow/composing-
| flows/introduct...
|
| A big deal is that they get packaged automatically for remote
| execution. And you can attach them on the command line without
| touching code, which makes it easy to build pipelines with
| pluggable functionality - think e.g. switching an LLM provider on
| the fly.
|
| If you haven't looked into Metaflow recently, configuration
| management is another big feature that was contributed by the
| team at Netflix: https://netflixtechblog.com/introducing-
| configurable-metaflo...
|
| Many folks love the new native support for uv too:
| https://docs.metaflow.org/scaling/dependencies/uv
|
| I'm happy to answer any questions here
| theOGognf wrote:
| Is it common to see Metaflow used alongside MLflow if a team
| wants to track experiment data?
| vtuulos wrote:
| Metaflow tracks all artifacts and allows you to build
| dashboards with them, so there's no need to use MLFlow per
| se. There's a Metaflow integration in Weights and Biases,
| CometML etc, if you want pretty off-the-shelf dashboards
| lazarus01 wrote:
| I went to the GitHub page. The descriptions of the service seem
| redundant to what cloud providers offer today. I looked at the
| documentation and it lacks concrete examples for implementation
| flows.
|
| Seems like something new to learn, an added layer on top of
| existing workflows, with no obvious benefit.
| manojlds wrote:
|   It's an old project from before the current AI buzz, and I
|   rejected it when I looked at it a few years back, for similar
|   reasons.
|
| My opinion about Netflix OSS has been pretty low as well.
| datadrivenangel wrote:
| All the cloud providers have some hosted / custom version of an
| AI/ML deployment and training system. Good enough to use, janky
| enough to probably not meet all your needs if you're serious.
| lazarus01 wrote:
| I use google cloud for ML. AWS has a similar offering.
|
| I find google is purpose built for ml and provides tons of
| resources with excellent documentation.
|
| AWS feels like driving a double decker bus, very big and
| clunky, compared to google, which is a luxury sedan, that is
| quite comfortable to take you where you're going.
| vibecodemaster wrote:
| > redundant to what cloud providers offer today
|
| It may look redundant on the surface, but those cloud services
| are infrastructure primitives (compute, storage,
| orchestration). Metaflow sits one layer higher, giving you a
| data/model centric API that orchestrates and versions the
| entire workflow (code, data, environment, and lineage) while
| delegating the low-level plumbing to whatever cloud provider
| you choose. That higher-level abstraction is what lets the same
| Python flow run untouched on a laptop today and a K8s GPU
| cluster tomorrow.
|
| > Adds an extra layer to learn
|
| I would argue that it removes layers: you write plain Python
| functions, tag them as steps, and Metaflow handles scheduling,
| data movement, retry logic, versioning, and caching. You no
| longer glue together five different SDKs (batch + orchestration
| + storage + secrets + lineage).
|
| > lacks concrete examples for implementation flows
|
| there are examples in the tutorials:
| https://docs.outerbounds.com/intro-tutorial-season-3-overvie...
|
| > with no obvious benefit
|
| There are benefits, but perhaps they're not immediately
| obvious:
|
| 1) Separation of what vs. how: declare the workflow once;
| toggle @resources(cpu=4,gpu=1) to move from dev to a GPU
| cluster--no YAML rewrites.
|
|   2) Reproducibility & lineage: every run immutably stores code,
|   data hashes, and parameters so you can reproduce any past model
|   or report with flow.py resume --origin-run-id.
|
| 3) Built-in data artifacts: pass or version GB-scale objects
| between steps without manually wiring S3 paths or serialization
| logic.
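|
|   A toy sketch of points 2) and 3), using only the Python standard
|   library. It illustrates the general idea of content-addressed
|   artifacts and per-run lineage; it is not Metaflow's implementation,
|   and ArtifactStore, run_flow, load and train are hypothetical names
|   made up for this example.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class ArtifactStore:
    """Toy content-addressed store: each object is saved under its SHA-256."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, obj):
        blob = pickle.dumps(obj)
        key = hashlib.sha256(blob).hexdigest()
        path = self.root / key
        if not path.exists():  # identical content is written only once
            path.write_bytes(blob)
        return key

    def get(self, key):
        return pickle.loads((self.root / key).read_bytes())


def run_flow(store, steps, params):
    """Run steps in order and record the artifact key each one produced."""
    lineage = {"params": store.put(params)}
    data = params
    for step in steps:
        data = step(data)
        lineage[step.__name__] = store.put(data)
    return lineage


# Hypothetical steps: plain functions that pass data to the next step.
def load(params):
    return list(range(params["n"]))


def train(rows):
    return {"mean": sum(rows) / len(rows)}


store = ArtifactStore(tempfile.mkdtemp())
run_a = run_flow(store, [load, train], {"n": 5})
run_b = run_flow(store, [load, train], {"n": 5})
print(store.get(run_a["train"]))
```

|   Because keys are derived from content, two runs with the same
|   parameters end up pointing at the same stored artifacts, and any
|   past result can be fetched again from the lineage record alone.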
| anentropic wrote:
| I've been curious about this project for a while...
|
| If you squint a bit it's sort of like an Airflow that can run on
| AWS Step Functions.
|
| Step Functions sort of gives you fully serverless orchestration,
| which feels like a thing that should exist. But the process for
| authoring them is very cumbersome - they are crying out for a
| nice language level library i.e. for Python something that
| creates steps via decorator syntax.
|
| And it looks like Metaflow basically provides that (as well as
| for other backends).
|
| The main thing holding me back is lack of ecosystem. A big chunk
| of what I want to run on an orchestrator are things like dbt and
| dlt jobs, both of which have strong integrations for both Airflow
| and Dagster. Whereas Metaflow feels like not really on the radar,
| not widely used.
|
| Possibly I have got the wrong end of the stick a bit, because
| Metaflow also provides an Airflow backend, and in that case I sort
| of wonder why bother with Metaflow?
| vtuulos wrote:
| Metaflow was started to address the needs of ML/AI projects
| whereas Airflow and Dagster started in data engineering.
|
| Consequently, a major part of Metaflow focuses on facilitating
| easy and efficient access to (large scale) compute - including
| dependency management - and local experimentation, which is out
| of scope for Airflow and Dagster.
|
| Metaflow has basic support for dbt and companies use it
| increasingly to power data engineering as AI is eating the
| world, but if you just need an orchestrator for ETL pipelines,
| Dagster is a great choice
|
| If you are curious to hear how companies navigate the question
|   of Airflow vs Metaflow, see e.g. this recent talk by Flexport
| https://youtu.be/e92eXfvaxU0
| kot-behemoth wrote:
| A while ago I saw a promising Clojure project stepwise [0]
| which sounds pretty close to what you're describing. It not
| only allows you to define steps in code, but also implements
| cool stuff like ability to write conditions, error statuses and
| resources in a much-less verbose EDN instead of JSON. It also
| supports code reloading and offloading large payloads to S3.
|
| Here's a nice article with code examples implementing a simple
| pipeline: https://www.quantisan.com/orchestrating-pizza-making-
| a-tutor....
|
| [0]: https://github.com/Motiva-AI/stepwise
| spieden wrote:
| Wow cool, a project I created got a mention on HN. :D
| coredog64 wrote:
| A few years back, the Step Functions team was soliciting input,
| and the Python thing was something that came up as a
| suggestion. It's hard, yes, but it should be possible to
| "Starlark" this and tell users that if you stick to this
| syntax, you can write Python and compile it down to native
| StepFunction syntax.
|
| Having said that, they have slightly improved the StepFunctions
| by adopting JSONata syntax.
| anentropic wrote:
| I don't think it should need Starlark or a restricted syntax.
|
| You just want some Python code that builds up a
| representation of the state machine, e.g. via decorating
| functions the same way that Celery, Dask, Airflow, Dagster et
| al have done for years.
|
| Then you have some other command to take that representation
| and generate the actual Step Functions JSON from it (and then
| deploy it etc).
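|
|     A minimal sketch of that two-stage idea, assuming a purely linear
|     workflow where every task is a Lambda invoke. The task decorator,
|     to_asl and the function names "extract-fn" and "load-fn" are
|     hypothetical; this is not an AWS or Metaflow API, just plain
|     Python emitting Amazon States Language.

```python
import json

# Hypothetical mini-library: none of these names come from AWS or Metaflow.
_TASKS = []  # registration order defines a linear state machine


def task(lambda_name):
    """Register a function as a Task state backed by a (hypothetical) Lambda."""
    def decorator(func):
        _TASKS.append((func.__name__, lambda_name))
        return func  # the function itself stays callable for local testing
    return decorator


def to_asl(tasks):
    """Compile the registered tasks into an Amazon States Language dict."""
    if not tasks:
        raise ValueError("no tasks registered")
    states = {}
    for i, (name, lambda_name) in enumerate(tasks):
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": lambda_name, "Payload.$": "$"},
            "OutputPath": "$.Payload",
        }
        if i + 1 < len(tasks):
            state["Next"] = tasks[i + 1][0]  # chain to the next registered task
        else:
            state["End"] = True  # last task terminates the state machine
        states[name] = state
    return {"StartAt": tasks[0][0], "States": states}


@task("extract-fn")
def extract(event):
    return {"rows": [1, 2, 3]}


@task("load-fn")
def load(event):
    return {"loaded": len(event["rows"])}


definition = to_asl(_TASKS)
print(json.dumps(definition, indent=2))
```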
|
| But the missing piece is that those other tools also
| explicitly give you a Python execution environment, so the
| function you're decorating is usually the 'task' function you
| want to run remotely.
|
| Whereas Step Functions doesn't provide compute itself, it
| mostly just gives you a way to execute AWS API calls. But the
| non control flow tasks in my Step Functions end up mostly
| being Lambda invoke steps to run my Python code.
|
| I'm currently authoring Step Functions via CDK. It is clunky
| AF.
|
| What it needs is some moderately opinionated layer on top.
|
| Someone at AWS did have a bit of an attempt here:
| https://aws-step-functions-data-science-
| sdk.readthedocs.io/e... but I'd really like to see something
| that goes further and smooths away a lot of the finickety
| JSON input arg/response wrangling. Also the local testing
| story (for Step Functions generally) is pretty meh.
| vtuulos wrote:
| If you are ok with executing your SFN steps on AWS Batch,
| Metaflow should do the job well. It's pretty inhuman to
| interact with SFN directly.
|
|       One feature that's in our roadmap is the ability to define a
|       DAG fully programmatically, maybe through configs, so you
| will be able to have a custom representation -> SFN JSON,
| just using Metaflow as a compiler
| ShamblingMound wrote:
| Have been looking for an orchestrator for AI workflows including
| agentic workflows and this seemed to be the most promising (open
| source, free, can self-host, and supports dynamic workflows).
|
| But have not seen anyone talk about it in that context. What do
| people use for AI workflow orchestration (aside from langchain)?
| vtuulos wrote:
| Stay tuned! We have some cool new features coming soon to
| support agentic workloads (teaser:
| https://github.com/Netflix/metaflow/pull/2473)
|
| If you are curious, join the Metaflow Slack at
| http://slack.outerbounds.co and start a thread on #ask-metaflow
| awgl wrote:
| I've used Metaflow for the past 4 years or so on different ML
| teams. It's really great!
|
| Straightforward for data/ML scientists to pick up, familiar
| python class API for defining DAGs, and simplifies scaling out
| parallel jobs on AWS Batch (or k8s). The UI is pretty nice. Been
| happy to see the active development on it too.
|
| Currently using it at our small biotech startup to run thousands
| of protein engineering computations (including models like
| RFDiffusion, ProteinMPNN, boltz, AlphaFold, ESM, etc.).
|
| Data engineering focused DAG tools like Airflow are awkward for
| doing these kinds of ML computations, where we don't need the
| complexity of schedules, etc. Metaflow, imho, is also a step up
| from orchestration tools that were born out of bioinformatics
| groups, like Snakemake or Nextflow.
|
| Just a satisfied customer of Metaflow here. thx
| Bukhmanizer wrote:
| If you've tried, has it been clunky to run non-python based
| workflows? I.e if you want to run bedtools or diamond without
| having to run a bunch of subprocess.run commands?
| awgl wrote:
| Right, for most of our workflows, we stay in python land,
| which is great and seamless with Metaflow being in python.
| But yes, there are occasions that we have to make a system
| call to run an old R script or even a compiled C++ executable
| :shrug: (Metaflow does have some native R support tho) I have
| not had to use the specific tools you called out, bedtools or
| diamond.
|
|     Most of the time this is not a blocking problem since each step
| in a flow is mapped to a Docker image and/or your choice of
| EC2 instance (e.g. one step on a GPU, another on a memory
| optimized instance). You can have one step use an image with
| all of your python-based ML stuff, and another step have a
|     different image with compiled executables that are triggered
| by a system call. If needed, outputs from such a system call
| would then need to be persisted in a database/S3 or read back
| into the python flow for persistence. So, it is not as
| seamless as a flow in all python, but it can work "good
| enough".
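|
|     A rough sketch of the system-call pattern described above, using
|     only the standard library. run_tool is a hypothetical helper, and
|     the child Python process is just a stand-in for an R script or a
|     compiled executable; inside a flow, the returned text would be
|     what gets read back into Python for persistence.

```python
import subprocess
import sys


def run_tool(cmd, stdin_text=""):
    """Run an external executable and return its stdout as text.

    check=True raises CalledProcessError on a non-zero exit code, so a
    failed tool fails the calling step instead of passing bad data along.
    """
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in for a compiled tool or R script: a child Python process that
# upper-cases whatever it reads on stdin.
fake_tool = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

output = run_tool(fake_tool, "acgt\n")
print(output)
```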
| LaserToy wrote:
| Cloudkitchens use them as well:
| https://techblog.cloudkitchens.com/p/ml-infrastructure-doesn...
|
| They call it a DREAM stack (Daft, Ray Engine or Ray and Poetry,
| Argo and Metaflow)
| vibecodemaster wrote:
|   There are actually a lot of companies using Metaflow, big and
| small: https://outerbounds.com/stories
| apwell23 wrote:
| Netflix used to release so much good open source software a decade
| ago. Now it seems to have fallen out of developer mindshare.
| Seems like the odd one out in FAANG in terms of tech and AI.
___________________________________________________________________
(page generated 2025-07-17 23:01 UTC)