[HN Gopher] Maestro: Netflix's Workflow Orchestrator
___________________________________________________________________
Maestro: Netflix's Workflow Orchestrator
Author : vquemener
Score : 135 points
Date : 2024-07-22 18:20 UTC (4 hours ago)
(HTM) web link (netflixtechblog.com)
(TXT) w3m dump (netflixtechblog.com)
| halamadrid wrote:
| Very nice, Netflix has a reputation for making great OSS products.
| I wonder where this stands relative to Conductor.
| opiniateddev wrote:
| Maestro is a domain specific implementation for ML and data
| pipelines that uses Conductor as its core
|
| https://netflixtechblog.com/orchestrating-data-ml-workflows-...
|
| https://github.com/Netflix/maestro/blob/main/maestro-engine/...
| iamsanteri wrote:
| So will this serve as a stand-in replacement for something like
| Airflow?
| makestuff wrote:
| Yeah, also curious if this is meant as a replacement for
| Airflow.
| pantsforbirds wrote:
| This is a really great-looking project. I know I've considered
| building (a probably worse) version of exactly this on almost
| every mixed ML + Data Engineering project I've ever worked on.
|
| I'm looking forward to testing it out.
| oneplane wrote:
| Looks a bit like Argo Workflows combined with Argo Events. Makes
| sense to have so many projects and products converge around the
| same endstate.
| indiv0 wrote:
| Is this meaningfully different from Conductor (which they
| archived a while back)? Browsing through the code I see quite a
| few similarities. Plus the use of JSON as the workflow definition
| language.
| opiniateddev wrote:
| Conductor was moved here: https://github.com/conductor-oss/conductor
| Maestro uses Conductor as its core.
|
| https://github.com/Netflix/maestro/blob/main/maestro-engine/...
|
| https://netflixtechblog.com/orchestrating-data-ml-workflows-...
| Sparkyte wrote:
| What's the difference between this and enqueuing work into a queue,
| then waiting for a job to pick it up at a scheduled time? I'm not
| saying build a Kafka cluster to serve this, but most cloud providers
| have queuing tools.
| sjansen wrote:
| Putting work in a queue is only the start. Most organizations
| start there and gradually write ad hoc logic as they discover
| problems like dependencies, retries, & scheduling.
|
| Dependencies: what can be done in parallel and what must be
| done in sequence? For example, three tasks get pushed in the
| queue and only after all three finish a fourth task must be
| run.
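|
| As a minimal sketch of that fan-in (three tasks in parallel, then
| a fourth), assuming the tasks are plain Python callables running
| in one process. `run_fan_in` and the task names are hypothetical,
| and a queue-backed system would have to track completion in
| durable storage instead:
|
```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_fan_in(upstream_tasks, downstream_task):
    """Run upstream tasks in parallel, then hand their results to one downstream task."""
    with ThreadPoolExecutor(max_workers=len(upstream_tasks)) as pool:
        futures = [pool.submit(task) for task in upstream_tasks]
        wait(futures)  # block until every upstream task has finished
        # .result() re-raises any upstream exception, so the downstream
        # task never runs on partial input.
        results = [future.result() for future in futures]
    return downstream_task(results)


# Hypothetical tasks standing in for real jobs.
def extract_a():
    return 1


def extract_b():
    return 2


def extract_c():
    return 3


total = run_fan_in([extract_a, extract_b, extract_c], sum)  # total == 6
```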
|
| Retries: The concept is simple. The details are killer. For
| example, if a task fails, how long should the delay between
| retries be? Too short and you create a retry storm. Forget to
| add some jitter and you get thundering herds all retrying at
| the same time.
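|
| A rough sketch of one common answer, exponential backoff with
| "full jitter" (each delay is drawn uniformly between zero and an
| exponentially growing ceiling). The function names, the 1s base,
| and the 60s cap are assumptions chosen for illustration, not
| anything taken from Maestro:
|
```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Delay before retry number `attempt` (0-based): exponential ceiling, full jitter."""
    ceiling = min(cap, base * (2 ** attempt))
    # Drawing uniformly from [0, ceiling] spreads retries out so that
    # many failed clients do not all come back at the same instant.
    return rng.uniform(0.0, ceiling)


def retry(task, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call `task` until it succeeds or `max_attempts` calls have failed."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

| Passing `sleep` as a parameter is only there so the loop can be
| tested without actually waiting.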
|
| Scheduling: Because cron is good enough, until it isn't.
|
| A good workflow solution provides battle tested versions of all
| of the above. Better yet, a great workflow solution makes it
| easier to keep business logic separate from plumbing so that
| it's easier to reason about and test.
| shawabawa3 wrote:
| workflows typically involve chains of jobs with state
| transitions, waits, triggers, error handling etc
|
| a lot more than just e.g. celery jobs
| nijave wrote:
| A workflow manager implements a Choreography based saga pattern
| https://microservices.io/patterns/data/saga.html
| dboreham wrote:
| Interesting. My team recently built a thing for managing long
| running, multi-machine, restartable, cascading batch jobs in an
| unrelated vehicle. Had no idea it was a category.
| meliora245 wrote:
| Why would one consider this over something more established such
| as Temporal? Also, I see Maestro is written in Java vs. Temporal's
| Go.
| iamspoilt wrote:
| That's also my question.
| robryan wrote:
| Netflix also uses Temporal: https://temporal.io/in-use/netflix
| tiffanyh wrote:
| Is Temporal still alive?
|
| (website doesn't resolve for me)
|
| EDIT: I found the GitHub page
|
| https://github.com/temporalio/temporal
| sjansen wrote:
| The site loads fine for me.
|
| See also: https://downforeveryoneorjustme.com/temporal.io
| troebr wrote:
| Didn't they rewrite some of Temporal's core in Rust?
| sjansen wrote:
| They (re)wrote most of the client SDKs on a Rust core, but
| the Temporal server is still written in Go.
| aimazon wrote:
| isn't Maestro an alternative to Airflow, not Temporal? Temporal
| isn't a workflow orchestrator. There's some overlap on the
| internals but they're different designs for different use
| cases.
| gtrubetskoy wrote:
| The name Maestro has already been used for a workflow
| orchestrator which I worked on back in 2016. That maestro is SQL-
| centric and infers dependencies automatically by simply examining
| the SQL. It's written in Go and is BigQuery-specific (but could
| be easily adjusted to use any SQL-based system).
|
| https://github.com/voxmedia/maestro/
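|
| To illustrate the general idea of inferring dependencies from SQL
| (this is a naive regex sketch, not how the linked project does
| it): a job depends on whichever job writes a table it reads. The
| job names, table names, and `infer_dependencies` are all
| hypothetical, and real SQL needs a proper parser:
|
```python
import re

# Naive patterns: they ignore comments, CTEs, quoting, and forms such as
# "CREATE TABLE IF NOT EXISTS".
_READS = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
_WRITES = re.compile(
    r"\b(?:INSERT\s+INTO|CREATE\s+(?:OR\s+REPLACE\s+)?TABLE)\s+([A-Za-z_][\w.]*)",
    re.IGNORECASE,
)


def infer_dependencies(jobs):
    """Map each job name to the set of jobs that write a table it reads.

    `jobs` maps a job name to its SQL text.
    """
    writers = {}
    for name, sql in jobs.items():
        for table in _WRITES.findall(sql):
            writers[table.lower()] = name
    deps = {}
    for name, sql in jobs.items():
        reads = {table.lower() for table in _READS.findall(sql)}
        deps[name] = {
            writers[table]
            for table in reads
            if table in writers and writers[table] != name
        }
    return deps


# Hypothetical jobs for illustration.
JOBS = {
    "load_orders": "CREATE TABLE staging.orders AS SELECT * FROM raw.orders",
    "load_users": "CREATE TABLE staging.users AS SELECT * FROM raw.users",
    "report": (
        "INSERT INTO mart.report SELECT u.id, o.total "
        "FROM staging.users u JOIN staging.orders o ON o.user_id = u.id"
    ),
}
DEPS = infer_dependencies(JOBS)  # "report" depends on both load jobs
```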
| stepanhruda wrote:
| With all due respect, there are so many projects. They don't
| care about clashing with a repo that has 12 stars and 14
| commits.
| nijave wrote:
| Worked at a bank that named their container "cloud" platform
| GCP and it was in no way related to Google _facepalm_
| stavros wrote:
| Well, if you're so unimaginative as to call your cloud
| platform "<companyname> cloud platform", it's not the fault
| of the second company whose name also starts with a G.
| nijave wrote:
| Worse, the G was Gaia (ironically the personification of
| Earth in Greek mythology). They used "Gaia" as a name for
| all their internal cloud platforms
| tiffanyh wrote:
| Don't see many Java projects being posted on HN.
| xyst wrote:
| We only upvote Go or Rust projects here ;)
| jekude wrote:
| Seems like they re-engineered Temporal: https://temporal.io/
| troebr wrote:
| They did use Temporal at Netflix, they gave a couple
| presentations 2 years ago. I think this is very much not-
| Temporal because it relies on a DSL instead of workflow as
| code.
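|
| A toy contrast between the two styles, under loud assumptions:
| the JSON format and step types below are invented for
| illustration and are NOT Maestro's actual schema, and
| `run_as_code` is plain Python, not the Temporal SDK. In the DSL
| style an engine interprets a data document; in the code style
| the workflow is an ordinary function:
|
```python
import json

# Hypothetical definition format; not Maestro's real DSL.
WORKFLOW_JSON = """
{
  "id": "demo",
  "steps": [
    {"id": "extract", "type": "emit", "params": {"value": [3, 1, 2]}},
    {"id": "order", "type": "sort", "params": {"input": "extract"}}
  ]
}
"""

# Registry of step types the toy engine knows how to run.
STEP_TYPES = {
    "emit": lambda params, outputs: params["value"],
    "sort": lambda params, outputs: sorted(outputs[params["input"]]),
}


def run_definition(definition_json):
    """DSL style: the engine, not user code, owns the control flow."""
    definition = json.loads(definition_json)
    outputs = {}
    for step in definition["steps"]:
        handler = STEP_TYPES[step["type"]]
        outputs[step["id"]] = handler(step["params"], outputs)
    return outputs


def run_as_code():
    """Code style: the same two steps written directly as a function."""
    extracted = [3, 1, 2]
    return {"extract": extracted, "order": sorted(extracted)}
```

| The definition can be validated, stored, and versioned as data,
| while the function can use arbitrary language features; that is
| roughly the tradeoff being discussed here.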
|
| I don't know if it's a scale-thing, I'm not a workflow expert
| but this seems more in line with the map-reduce of yore, as in
| you get some big fat steps and you coordinate them, although
| you could have coarse-grained activities in Temporal workflows.
|
| I'd be curious to see what the tradeoffs are between the two
| and if they still have usages for Temporal. Maybe Maestro is
| better for less technical people? Latency? Scale?
| hintymad wrote:
| I wonder how many iterations we will need before engineers are
| happy with a workflow solution. Netflix had multiple solutions
| before Maestro, such as metaflow. Uber built multiple solutions
| too. Amazon had at least a dozen internal workflow engines. It's
| quite curious why engineers are so keen on building their own
| workflow engines.
|
| Update: I just find it really interesting that many individuals
| in many companies like to build workflow engines. This is not a
| comment deriding anyone or Netflix in particular. To me, such an
| observation is worth some friendly chitchat.
| sgloutnikov wrote:
| Naming things, cache invalidation, and workflow engines? :)
|
| https://github.com/meirwah/awesome-workflow-engines
| dinobones wrote:
| We rolled our own workflow engine and it almost crashed one of
| our unrelated projects for having so many bugs and being so
| inflexible.
|
| I'm starting to think workflow engines are somewhat of a design
| smell.
|
| It's enticing to think you can build this reusable thing once
| and use it for a ton of different workflows, but besides
| requiring more than one asynchronous step, these workflows have
| almost nothing in common.
|
| Different data, different APIs, different feedback required
| from users or other systems to continue.
| ryanianian wrote:
| > workflow engines are somewhat of a design smell
|
| Probably so, but the real design smell seems to be thinking
| of a workflow engine as a panacea for sustainable business
| process automation.
|
| You have to really understand the business flow before you
| automate it. You have to continuously update your
| understanding of it as it changes. You have to refactor it
| into sub-flows or bigger/smaller units of work. You have to
| have tests, tracer-bullets, and well-defined user-stories
| that the flows represent.
|
| Else your business flow automation accumulates process debt.
| Just as much as a full-code-based solution accumulates
| technical debt.
|
| And, just like technical debt, it's much easier (or at least
| more interesting) to propose a rewrite or framework change
| than it is to propose an investment in refactoring, testing,
| and gradual migrations.
| savin-goyal wrote:
| Metaflow sits on top of Maestro, and neither replaces the other
|
| > ...Users can use Metaflow library to create workflows in
| Maestro to execute DAGs consisting of arbitrary Python code.
| from https://netflixtechblog.com/orchestrating-data-ml-workflows-...
|
| The orchestration section in this article
| (https://netflixtechblog.com/supporting-diverse-ml-systems-at...)
| goes into detail on how Metaflow interplays with Maestro (and
| Airflow, Argo Workflows & Step Functions)
| dekhn wrote:
| I wrote my own because I wanted to learn about DAG and toposort
| and had some ideas about what nodes and edges in the workflow
| meant (IE, does data flow over edges? Or do the edges just
| represent the sequence in which things run? Is a node a bundle
| of code, does it run continuously, or run then exit?). I almost
| ended up with reflow, which is a functional-programming
| approach based on Python, similar to nextflow, but I found the
| whole functional approach to be extremely challenging to reason
| about and debug.
|
| Often times what happens is the workflow engine is tailored to
| a specific problem and then other teams discover the engine and
| want to use it for their projects, but often need some
| additional feature, which sometimes completely upends the
| mental model of the engine itself.
| nijave wrote:
| These things tend to be fairly complex and require lots of
| integration with various services to get working. I think it's
| a little more organic to start building something simple and
| end up progressively adding more than implementing one from
| scratch (unless there are people around with experience)
| ilrwbwrkhv wrote:
| It's because Netflix pretends to be a tech company to get the
| high market cap.
|
| So they hire tons of engineers who have nothing to do but
| rearchitect the mess their microservices have created.
|
| Then there are others who create observability and test
| harnesses for all of that.
|
| When Pornhub and other porn sites can deliver orders of
| magnitude more data across the world with much simpler systems,
| you know it's all bullshit.
| thfuran wrote:
| >When Pornhub and other porn sites can deliver orders of
| magnitude more data across the world with much simpler
| systems, you know it's all bullshit
|
| When is that, exactly?
| https://www.statista.com/chart/15692/distribution-of-global-...
| exe34 wrote:
| isn't it like 30%?
| ATMLOTTOBEER wrote:
| "Other" in your diagram is mostly porn
| rty32 wrote:
| What is the methodology of the report?
|
| Just one of the questions I have regarding this -- China
| has nearly 1.4 billion people, and barely any of them use
| any of the services here. Instead, they have their own
| video platforms. And you tell me that none of those
| platforms use at least the same amount of traffic as Prime
| Video? I doubt it.
| alfalfasprout wrote:
| The issue is that "workflow orchestration" is a broad problem
| space. Companies need to address a lot of disparate issues and
| so any solution ends up being a giant product w/ a lot of
| associated functionality and heavily opinionated as it grows
| into a big monolith. This is why almost universally folks are
| never happy.
|
| In reality there are five main concerns:
|
| 1. Resource scheduling-- "I have a job or collection of jobs to
|    run... allocate them to the machines I have"
| 2. Dependency solving-- If my jobs have dependencies on each
|    other, perform the topological sort so I can dispatch things
|    to my resource scheduler
| 3. API/DSL for creating jobs and workflows. I want to define a
|    DAG... sometimes static, sometimes on the fly.
| 4. Cron-like functionality. I want to be able to run things on a
|    schedule or ad-hoc.
| 5. Domain awareness-- If doing ETL I want my DAGs to be data
|    aware... if doing ML/AI workflows then I want to be able to
|    surface info about what I'm actually doing with them
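|
| The dependency-solving concern above can be sketched with
| Python's standard-library `graphlib` (3.9+). This assumes the
| workflow is given as a mapping from each job to the jobs it
| depends on; `dispatch_order` and the job names are hypothetical,
| and handing each batch to a resource scheduler is left out:
|
```python
from graphlib import TopologicalSorter


def dispatch_order(dependencies):
    """Return batches of jobs; each job's dependencies all sit in earlier batches."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        # Everything returned by get_ready() has no unfinished
        # dependencies, so one batch could be dispatched in parallel.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical workflow: two loads feed a report, which feeds an email.
BATCHES = dispatch_order({
    "report": {"load_users", "load_orders"},
    "email": {"report"},
})
# BATCHES == [["load_orders", "load_users"], ["report"], ["email"]]
```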
|
| No one solution does all these things cleanly. So companies end
| up building or hacking around off the shelf stuff to deal with
| the downsides of existing solutions. Hence it's a perpetual
| cycle of everyone being unhappy.
|
| I don't think that you can just spin up a startup to deliver
| this as a "solution". This needs to be solved with an open
| source ecosystem of good pluggable modular components.
| pm90 wrote:
| It's likely because we haven't yet found a workflow
| engine/orchestrator that's capable of handling diverse tasks
| while still being easy to understand and operate.
|
| It's really easy to build a custom workflow engine and optimize
| it for specific use cases. I think we haven't yet seen a
| convergence simply because this tool hasn't yet been built.
|
| Consider the recent rise of tools that quickly dominated their
| fields: Terraform (IaC), Kubernetes (distributed compute). Both
| systems are hella complex, but they solve hard problems.
| Generic workflow engines are complex to understand and
| difficult to operate and offer a middling experience so many
| folks don't even bother.
| skywhopper wrote:
| Advice: don't rely on any tool open-sourced by Netflix. They have
| a long history of dropping support for things after they've
| announced them. Someone got a checkmark on their promotion packet
| by getting this blog post and code sharing out the door, but
| don't build your business on a solution like this.
| slt2021 wrote:
| I used to be impressed with these corporate techblogs and their
| internal proprietary systems, but not so much anymore. Because
| code is a liability.
|
| I would rather use off-the-shelf open source stuff with long
| history of maintenance and improvement, rather than reinvent the
| cron/celery/airflow/whatever, because code is a liability.
| Somebody needs to maintain it, fix bugs, add new features. Unless
| I get +1 grade promotion and salary/rsu bump, ofc.
|
| People need to realize that code is a liability, anything that is
| not the business critical stuff that earns/makes $$$ for the
| company is a distraction and resource sink.
| ripped_britches wrote:
| 100%. These systems are very rarely built as robustly as those
| from external folks who earn a profit on building robustness. The
| best example, of course, is Stripe. But I see this in everything
| from visual snapshot testing tools to custom CI workflows. The
| good thing is you can always rely on competitive market
| dynamics to price the off the shelf solution down to a
| reasonable margin above maintenance costs.
| jefurii wrote:
| This sounds like the beginning of a sales pitch.
| makeset wrote:
| > anything that is not the business critical stuff
|
| That's an important qualifier. For skilled teams in
| performance-critical domains, the inflection point where any
| _outside_ code becomes a low-quality/low-control liability is not
| that far.
| YawningAngel wrote:
| Off-the-shelf open source stuff is often the product of big
| companies open sourcing internal tools though. Airflow, which
| you name check, is a great example of this. Temporal is another
| example in the space. _Someone_ has to be dumb enough to build
| new stuff
| bluepizza wrote:
| > People need to realize that code is a liability
|
| This is an extreme point of view, that is tightly connected to
| the MBA-driven min-maxing of everything under the sun.
|
| I am glad that there are folks who aren't afraid to code new
| systems and champion new ideas. Even in the corporate sense,
| mediocre risk averse solutions will only take you so far. The
| most profitable companies tend to be quite daring in their
| tech.
|
| Code is not a liability. Code is what makes a company move its
| gears.
| delecti wrote:
| Code being a liability is not a contradiction with code being
| what makes a company move its gears. The trucks of a delivery
| service are a liability (requiring maintenance, depreciation
| accounting, fuel), but are also the only thing that lets the
| company deliver. A delivery company should own as few trucks
| as necessary, and no fewer. Any company should
| publish/run/maintain as little code as necessary, and no
| less.
| alfalfasprout wrote:
| I very much disagree with this take-- and the more I've
| experienced throughout my career the more I'm sure of it.
|
| Companies spend an IMMENSE amount of time and effort adapting
| sometimes subpar off the shelf solutions to fit their infra and
| pay an ongoing tax w/ increasing tech debt trying to support
| them. Often something bespoke and smaller + more tailored would
| unlock significantly more productivity _if_ the investment is
| made consciously.
|
| Any code that is written has both assets and liabilities. But
| to claim it is a distraction and resource sink is a very, very
| bad take. Every decision to build something in-house needs to
| be done thoughtfully and deliberately.
| bhawks wrote:
| > with long history of maintenance and improvement,
|
| That is a huge load bearing statement.
|
| Do you plan on any contributions back to the community
| yourself?
|
| Build vs. buy is always an important conversation but claiming
| that the 'buy'-side path has perfectly 0 maintenance and
| reliability costs reeks of naivety.
| bluepizza wrote:
| Exactly. Code is a liability, therefore I will use code from
| other people? How is that protecting me from any liabilities?
|
| And thank you to the companies sponsoring the open source
| projects, who didn't have such a narrow view, and decided to
| invest in an open product, instead of keeping it closed or
| buying it proprietary due to the liability.
| why-el wrote:
| I am confused by this comment:
|
| > open source stuff with long history of maintenance and
| improvement
|
| Improvement and maintenance are contingent on usage, and having
| been used at Netflix, this project is in a better place to have
| already faced whatever bug you are worried about (and let's be
| real, 99% of applications won't ever get the luck to exercise
| code paths sophisticated enough to find bugs Netflix has not
| found already).
|
| You might be unnecessarily projecting here. You don't have
| evidence to support that open sourcing this might have been for
| any other reason than it is simply good for the community to
| have.
| bjourne wrote:
| What is a workflow in this context?
| skissane wrote:
| I'm a bit confused about what is going on here: This project
| appears to use Netflix/conductor [0]. But you go to that repo,
| you see it has been archived, with a message saying it is
| replaced by Netflix's internal non-OSS version, and by
| unmentioned community forks - by which I assume they mean Orkes
| Conductor [1]. But this isn't using Orkes Conductor, it looks
| like it is using the discontinued Netflix version
| `com.netflix.conductor:conductor-core:2.31.5` [2] - and an
| outdated version of it too.
|
| [0] https://github.com/Netflix/conductor
|
| [1] https://github.com/conductor-oss/conductor
|
| [2]
| https://github.com/Netflix/maestro/blob/e8bee3f1625d3f31d84d...
___________________________________________________________________
(page generated 2024-07-22 23:02 UTC)