[HN Gopher] Maestro: Netflix's Workflow Orchestrator
       ___________________________________________________________________
        
       Maestro: Netflix's Workflow Orchestrator
        
       Author : vquemener
       Score  : 135 points
       Date   : 2024-07-22 18:20 UTC (4 hours ago)
        
 (HTM) web link (netflixtechblog.com)
 (TXT) w3m dump (netflixtechblog.com)
        
       | halamadrid wrote:
       | Very nice, Netflix has a reputation of making great OSS products.
       | I wonder where this stands relative to Conductor.
        
         | opiniateddev wrote:
         | Maestro is a domain specific implementation for ML and data
         | pipelines that uses Conductor as its core
         | 
         | https://netflixtechblog.com/orchestrating-data-ml-workflows-...
         | 
         | https://github.com/Netflix/maestro/blob/main/maestro-engine/...
        
       | iamsanteri wrote:
       | So will this serve as a drop-in replacement for something like
       | Airflow?
        
         | makestuff wrote:
         | Yeah, also curious if this is meant as a replacement for
         | Airflow.
        
       | pantsforbirds wrote:
       | This is a really great-looking project. I know I've considered
       | building (a probably worse) version of exactly this on almost
       | every mixed ML + Data Engineering project I've ever worked on.
       | 
       | I'm looking forward to testing it out.
        
       | oneplane wrote:
       | Looks a bit like Argo Workflows combined with Argo Events. Makes
       | sense to have so many projects and products converge around the
       | same endstate.
        
       | indiv0 wrote:
       | Is this meaningfully different from Conductor (which they
       | archived a while back)? Browsing through the code I see quite a
       | few similarities. Plus the use of JSON as the workflow definition
       | language.
        
         | opiniateddev wrote:
         | Conductor was moved here:
         | https://github.com/conductor-oss/conductor
         | Maestro uses Conductor as its core.
         | 
         | https://github.com/Netflix/maestro/blob/main/maestro-engine/...
         | 
         | https://netflixtechblog.com/orchestrating-data-ml-workflows-...
        
       | Sparkyte wrote:
       | What's the difference between this and enqueueing work into a
       | queue, then waiting for a job to pick it up at a scheduled time?
       | I'm not saying build a Kafka cluster to serve this, but most
       | cloud providers have queuing tools.
        
         | sjansen wrote:
         | Putting work in a queue is only the start. Most organizations
         | start there and gradually write ad hoc logic as they discover
         | problems like dependencies, retries, & scheduling.
         | 
         | Dependencies: what can be done in parallel and what must be
         | done in sequence? For example, three tasks get pushed in the
         | queue and only after all three finish a fourth task must be
         | run.
         | 
         | Retries: The concept is simple. The details are killer. For
         | example, if a task fails, how long should the delay between
         | retries be? Too short and you create a retry storm. Forget to
         | add some jitter and you get thundering herds all retrying at
         | the same time.
         | 
         | Scheduling: Because cron is good enough, until it isn't.
         | 
         | A good workflow solution provides battle tested versions of all
         | of the above. Better yet, a great workflow solution makes it
         | easier to keep business logic separate from plumbing so that
         | it's easier to reason about and test.
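
The retry concern described above can be sketched as exponential backoff with "full jitter". This is a minimal illustration, not code from Maestro or any other engine: the names `jittered_delay` and `run_with_retries` and the default `base`/`cap` values are assumptions made for the example.

```python
import random
import time


def jittered_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized delay in seconds for the given retry attempt."""
    # The ceiling doubles with each attempt but never exceeds `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so tasks that failed
    # together do not all retry at the same instant.
    return rng.uniform(0, ceiling)


def run_with_retries(task, max_retries=5, base=1.0, cap=60.0,
                     sleep=time.sleep, rng=random):
    """Call task(); on failure wait a jittered delay and try again."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(jittered_delay(attempt, base, cap, rng))
```

Passing `sleep` and `rng` as parameters keeps the timing logic testable without real waiting, which is one small instance of keeping business logic separate from plumbing.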
        
         | shawabawa3 wrote:
         | workflows typically involve chains of jobs with state
         | transitions, waits, triggers, error handling etc
         | 
         | a lot more than just e.g. celery jobs
        
         | nijave wrote:
         | A workflow manager implements a Choreography based saga pattern
         | https://microservices.io/patterns/data/saga.html
        
       | dboreham wrote:
       | Interesting. My team recently built a thing for managing long
       | running, multi-machine, restartable, cascading batch jobs in an
       | unrelated vehicle. Had no idea it was a category.
        
       | meliora245 wrote:
       | why would one consider this over something more established such
       | as Temporal, also I see Maestro is written in Java vs Temporal's
       | Go
        
         | iamspoilt wrote:
         | That's also my question.
        
         | robryan wrote:
         | Netflix also uses temporal: https://temporal.io/in-use/netflix
        
           | tiffanyh wrote:
           | Is Temporal still alive?
           | 
           | (website doesn't resolve for me)
           | 
           | EDIT: I found the GitHub page
           | 
           | https://github.com/temporalio/temporal
        
             | sjansen wrote:
             | The site loads fine for me.
             | 
             | See also: https://downforeveryoneorjustme.com/temporal.io
        
         | troebr wrote:
         | Didn't they rewrite some of Temporal's core in rust?
        
           | sjansen wrote:
           | They (re)wrote most of the client SDKs on a Rust core, but
           | the Temporal server is still written in Go.
        
         | aimazon wrote:
         | isn't Maestro an alternative to Airflow, not Temporal? Temporal
         | isn't a workflow orchestrator. There's some overlap on the
         | internals but they're different designs for different use
         | cases.
        
       | gtrubetskoy wrote:
       | The name Maestro has already been used for a workflow
       | orchestrator which I worked on back in 2016. That maestro is SQL-
       | centric and infers dependencies automatically by simply examining
       | the SQL. It's written in Go and is BigQuery-specific (but could
       | be easily adjusted to use any SQL-based system).
       | 
       | https://github.com/voxmedia/maestro/
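
To illustrate the general idea of inferring dependencies from SQL text, here is a deliberately naive sketch. It is not how voxmedia/maestro works internally: the regex, the `infer_dependencies` helper, and the table names are all hypothetical, and a real tool would need a proper SQL parser to handle subqueries, CTEs, comments, and quoted identifiers.

```python
import re

# Naive assumption: a table name is the identifier right after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def infer_dependencies(queries):
    """Map each output table to the other managed tables its SQL reads."""
    deps = {}
    for table, sql in queries.items():
        referenced = set(TABLE_REF.findall(sql))
        # Keep only tables produced by another query in the same set.
        deps[table] = sorted(
            t for t in referenced if t in queries and t != table
        )
    return deps


# Hypothetical queries, keyed by the table each one produces.
queries = {
    "daily_views": "SELECT day, COUNT(*) AS n FROM raw_events GROUP BY day",
    "daily_signups": "SELECT day, COUNT(*) AS n FROM raw_users GROUP BY day",
    "conversion": (
        "SELECT v.day, s.n / v.n AS rate "
        "FROM daily_views v JOIN daily_signups s ON v.day = s.day"
    ),
}

print(infer_dependencies(queries))
```

Here `conversion` ends up depending on `daily_signups` and `daily_views`, while the two source queries have no managed dependencies because `raw_events` and `raw_users` are not produced by any query in the set.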
        
         | stepanhruda wrote:
         | With all due respect, there are so many projects. They don't
         | care about clashing with a repo that has 12 stars and 14
         | commits.
        
           | nijave wrote:
           | Worked at a bank that named their container "cloud" platform
           | GCP and it was in no way related to Google _facepalm_
        
             | stavros wrote:
             | Well, if you're so unimaginative as to call your cloud
             | platform "<companyname> cloud platform", it's not the fault
             | of the second company whose name also starts with a G.
        
               | nijave wrote:
               | Worse, the G was Gaia (ironically the personification of
               | Earth in Greek mythology). They used "Gaia" as a name for
               | all their internal cloud platforms
        
       | tiffanyh wrote:
       | Don't see many Java projects being posted on HN.
        
         | xyst wrote:
         | We only upvote Go or Rust projects here ;)
        
       | jekude wrote:
       | Seems like they re-engineered Temporal: https://temporal.io/
        
         | troebr wrote:
         | They did use Temporal at Netflix, they gave a couple
         | presentations 2 years ago. I think this is very much not-
         | Temporal because it relies on a DSL instead of workflow as
         | code.
         | 
         | I don't know if it's a scale-thing, I'm not a workflow expert
         | but this seems more in line with the map-reduce of yore, as in
         | you get some big fat steps and you coordinate them, although
         | you could have coarse-grained activities in Temporal workflows.
         | 
         | I'd be curious to see what the tradeoffs are between the two
         | and if they still have usages for Temporal. Maybe Maestro is
         | better for less technical people? Latency? Scale?
        
       | hintymad wrote:
       | I wonder how many iterations we will need before engineers are
       | happy with a workflow solution. Netflix had multiple solutions
       | before Maestro, such as metaflow. Uber built multiple solutions
       | too. Amazon had at least a dozen internal workflow engines. It's
       | quite curious why engineers are so keen on building their own
       | workflow engines.
       | 
       | Update: I just find it really interesting that many individuals
       | in many companies like to build workflow engines. This is not a
       | deriding comment towards anyone or Netflix in particular. To me,
       | such observation is worth some friendly chitchat.
        
         | sgloutnikov wrote:
         | Naming things, cache invalidation, and workflow engines? :)
         | 
         | https://github.com/meirwah/awesome-workflow-engines
        
         | dinobones wrote:
         | We rolled our own workflow engine and it almost crashed one of
         | our unrelated projects for having so many bugs and being so
         | inflexible.
         | 
         | I'm starting to think workflow engines are somewhat of a design
         | smell.
         | 
         | It's enticing to think you can build this reusable thing once
         | and use it for a ton of different workflows, but besides
         | requiring more than one asynchronous step, these workflows have
         | almost nothing in common.
         | 
         | Different data, different APIs, different feedback required
         | from users or other systems to continue.
        
           | ryanianian wrote:
           | > workflow engines are somewhat of a design smell
           | 
           | Probably so, but the real design smell seems to be thinking
           | of a workflow engine as a panacea for sustainable business
           | process automation.
           | 
           | You have to really understand the business flow before you
           | automate it. You have to continuously update your
           | understanding of it as it changes. You have to refactor it
           | into sub-flows or bigger/smaller units of work. You have to
           | have tests, tracer-bullets, and well-defined user-stories
           | that the flows represent.
           | 
           | Else your business flow automation accumulates process debt.
           | Just as much as a full-code-based solution accumulates
           | technical debt.
           | 
           | And, just like technical debt, it's much easier (or at least
           | more interesting) to propose a rewrite or framework change
           | than it is to propose an investment in refactoring, testing,
           | and gradual migrations.
        
         | savin-goyal wrote:
         | Metaflow sits on top of Maestro, and neither replaces the other
         | 
         | > ...Users can use Metaflow library to create workflows in
         | > Maestro to execute DAGs consisting of arbitrary Python code.
         | 
         | from https://netflixtechblog.com/orchestrating-data-ml-workflows-...
         | 
         | The orchestration section in this article
         | (https://netflixtechblog.com/supporting-diverse-ml-systems-at...)
         | goes into detail on how Metaflow interplays with Maestro (and
         | Airflow, Argo Workflows & Step Functions)
        
         | dekhn wrote:
         | I wrote my own because I wanted to learn about DAG and toposort
         | and had some ideas about what nodes and edges in the workflow
         | meant (IE, does data flow over edges? Or do the edges just
         | represent the sequence in which things run? Is a node a bundle
         | of code, does it run continuously, or run then exit?). I almost
         | ended up with reflow, which is a functional-programming
         | approach based on python, similar to nextflow, but I found the
         | whole functional approach extremely challenging to reason
         | about and debug.
         | 
         | Often times what happens is the workflow engine is tailored to
         | a specific problem and then other teams discover the engine and
         | want to use it for their projects, but often need some
         | additional feature, sometimes which completely up-ends the
         | mental model of the engine itself.
        
         | nijave wrote:
         | These things tend to be fairly complex and require lots of
         | integration with various services to get working. I think it's
         | a little more organic to start building something simple and
         | end up progressively adding more than implementing one from
         | scratch (unless there are people around with experience)
        
         | ilrwbwrkhv wrote:
         | It's because Netflix pretends to be a tech company to get the
         | high market cap.
         | 
         | So they hire tons of engineers who have nothing to do but
         | rearchitect the mess their microservices have created.
         | 
         | Then there are others who create observability and test
         | harnesses for all of that.
         | 
         | When Pornhub and other porn sites can deliver orders of
         | magnitude more data across the world with much simpler systems,
         | you know it's all bullshit.
        
           | thfuran wrote:
           | >When Pornhub and other porn sites can deliver orders of
           | magnitude more data across the world with much simpler
           | systems, you know it's all bullshit
           | 
           | When is that, exactly?
           | https://www.statista.com/chart/15692/distribution-of-
           | global-...
        
             | exe34 wrote:
             | isn't it like 30%?
        
             | ATMLOTTOBEER wrote:
             | "Other" in your diagram is mostly porn
        
             | rty32 wrote:
             | What is the methodology of the report?
             | 
             | Just one of the questions I have regarding this -- China
             | has nearly 1.4 billion people, and barely any of them use
             | any of the services here. Instead, they have their own
             | video platforms. And you tell me that none of those
             | platforms use at least as much traffic as Prime Video? I
             | doubt it.
        
         | alfalfasprout wrote:
         | The issue is that "workflow orchestration" is a broad problem
         | space. Companies need to address a lot of disparate issues and
         | so any solution ends up being a giant product w/ a lot of
         | associated functionality and heavily opinionated as it grows
         | into a big monolith. This is why almost universally folks are
         | never happy.
         | 
         | In reality there are five main concerns:
         | 
         | 1. Resource scheduling-- "I have a job or collection of jobs to
         |    run... allocate them to the machines I have"
         | 2. Dependency solving-- If my jobs have dependencies on each
         |    other, perform the topological sort so I can dispatch things
         |    to my resource scheduler
         | 3. API/DSL for creating jobs and workflows. I want to define a
         |    DAG... sometimes static, sometimes on the fly.
         | 4. Cron-like functionality. I want to be able to run things on a
         |    schedule or ad-hoc.
         | 5. Domain awareness-- If doing ETL I want my DAGs to be data
         |    aware... if doing ML/AI workflows then I want to be able to
         |    surface info about what I'm actually doing with them
         | 
         | No one solution does all these things cleanly. So companies end
         | up building or hacking around off the shelf stuff to deal with
         | the downsides of existing solutions. Hence it's a perpetual
         | cycle of everyone being unhappy.
         | 
         | I don't think that you can just spin up a startup to deliver
         | this as a "solution". This needs to be solved with an open
         | source ecosystem of good pluggable modular components.
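
The dependency-solving concern above (topologically sort the jobs, then hand ready work to a scheduler) can be sketched with Python's standard-library `graphlib`, available since Python 3.9. The `execution_waves` helper and the job names are hypothetical, chosen only for illustration; this is not how Maestro or any particular engine implements it.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group jobs into waves; jobs within one wave could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        # get_ready() returns every job whose dependencies are all done.
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # a real scheduler would mark jobs as they finish
    return waves


# Hypothetical DAG: each key lists the jobs it depends on.
dag = {
    "extract_a": [],
    "extract_b": [],
    "extract_c": [],
    "report": ["extract_a", "extract_b", "extract_c"],
}

print(execution_waves(dag))
```

The three extract jobs land in the first wave and `report` in the second, which is the "run three tasks, then a fourth" shape that a plain queue cannot express on its own.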
        
         | pm90 wrote:
         | It's likely because we haven't yet found a workflow
         | engine/orchestrator that's capable of handling diverse tasks
         | while still being easy to understand and operate.
         | 
         | It's really easy to build a custom workflow engine and optimize
         | it for specific use cases. I think we haven't yet seen a
         | convergence simply because this tool hasn't yet been built.
         | 
         | Consider the recent rise of tools that quickly dominated their
         | fields: Terraform (IaC), Kubernetes (distributed compute). Both
         | systems are hella complex, but they solve hard problems.
         | Generic workflow engines are complex to understand and
         | difficult to operate and offer a middling experience so many
         | folks don't even bother.
        
       | skywhopper wrote:
       | Advice: don't rely on any tool open-sourced by Netflix. They have
       | a long history of dropping support for things after they've
       | announced them. Someone got a checkmark on their promotion packet
       | by getting this blog post and code sharing out the door, but
       | don't build your business on a solution like this.
        
       | slt2021 wrote:
       | I used to be impressed with these corporate techblogs and their
       | internal proprietary systems, but not so much anymore. Because
       | code is a liability.
       | 
       | I would rather use off-the-shelf open source stuff with long
       | history of maintenance and improvement, rather than reinvent the
       | cron/celery/airflow/whatever, because code is a liability.
       | Somebody needs to maintain it, fix bugs, add new features. Unless
       | I get +1 grade promotion and salary/rsu bump, ofc.
       | 
       | People need to realize that code is a liability, anything that is
       | not the business critical stuff that earns/makes $$$ for the
       | company is a distraction and resource sink.
        
         | ripped_britches wrote:
         | 100%. These systems are rarely built as robustly as those from
         | external folks who earn a profit on building robustness. Best
         | example of course being Stripe. But I see this in everything
         | from visual snapshot testing tools to custom CI workflows. The
         | good thing is you can always rely on competitive market
         | dynamics to price the off the shelf solution down to a
         | reasonable margin above maintenance costs.
        
         | jefurii wrote:
         | This sounds like the beginning of a sales pitch.
        
         | makeset wrote:
         | > anything that is not the business critical stuff
         | 
         | That's an important qualifier. For skilled teams in
         | performance-critical domains, the inflection point where any
         | _outside_ code becomes a low-quality/low-control liability is
         | not that far.
        
         | YawningAngel wrote:
         | Off-the-shelf open source stuff is often the product of big
         | companies open sourcing internal tools though. Airflow, which
         | you name check, is a great example of this. Temporal is another
         | example in the space. _Someone_ has to be dumb enough to build
         | new stuff
        
         | bluepizza wrote:
         | > People need to realize that code is a liability
         | 
         | This is an extreme point of view, that is tightly connected to
         | the MBA-driven min-maxing of everything under the sun.
         | 
         | I am glad that there are folks who aren't afraid to code new
         | systems and champion new ideas. Even in the corporate sense,
         | mediocre risk averse solutions will only take you so far. The
         | most profitable companies tend to be quite daring in their
         | tech.
         | 
         | Code is not a liability. Code is what makes a company move its
         | gears.
        
           | delecti wrote:
           | Code being a liability is not a contradiction with code being
           | what makes a company move its gears. The trucks of a delivery
           | service are a liability (requiring maintenance, depreciation
           | accounting, fuel), but are also the only thing that lets the
           | company deliver. A delivery company should own as few trucks
           | as necessary, and no fewer. Any company should
           | publish/run/maintain as little code as necessary, and no
           | less.
        
         | alfalfasprout wrote:
         | I very much disagree with this take-- and the more I've
         | experienced throughout my career the more I'm sure of it.
         | 
         | Companies spend an IMMENSE amount of time and effort adapting
         | sometimes subpar off the shelf solutions to fit their infra and
         | pay an ongoing tax w/ increasing tech debt trying to support
         | them. Often something bespoke and smaller + more tailored would
         | unlock significantly more productivity _if_ the investment is
         | made consciously.
         | 
         | Any code that is written has both assets and liabilities. But
         | to claim it is a distraction and resource sink is a very, very
         | bad take. Every decision to build something in-house needs to
         | be done thoughtfully and deliberately.
        
         | bhawks wrote:
         | > with long history of maintenance and improvement,
         | 
         | That is a huge load bearing statement.
         | 
         | Do you plan on any contributions back to the community
         | yourself?
         | 
         | Build vs. buy is always an important conversation but claiming
         | that the 'buy'-side path has perfectly 0 maintenance and
         | reliability costs reeks of naivety.
        
           | bluepizza wrote:
           | Exactly. Code is a liability, therefore I will use code from
           | other people? How is that protecting me from any liabilities?
           | 
           | And thank you the companies sponsoring the open source
           | projects, who didn't have such a narrow view, and decided to
           | invest in an open product, instead of keeping it closed or
           | buying it proprietary due to the liability.
        
         | why-el wrote:
         | I am confused by this comment:
         | 
         | > open source stuff with long history of maintenance and
         | improvement
         | 
         | Improvement and maintenance are contingent on usage, and having
         | been used at Netflix, this project is in a better place to have
         | already faced whatever bug you are worried about (and let's be
         | real, 99% of applications won't ever get the luck to exercise
         | code paths sophisticated enough to find bugs Netflix has not
         | found already).
         | 
         | You might be unnecessarily projecting here. You don't have
         | evidence to support that open sourcing this might have been for
         | any other reason than it is simply good for the community to
         | have.
        
       | bjourne wrote:
       | What is a workflow in this context?
        
       | skissane wrote:
       | I'm a bit confused about what is going on here: This project
       | appears to use Netflix/conductor [0]. But you go to that repo,
       | you see it has been archived, with a message saying it is
       | replaced by Netflix's internal non-OSS version, and by
       | unmentioned community forks - by which I assume they mean Orkes
       | Conductor [1]. But this isn't using Orkes Conductor, it looks
       | like it is using the discontinued Netflix version
       | `com.netflix.conductor:conductor-core:2.31.5` [2] - and an
       | outdated version of it too.
       | 
       | [0] https://github.com/Netflix/conductor
       | 
       | [1] https://github.com/conductor-oss/conductor
       | 
       | [2]
       | https://github.com/Netflix/maestro/blob/e8bee3f1625d3f31d84d...
        
       ___________________________________________________________________
       (page generated 2024-07-22 23:02 UTC)