[HN Gopher] Ask HN: What is the simplest data orchestration tool...
       ___________________________________________________________________
        
       Ask HN: What is the simplest data orchestration tool you've worked
       with?
        
       Along the lines of Airflow, Prefect, Dagster, Argo, etc. What
       produced the least WTF per minute?
        
       Author : chordol
       Score  : 31 points
       Date   : 2025-03-21 19:24 UTC (3 hours ago)
        
       | PaulHoule wrote:
       | Straightforward programs in languages like Java, Python, etc.
       | 
        | The tools you describe all have the endpoint "you can't get there
        | from here"; the only difference is whether it takes you 5 seconds,
        | 5 minutes, 5 days, 5 weeks or 5 months to learn that.
        
       | vitorbaptistaa wrote:
        | My experience spans:
       | 
       | * Luigi -- extensive usage (4y+)
       | 
       | * Makefiles -- (15y+)
       | 
       | * GitHub Actions -- (4y+)
       | 
       | * Airflow -- little usage (<6 months)
       | 
       | * Dagster -- very little, just trying it out
       | 
       | * Prefect -- just followed tutorial
       | 
        | Although it lacks a lot of the monitoring and advanced web UI
        | other platforms have (maybe because of that), Luigi is the
        | simplest to reason about IMHO.
       | 
       | For a new project that will require complex orchestrations, I'd
       | probably go with Dagster or Prefect nowadays. Dagster seems more
       | complex and more powerful with its data lineage functionality,
       | but I have very little experience with either tool.
       | 
       | If it's a simple project, a mix of Makefiles + GH Actions can
       | work well.
        
         | vector_spaces wrote:
         | Is there anything even more lightweight, where you don't have
         | to write your code any differently? For instance, say I have 10
         | jobs that don't depend on each other, all of them pretty small.
         | 
         | Dagster and even Luigi feel like overkill but I'd still like to
         | plug those into a unified interface where I can view previous
         | runs, mainly logs and exit codes. Being able to do some light
         | job configuration or add retries would be nice but not
          | required. For the moment I just use a logging handler that
          | writes to a database table, and that's fine.
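The database-backed logging handler described above can be done in pure Python with `sqlite3`; a minimal sketch (the table name, schema, and logger name are my assumptions, not a standard):

```python
import logging
import sqlite3
import time

class SQLiteHandler(logging.Handler):
    """Append each log record to a SQLite table: one row per event."""

    def __init__(self, db_path):
        super().__init__()
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS job_log ("
            "  ts REAL, job TEXT, level TEXT, message TEXT)"
        )

    def emit(self, record):
        # record.name doubles as the job name; one logger per job.
        self.conn.execute(
            "INSERT INTO job_log VALUES (?, ?, ?, ?)",
            (time.time(), record.name, record.levelname, record.getMessage()),
        )
        self.conn.commit()

# Usage: every job logs to the same table, queryable with plain SQL.
handler = SQLiteHandler(":memory:")  # use a file path in real use
logger = logging.getLogger("nightly_export")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("job finished with exit code 0")
```

Querying `job_log` for the latest row per job then gives the "previous runs, logs and exit codes" view without any orchestrator at all.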
        
           | disgruntledphd2 wrote:
           | I think that Airflow 2 implemented a decorator mode which you
           | can just use on functions.
           | 
            | Honestly, just use Airflow; it has its issues, but it sucks
            | in well-known and predictable ways.
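The decorator mode being referred to is Airflow's TaskFlow API (Airflow 2.x); a sketch, assuming a recent Airflow install (the function names and schedule here are made up for illustration):

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def simple_pipeline():
    @task
    def extract():
        return {"rows": 42}

    @task
    def load(payload):
        print(payload["rows"])

    # Passing the return value wires up the dependency automatically.
    load(extract())

simple_pipeline()
```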
        
           | cicdw wrote:
           | One of the goals of Prefect's SDK is to be minimally invasive
           | from a code-standpoint (in the simplest case you only need
           | two lines to convert a script to a `flow`). Our deployment
           | model also makes infrastructure job config a first-class
           | citizen so you might have a good time trying it out.
           | (disclosure: work at Prefect)
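In the simplest case the "two lines" are an import and a decorator; a sketch assuming Prefect 2.x (the function body is a stand-in for your existing script):

```python
from prefect import flow   # line 1

@flow                      # line 2: the script is now a tracked flow
def my_script():
    ...                    # existing logic, unchanged

my_script()
```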
        
       | recursive4 wrote:
        | Either Prefect or Dagster. FWIW, the Dagster team is actively
       | reducing the learning curve with each release.
        
       | rasmusab wrote:
       | Pure python scripts, maybe using the #%%-convention
       | (https://code.visualstudio.com/docs/python/jupyter-support-py...)
       | so you get the best of both notebooks and scripts, in a right-
       | sized instance/container/machine. And if you need to run jobs in
       | parallel, then orchestrate using make, like so:
       | https://www.sumsar.net/blog/makefile-recipe-python-data-pipe...
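The `#%%` convention is just a comment marker that editors such as VS Code treat as a notebook-cell boundary, while the file stays a plain runnable script; a tiny illustration (the computation itself is made up):

```python
# %% Load — editors run this block interactively; `python pipeline.py` runs it all
data = [1, 2, 3, 4]

# %% Transform
squared = [x * x for x in data]

# %% Report
total = sum(squared)
print(total)
```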
        
         | niwtsol wrote:
         | Yeah, I love this -- pure Python with cron or periodic tasks
         | (e.g., Django) works great. Celery task for parallelization,
         | and if you pipe logs/alerts into a Slack channel, you can
         | actually get really far without needing a "proper"
         | orchestration layer.
         | 
         | I recently took over an Airflow system from a former colleague,
         | and in our case, it's just overly complex for what's really a
         | pretty simple data flow.
        
       | myfakebadcode wrote:
        | I've been using Airflow for quite some time. Given the maturity
        | of our setup, and though I've tested other solutions, I don't
        | really see us changing things.
        
       | scary-size wrote:
       | We've migrated to Flyte. Mostly using the Java/Scala API which
       | can be a bit verbose. The official Python API is actually easy on
       | the eyes.
        
       | fforflo wrote:
       | Makefile with make2graph to visualize DAGs.
        
       | fmariluis wrote:
       | If you're inside AWS, have a fully containerized workflow and/or
        | can run some tasks in Lambda, Step Functions is probably OK? I
        | personally prefer Airflow, but I wouldn't say it's the 'simplest
        | data orchestration tool'.
        
       | saturn8601 wrote:
       | I used to work for an automation company that produced a product
       | called ActiveBatch. It was such an amazing tool for just drag and
       | drop automation. Its focus was on full fledged workflow
       | automation and not just data orchestration.
       | 
        | What I loved was its simplicity + its out-of-the-box features.
        | Setting it up just took a simple MS SQL DB + an installer. Bam,
        | you are up and running an absolute rock-solid scheduler (I've
        | seen million+ jobs running on it without it breaking a sweat).
        | Then you could install (or use it to deploy) execution agents on
        | all the servers you wanted as workers.
       | 
       | It also installed a robust Desktop GUI that had so many services
       | built in ready to go (anything from executing scripts all the way
       | to performing direct actions against countless products a company
       | would have or against various cloud services).
       | 
       | There were so many pre built actions where all you had to do was
       | input credentials and it would enumerate the appropriate
       | properties from that service automatically. Then you could
        | connect things together (i.e., pull something from the cloud,
        | process it on some other server, store it, pass it along to
        | another service, whatever you wanted).
       | 
        | Only problem was this is very much a B2B application, and their
        | sales team is really only interested in selling to enterprises,
        | not end users. I really wish we had something like this that
        | regular people could download.
       | 
        | Everything I've seen listed here requires extensive setup,
        | requires coding, or lacks a robust desktop GUI, offering instead
        | some half-baked web GUI that might require dropping back down to
        | scripts/coding. You could set up hundreds or thousands of
        | automated steps in ActiveBatch without writing a single line of
        | code. I miss that product.
        
       | rich_sasha wrote:
       | I wrote my own in half a day. Worked 24/7 for 3 years... then I
       | quit.
       | 
       | Seriously, took me much less time than setting up airflow. Even
       | had a webpage in the end, with all the tasks, a tree view,
       | downstream, upstream tasks (these were incremental improvements
       | beyond the initial half-day), CLI... The works.
       | 
       | I now know the points of fragility I didn't know before, but I'd
       | do it again.
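For scale, the core of such a homegrown orchestrator — named tasks, upstream dependencies, topological execution, recorded outcomes — fits in a few dozen lines of stdlib Python. This sketch is my own illustration, not rich_sasha's actual code:

```python
from graphlib import TopologicalSorter

class Pipeline:
    """Tiny orchestrator: register tasks with upstream deps, run in order."""

    def __init__(self):
        self.tasks = {}    # name -> (callable, upstream names)
        self.results = {}  # name -> "ok" / "failed: ..." / "skipped"

    def task(self, name, upstream=()):
        def register(func):
            self.tasks[name] = (func, tuple(upstream))
            return func
        return register

    def run(self):
        graph = {name: deps for name, (_, deps) in self.tasks.items()}
        for name in TopologicalSorter(graph).static_order():
            func, deps = self.tasks[name]
            if any(self.results.get(d) != "ok" for d in deps):
                self.results[name] = "skipped"  # an upstream task failed
                continue
            try:
                func()
                self.results[name] = "ok"
            except Exception as exc:
                self.results[name] = f"failed: {exc}"
        return self.results

pipe = Pipeline()

@pipe.task("extract")
def extract():
    print("extracting")

@pipe.task("transform", upstream=["extract"])
def transform():
    print("transforming")

@pipe.task("load", upstream=["transform"])
def load():
    print("loading")

status = pipe.run()
```

The web page, tree view and CLI mentioned above would be layers on top of the `results` dict; the fragile parts in practice tend to be retries, state persistence and concurrent runs, which this sketch deliberately omits.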
        
       | itfollowsthen wrote:
        | At my last startup I asked a friend to help me debug an Airflow
        | DAG. He just pip-installed Prefect and I've never really looked
        | back. At the time everything else felt too hard to figure out.
        
       | djsjajah wrote:
        | A few people have mentioned Dagster, and I took a look at it for
        | some machine learning things I was playing with, but then I found
        | dvc (data version control [1]) and I think it is fantastic. It
        | has more applications than just machine learning -- really
        | anything with data. If you have a bunch of shell scripts that
        | write to files to pass data around, then dvc might be a good
        | fit. It will do things like only rerun steps if it needs to.
        | Also, for totally non-data stuff, Prefect is great.
       | 
       | [1] https://dvc.org
        
       | rubenfiszel wrote:
        | You should give Windmill a try. It's more of a workflow engine
        | than a data orchestration tool, but it's intuitive and open-
        | source.
        
       | speedgoose wrote:
       | I like having containers running as CronJobs or Deployments in
        | Kubernetes, but Argo Workflows has been a pretty reliable add-on
        | to Kubernetes for the more advanced scenarios.
       | 
       | However, it's simple only if you are already familiar with
       | software containers and Kubernetes. But it's perhaps better to
       | learn than having to deal with dependency hell in Python or Java.
        
       | tdeck wrote:
       | The simplest for sure was using ActiveJob in Rails with Clockwork
       | for scheduling and Postgres for queueing things up.
        
       ___________________________________________________________________
       (page generated 2025-03-21 23:02 UTC)