[HN Gopher] Ask HN: What is the simplest data orchestration tool...
___________________________________________________________________
Ask HN: What is the simplest data orchestration tool you've worked
with?
Along the lines of Airflow, Prefect, Dagster, Argo, etc. What
produced the least WTF per minute?
Author : chordol
Score : 31 points
Date : 2025-03-21 19:24 UTC (3 hours ago)
| PaulHoule wrote:
| Straightforward programs in languages like Java, Python, etc.
|
| The tools you describe all have the endpoint "you can't get there
| from here"; the only difference is whether it takes you 5 seconds,
| 5 minutes, 5 days, 5 weeks, or 5 months to learn that.
| vitorbaptistaa wrote:
| My experience spans:
|
| * Luigi -- extensive usage (4y+)
|
| * Makefiles -- (15y+)
|
| * GitHub Actions -- (4y+)
|
| * Airflow -- little usage (<6 months)
|
| * Dagster -- very little, just trying it out
|
| * Prefect -- just followed tutorial
|
| Although it lacks a lot of the monitoring and advanced web UI
| other platforms have (maybe because of that), Luigi is the
| simplest to reason about IMHO.
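|
| To give a feel for it: a Luigi task is just a class with
| requires/output/run, and tasks whose output already exists are
| skipped. A rough sketch (file and task names are made up):
|
|     import luigi
|
|     class Extract(luigi.Task):
|         # Illustrative task: dump raw data to a local file.
|         def output(self):
|             return luigi.LocalTarget("data/raw.csv")
|
|         def run(self):
|             with self.output().open("w") as f:
|                 f.write("id,value\n1,42\n")
|
|     class Transform(luigi.Task):
|         # requires() is how Luigi builds the dependency graph;
|         # the task only reruns if its output is missing.
|         def requires(self):
|             return Extract()
|
|         def output(self):
|             return luigi.LocalTarget("data/clean.csv")
|
|         def run(self):
|             with self.input().open() as src:
|                 with self.output().open("w") as dst:
|                     dst.write(src.read().upper())
|
|     if __name__ == "__main__":
|         luigi.build([Transform()], local_scheduler=True)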
|
| For a new project that will require complex orchestrations, I'd
| probably go with Dagster or Prefect nowadays. Dagster seems more
| complex and more powerful with its data lineage functionality,
| but I have very little experience with either tool.
|
| If it's a simple project, a mix of Makefiles + GH Actions can
| work well.
| vector_spaces wrote:
| Is there anything even more lightweight, where you don't have
| to write your code any differently? For instance, say I have 10
| jobs that don't depend on each other, all of them pretty small.
|
| Dagster and even Luigi feel like overkill but I'd still like to
| plug those into a unified interface where I can view previous
| runs, mainly logs and exit codes. Being able to do some light
| job configuration or add retries would be nice but not
| required. For the moment I just use a logging handler that
| writes to a database table, and that's fine.
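|
| For the curious, that handler can be only a few lines. A rough
| sketch using SQLite (table and logger names made up):
|
|     import logging, sqlite3, time
|
|     class SQLiteHandler(logging.Handler):
|         # Writes one row per log record to a local SQLite table.
|         def __init__(self, path="runs.db"):
|             super().__init__()
|             self.conn = sqlite3.connect(path)
|             self.conn.execute(
|                 "CREATE TABLE IF NOT EXISTS job_log "
|                 "(ts REAL, job TEXT, level TEXT, msg TEXT)"
|             )
|
|         def emit(self, record):
|             self.conn.execute(
|                 "INSERT INTO job_log VALUES (?, ?, ?, ?)",
|                 (time.time(), record.name, record.levelname,
|                  record.getMessage()),
|             )
|             self.conn.commit()
|
|     logging.getLogger("nightly_export").addHandler(SQLiteHandler())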
| disgruntledphd2 wrote:
| I think Airflow 2 added a decorator-based API (TaskFlow) that you
| can use directly on plain functions.
|
| Honestly, just use Airflow; it has its issues, but it sucks in
| well-known and predictable ways.
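|
| Something in this spirit (a sketch; exact arguments vary a bit
| between Airflow 2.x versions):
|
|     from datetime import datetime
|     from airflow.decorators import dag, task
|
|     @dag(schedule="@daily", start_date=datetime(2025, 1, 1),
|          catchup=False)
|     def my_pipeline():
|         @task
|         def extract():
|             return {"rows": 10}
|
|         @task
|         def load(payload):
|             print(payload["rows"])
|
|         # calling tasks wires the dependency: extract >> load
|         load(extract())
|
|     my_pipeline()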
| cicdw wrote:
| One of the goals of Prefect's SDK is to be minimally invasive
| from a code standpoint (in the simplest case you only need
| two lines to convert a script to a `flow`). Our deployment
| model also makes infrastructure job config a first-class
| citizen, so you might have a good time trying it out.
| (disclosure: work at Prefect)
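|
| Concretely, the two lines are the import and the decorator --
| something like this (function name is just an example):
|
|     from prefect import flow
|
|     @flow
|     def nightly_export():
|         # existing script body goes here, unchanged
|         print("moving data...")
|
|     if __name__ == "__main__":
|         nightly_export()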
| recursive4 wrote:
| Either Prefect or Dagster. FWIW, the Dagster team is actively
| reducing the learning curve with each release.
| rasmusab wrote:
| Pure Python scripts, maybe using the #%%-convention
| (https://code.visualstudio.com/docs/python/jupyter-support-py...)
| so you get the best of both notebooks and scripts, in a right-
| sized instance/container/machine. And if you need to run jobs in
| parallel, then orchestrate using make, like so:
| https://www.sumsar.net/blog/makefile-recipe-python-data-pipe...
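|
| For anyone unfamiliar with the convention: they're just comment
| markers that editors treat as notebook cells, so the file stays a
| plain Python script. A small sketch (file names made up):
|
|     # %% Load
|     import pandas as pd
|     df = pd.read_csv("input.csv")
|
|     # %% Transform
|     df["total"] = df["price"] * df["qty"]
|
|     # %% Save
|     df.to_csv("output.csv", index=False)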
| niwtsol wrote:
| Yeah, I love this -- pure Python with cron or periodic tasks
| (e.g., Django) works great. Celery tasks for parallelization,
| and if you pipe logs/alerts into a Slack channel, you can
| actually get really far without needing a "proper"
| orchestration layer.
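|
| The Celery piece can stay tiny too. A rough sketch (broker URL
| and task body are placeholders):
|
|     from celery import Celery
|
|     app = Celery("jobs", broker="redis://localhost:6379/0")
|
|     @app.task(bind=True, max_retries=3)
|     def sync_customer(self, customer_id):
|         try:
|             ...  # the actual work
|         except Exception as exc:
|             # simple retry with a 60s backoff
|             raise self.retry(exc=exc, countdown=60)
|
|     # Fan out in parallel from cron / a management command:
|     # for cid in customer_ids:
|     #     sync_customer.delay(cid)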
|
| I recently took over an Airflow system from a former colleague,
| and in our case, it's just overly complex for what's really a
| pretty simple data flow.
| myfakebadcode wrote:
| I've been using Airflow for quite some time. Given how mature our
| setup is, and even though I've tested other solutions, I don't
| really see us changing things.
| scary-size wrote:
| We've migrated to Flyte. Mostly using the Java/Scala API, which
| can be a bit verbose. The official Python API is actually easy
| on the eyes.
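|
| For comparison, the Python API looks roughly like this (a toy
| sketch, not from our codebase):
|
|     from flytekit import task, workflow
|
|     @task
|     def double(x: int) -> int:
|         return x * 2
|
|     @workflow
|     def pipeline(x: int = 3) -> int:
|         # tasks are called with keyword arguments inside a workflow
|         return double(x=double(x=x))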
| fforflo wrote:
| Makefile with make2graph to visualize DAGs.
| fmariluis wrote:
| If you're inside AWS, have a fully containerized workflow and/or
| can run some tasks in Lambda, Step Functions is probably ok? I
| personally prefer Airflow, but I wouldn't say it's the 'simplest
| data orchestration tool'.
| saturn8601 wrote:
| I used to work for an automation company that produced a product
| called ActiveBatch. It was such an amazing tool for drag-and-drop
| automation. Its focus was on full-fledged workflow automation,
| not just data orchestration.
|
| What I loved was its simplicity + its out-of-the-box features.
| Setting it up just took a simple MS SQL DB + an installer. Bam,
| you are up and running with an absolutely rock-solid scheduler
| (I've seen a million+ jobs running on it without it breaking a
| sweat). Then you could install (or use it to deploy) execution
| agents on all the servers you wanted as workers.
|
| It also installed a robust desktop GUI with many services built
| in and ready to go (anything from executing scripts all the way
| to performing direct actions against countless products a company
| might have, or against various cloud services).
|
| There were so many pre-built actions where all you had to do was
| input credentials and it would enumerate the appropriate
| properties from that service automatically. Then you could
| connect things together (i.e., pull something from the cloud,
| process it on some other server, store it, pass it along to
| another service, whatever you wanted).
|
| The only problem was that it's very much a B2B application, and
| their sales team is really only interested in selling to
| enterprises, not end users. I really wish we had something like
| this that regular people could download.
|
| Everything I've seen listed here requires extensive setup,
| requires coding, or lacks a robust desktop GUI, offering instead
| some half-baked web GUI that might require dropping back down to
| scripts/coding. You could set up hundreds or thousands of
| automated steps in ActiveBatch without writing a single line of
| code. I miss that product.
| rich_sasha wrote:
| I wrote my own in half a day. Worked 24/7 for 3 years... then I
| quit.
|
| Seriously, it took me much less time than setting up Airflow.
| Even had a webpage in the end, with all the tasks, a tree view,
| downstream/upstream tasks (these were incremental improvements
| beyond the initial half-day), a CLI... The works.
|
| I now know the points of fragility I didn't know before, but I'd
| do it again.
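|
| The core of something like that really can be small -- roughly a
| job table, a topological sort, and subprocess calls. A sketch in
| the same spirit (not the actual code; job names made up):
|
|     import subprocess, sys
|     from graphlib import TopologicalSorter  # stdlib, Python 3.9+
|
|     # job name -> (command, upstream dependencies)
|     JOBS = {
|         "extract":   (["python", "extract.py"], set()),
|         "transform": (["python", "transform.py"], {"extract"}),
|         "report":    (["python", "report.py"], {"transform"}),
|     }
|
|     def run_all():
|         graph = {name: deps for name, (_, deps) in JOBS.items()}
|         for name in TopologicalSorter(graph).static_order():
|             cmd, _ = JOBS[name]
|             print(f"running {name}", flush=True)
|             if subprocess.run(cmd).returncode != 0:
|                 sys.exit(f"{name} failed; stopping")
|
|     if __name__ == "__main__":
|         run_all()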
| itfollowsthen wrote:
| At my last startup I asked a friend to help me debug an Airflow
| DAG. He just pip installed Prefect and I've never really looked
| back. At the time everything else felt too hard to figure out.
| djsjajah wrote:
| A few people have mentioned Dagster, and I took a look at it for
| some machine learning things I was playing with, but then I found
| dvc (data version control [1]) and I think it is fantastic. I
| think it also has more applications than just machine learning --
| really anything with data. If you have a bunch of shell scripts
| that write to files to pass data around, then dvc might be a good
| fit. It will do things like only rerunning steps if it needs to.
| Also, for totally non-data stuff, Prefect is great.
|
| [1] https://dvc.org
| rubenfiszel wrote:
| You should give Windmill a try; it's more of a workflow engine
| than a data orchestration tool, but it's intuitive and open-
| source.
| speedgoose wrote:
| I like having containers running as CronJobs or Deployments in
| Kubernetes, but Argo Workflows has been a pretty reliable add-on
| to Kubernetes for the more advanced scenarios.
|
| However, it's only simple if you are already familiar with
| software containers and Kubernetes. But those are perhaps better
| to learn than having to deal with dependency hell in Python or
| Java.
| tdeck wrote:
| The simplest for sure was using ActiveJob in Rails with Clockwork
| for scheduling and Postgres for queueing things up.
___________________________________________________________________
(page generated 2025-03-21 23:02 UTC)