[HN Gopher] Lessons learned from running Apache Airflow at scale
___________________________________________________________________
Lessons learned from running Apache Airflow at scale
Author : datafan
Score : 218 points
Date : 2022-05-23 15:31 UTC (7 hours ago)
(HTM) web link (shopify.engineering)
(TXT) w3m dump (shopify.engineering)
| rr808 wrote:
| Surely there is a really simple distributed scheduler out
| there. Do I need to write one? I.e. one that has dependencies,
| no database, flat files, a single instance but trivial failover
| to a backup. I can even live without history or output.
| 0xbadcafebee wrote:
| What are you trying to do? Distributed scheduler with a single
| instance? No database? Are you sure you don't just mean "a
| scheduler" ala Luigi? https://github.com/spotify/luigi
|
| And what kind of scheduler? Again, for "a single instance" it
| doesn't need to be distributed. For distributed operation,
| Nomad is as simple and generic as you can get. If you need to
| define a DAG, that's never going to be simple.
| qkhhly wrote:
| airflow is one piece of software that i hate very much,
| especially the aspect that my job definition is intertwined with
| the actual job code. if my job depends on something that
| conflicts with airflow's dependency, it gets ugly.
|
| i actually like azkaban a lot better. of course, writing a plain
| text job config could also be painful. i think ideally you could
| write the job def in python or another lang but it gets
| translated to plain text config and does not interfere with your
| job code in any way.
| Mayzie wrote:
| The biggest pain point in Airflow I have experienced is the
| horrible and completely lacking documentation. The community
| support (Slack) won't (or can't) help with anything beyond basic
| DAG writing.
|
| That sore point makes running and using the software needlessly
| frustrating, and honestly I won't ever be using it again because
| of it.
| 8589934591 wrote:
| I agree with this. The slack is just the core developers
| discussing further development and tickets. The documentation
| is lacking big time. The only response to this is to raise PRs
| to improve docs.
| idomi wrote:
| Make sure to check out ploomber - our support is seamless, we
| have tons of docs (https://docs.ploomber.io/), and we take our
| users seriously. P.S. We integrate with airflow and other
| orchestrators if you still need to tackle those.
| pid-1 wrote:
| Agreed, I would just like to add that the documentation got a
| lot better in the past couple of years.
| artwr wrote:
| Can I ask more about the use case that you could not find an
| answer for?
| jlaneve wrote:
| That's one of the things we're working on at Astronomer - check
| out the Astronomer Registry! registry.astronomer.io
| higeorge13 wrote:
| I am wondering why they still use the celery executor when the
| kubernetes executor is the go-to one for large deployments. I
| have used the celery executor, had so many issues and stuck
| tasks in the past, and frequently had to fine-tune the celery
| configuration in the airflow config.
| mcqueenjordan wrote:
| I think airflow ends up creating as many problems as it solves
| and kind of warps future development patterns/designs into its
| black hole when it wouldn't otherwise be the natural choice.
| There's the sort of promise of network effects -- "well of course
| it's better if /everything/ is represented and executed within
| the DAG of DAGs, right?" -- but it ends up being the case that
| the inherent problems it creates plus the externalities of using
| airflow for the wrong use cases start to compound, especially as
| the org grows.
|
| I think it slowly ends up being sort of isomorphic to the set of
| problems that sharing database access across service and
| ownership boundaries has, and I'm increasingly in the
| "convince me this can't be an RPC call, please" camp, and when
| it really can't (for throughput reasons, for example), "ok, how
| about this big S3 bucket as the interface, with object
| notification on writes?"
| trumpeta wrote:
| We operate a (small?) Airflow instance with ~20 DAGs, but one of
| those DAGs has ~1k tasks. It runs on a k8s/aws setup with a
| MySQL database backing it.
|
| We package all the code in 1-2 different Docker images and then
| create the DAG. We've faced many issues (logs out of order or
| missing, random race conditions, random task failures, etc.).
|
| But what annoys me the most is that for that 1 big DAG, the UI
| is completely useless: the tree view has insane duplication, the
| graph view is super slow and hard to navigate through, and
| answering basic questions like what exactly failed and what
| nodes are around it is not easy.
| artwr wrote:
| At Airbnb, we were using SubDAGs to try to manage large numbers
| of tasks in a single DAG. This allowed organizing tasks and
| drilling down into failures more easily but came with its own
| challenges.
|
| In more recent versions of Airflow, TaskGroups
| (https://airflow.apache.org/docs/apache-
| airflow/stable/concep...,
| https://www.astronomer.io/guides/task-groups/ ) were added to
| help with this a little. Hopefully that helps.
|
| At ~1k nodes in the graph, introspection becomes hard anyway;
| as others have suggested, breaking it down if possible might be
| a good idea.
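|
| A rough sketch of the TaskGroup API, in case it helps (Airflow
| 2.x; the DAG and task names here are made up):
|
|     from datetime import datetime
|     from airflow import DAG
|     from airflow.operators.dummy import DummyOperator
|     from airflow.utils.task_group import TaskGroup
|
|     with DAG("big_pipeline", start_date=datetime(2022, 1, 1),
|              schedule_interval="@daily", catchup=False) as dag:
|         start = DummyOperator(task_id="start")
|         # Tasks inside the group render collapsed in the UI as
|         # ingest.fetch, ingest.validate, ...
|         with TaskGroup(group_id="ingest") as ingest:
|             fetch = DummyOperator(task_id="fetch")
|             validate = DummyOperator(task_id="validate")
|             fetch >> validate
|         start >> ingest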
| mywittyname wrote:
| Also, the @task annotation provides no facilities to name
| tasks. So if you like to build reusable tasks (as I do), you
| end up with my_generic_task__1, my_generic_task__2,
| my_generic_task__n. I've tried a few hacks to dynamically
| rename these, but I just ended up bringing down my entire
| staging cluster.
| artwr wrote:
| `your_task.override(task_id="your_generated_name")` not
| working for you?
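|
| For context, a rough sketch of what I mean (assuming a recent
| Airflow 2.x TaskFlow API; the function and ids here are made
| up):
|
|     from airflow.decorators import task
|
|     @task
|     def my_generic_task(src):
|         ...
|
|     # Inside a `with DAG(...)` block: each call gets an explicit
|     # task_id instead of my_generic_task__1, __2, ...
|     for src in ["a", "b", "c"]:
|         my_generic_task.override(task_id=f"load_{src}")(src)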
| mywittyname wrote:
| I got pretty excited when I read this response, but no, it
| doesn't work. I'm not sure how this would work since
| annotated tasks return an xcom object.
|
| Can you point me to the documentation on this function?
| It's possible I'm not using it correctly.
|
| I can do something like this, which works locally, but
| breaks when deployed:
|
|     res = annotated_task_function(...)
|     res.operator.task_id = 'manually assigned task id'
| flowair wrote:
|     @task.python(task_id="this_is_my_task_name")
|     def my_func():
|         ...
| mywittyname wrote:
| This still has the problem that, when you call my_func
| multiple times in the same dag, the resulting tasks will
| be labelled my_func, my_func__1, my_func__2, ...
| suifbwish wrote:
| Does this imply file metadata content can affect the access
| performance of those files even for operations that do not
| directly concern the metadata?
| rockostrich wrote:
| We had a similar DAG that was the result of migrating a single
| daily Luigi pipeline to Airflow. I started identifying isolated
| branches and breaking them off, with external task sensors back
| to the main DAG. This worked but it's a pain in the ass. My
| coworker ended up exporting the graph to graphviz and started
| identifying clusters of related tasks that way.
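|
| Roughly, each broken-off branch started with something like
| this (a sketch, Airflow 2.x; the dag/task ids are made up):
|
|     from datetime import datetime
|     from airflow import DAG
|     from airflow.sensors.external_task import ExternalTaskSensor
|
|     with DAG("branch_dag", start_date=datetime(2022, 1, 1),
|              schedule_interval="@daily", catchup=False) as dag:
|         # Block until the upstream task in the main DAG succeeds
|         # for the same execution date.
|         wait_for_main = ExternalTaskSensor(
|             task_id="wait_for_main",
|             external_dag_id="main_dag",
|             external_task_id="publish",
|             mode="reschedule",  # free the worker slot while waiting
|         )
|         # ... the isolated branch's tasks go downstream of this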
| mywittyname wrote:
| I've not had the best luck with ExternalTaskSensors. There
| have been some odd errors like execution failing at 22:00:00
| every day (despite the external task running fine).
| vbezhenar wrote:
| Can someone enlighten me whether Apache Airflow is suitable as a
| business process engine?
|
| We have something like orders. So people put orders into our
| system, some orders are imported from external system. We have
| something around 100-1000 orders per day, I think. Each order
| goes through several states. Like CREATED, SOME_INFO_ADDED,
| REVIEWED, CONCLUSION_CREATED, CONCLUSION_SENT_TO_EXTERNAL_SYSTEM
| and so on. Some states are simple to change, like few
| milliseconds to call some web services, some states are 5 minutes
| from operator, some states are few days. This logic is encoded
| into our program code. We have plenty of timers, every timer
| usually transfers orders from one state to another. This is
| further complicated by the fact that this processing is done via
| several services, so it's not a single monolith but some kind of
| service architecture.
|
| Our management wants something to have clear monitoring, so you
| can find a given task by some property values, monitor its
| lifetime, check logs for every step, find out why it's failing,
| etc.
|
| What I usually see is that Apache Airflow is used more like cron
| replacement. I've read some articles but it's still not clear
| whether it could be used as a business process engine. I had some
| experience with Java BPMN engines in the past, it was not very
| pleasant, but I guess time moved on.
| subsaharancoder wrote:
| A friend of mine wanted an ETL (SQL Server to BQ for analysis and
| dashboarding) set up and I ended up stumbling across Airflow. I
| spun up two VMs on GCP, one for Airflow and the other for the
| Postgres DB to store the metadata.
|
| - A few things I've noticed: Airflow generates a ton of logs
| that will fill up your disk quite fast. I started with 100GB
| and I'm now at 500GB; granted, disk space isn't expensive, but
| still, even with a few DAGs I'm surprised at how quickly they
| accumulate. Apparently you need to run a DAG to clear those
| logs, but I was too lazy so I just purge them with a cron job.
|
| - The SQL Server Operator is buggy, I filed an issue with the
| Airflow team but I had to do some hacky stuff to get it to work.
|
| - Even with a few DAGs, Airflow will spike the CPU utilization of
| the VM to 100% for X minutes (in my case about 15 minutes) which
| is quite interesting. My tasks basically query SQL Server -> dump
| to CSV (stored on GCS) -> import to BQ.
|
| - My DAGs execute every hour, and if Airflow is down for X hours
| and I resolve the issue, it will try to run all the tasks for the
| hours it was down which isn't ideal because it will take hours to
| catch up. So I've had to delete tasks and only run the most
| recent ones.
|
| Granted my set up is pretty simple and YMMV, but Airflow has done
| what it needs to do albeit with some pain.
| veeti wrote:
| FWIW if you don't need Airflow to catch up and backfill missed
| tasks, you can either set catchup=False on the DAG or use a
| LatestOnlyOperator.
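|
| A minimal sketch (Airflow 2.x; dag/task names made up):
|
|     from datetime import datetime
|     from airflow import DAG
|     from airflow.operators.bash import BashOperator
|     from airflow.operators.latest_only import LatestOnlyOperator
|
|     with DAG("hourly_etl", start_date=datetime(2022, 1, 1),
|              schedule_interval="@hourly",
|              catchup=False) as dag:  # skip runs missed while down
|         # Or gate tasks so only the most recent run executes them:
|         latest_only = LatestOnlyOperator(task_id="latest_only")
|         extract = BashOperator(task_id="extract",
|                                bash_command="echo extract")
|         latest_only >> extract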
| subsaharancoder wrote:
| I have catchup=False set in the DAG, but that hasn't stopped
| Airflow from backfilling missed tasks. Not sure why this is
| the case?
| awild wrote:
| > Even with a few DAGs, Airflow will spike the CPU utilization
| of the VM to 100% for X minutes (in my case about 15 minutes)
| which is quite interesting. My tasks basically query SQL Server
| -> dump to CSV (stored on GCS) -> import to BQ.
|
| Have you checked why that is? Airflow re-imports the DAG files
| every few seconds. We've had an issue where it didn't honor
| the .airflowignore file, making it execute our tests every few
| seconds. The easy solution was to put them into the docker
| ignore.
|
| You might also have too much logic at your modules' top level.
| It's recommended not even to import at the top level, to keep
| parsing fast.
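|
| A sketch of what that looks like (the heavy import here is
| hypothetical):
|
|     from datetime import datetime
|     from airflow.decorators import dag, task
|
|     @dag(start_date=datetime(2022, 1, 1),
|          schedule_interval="@daily", catchup=False)
|     def cheap_to_parse():
|         @task
|         def transform():
|             # Heavy imports live inside the callable so they run
|             # only at task execution, not on every scheduler parse.
|             import pandas as pd  # hypothetical heavy dependency
|             ...
|         transform()
|
|     dag = cheap_to_parse()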
|
| Not saying it's not an odd tool though.
| vvladymyrov wrote:
| Airflow brought one of the best tools with a nice UI for running
| pipelines back in 2014-2016. But nowadays engineers should be
| aware of easier-to-use options and not choose Airflow blindly as
| the default. IMHO for 80-90% of cases an orchestration system
| should not use code at all - it should be DAGs as configuration.
| Airflow is popular, and teams keep choosing it for building
| simple DAGs and incurring otherwise avoidable Airflow
| maintenance costs.
|
| Databricks orchestration pipelines and AWS Step Functions are
| good examples of DAGs as configuration.
| carschno wrote:
| Do you have more examples for better tools, ideally open source
| (unlike AWS Step functions)?
| jonpon wrote:
| We at magniv.io are building an alternative.
|
| Our core is open source https://github.com/MagnivOrg/magniv-
| core
|
| We can set you up with our hosted version if you would like to
| poke around!
| gadflyinyoureye wrote:
| Flowable is a BPMN system. You can do a lot of async calls
| with it. https://www.flowable.com/open-source
|
| We use it for a complex pricing process that invokes 30-40
| microservices that can take up to minutes per step.
| qw wrote:
| Kamelets and the Karavan UI in combination with k8s and
| Knative for "serverless" integrations looks interesting.
| antupis wrote:
| We are using prefect+dbt and I like it, although they are
| doing a huge rewrite at the moment.
| [deleted]
| avemg wrote:
| I've used AWS Step Functions extensively over the past several
| years, and give me code every day of the week over the Step
| Functions json config. Once you get beyond a few simple steps,
| it gets very hard to look at the config and understand what's
| going on with it, especially when you haven't looked at the
| config in a while. The DAG visualizer definitely helps, but as
| soon as things get beyond the trivial I long for a different
| tool.
| coolsunglasses wrote:
| I was until a week or two ago part of a team that builds
| datasets with extensive dependencies (thus, complicated DAGs).
|
| v1 of the system, built before I joined, was Step Functions
| and the like. It gets hairy just as you say.
|
| v2 I built and designed with the lead data engineer; we
| called it Coriaria originally. We're hoping/planning to open
| source it eventually, although it's a little wrapped around
| our company's internal needs & systems.
|
| It chooses neither "config" strictly speaking nor "code" for
| the DAG; instead the primary representation/state is all in
| the PostgreSQL database, which tracks the dataset dependencies
| and how each dataset is built. It's a DAG in PostgreSQL as
| well.
|
| To make dataset creation and management easier, I also wrote
| a custom Terraform provider for Coriaria. This made migrating
| datasets into the new system dramatically faster. The
| provider is really nice, supports `terraform import` and all
| that. Currently we have it set up so that there are separate
| roles/accounts that can modify an existing dataset, but
| reading state only requires authentication, not
| authorization. This enables one team to depend on another
| team's dataset as an upstream data source for their datasets
| without granting permission to modify it or create a
| potentially stale copy of the dataset. Terraform's internal
| DAG representation of the resource dependencies is leveraged
| because "parent_datasets" references the upstream datasets
| directly, including the ones we don't build.
|
| We're able to depend on datasets we don't build ourselves
| because the system has support for Glue catalog backends to
| track and register partition availability.
|
| Currently, it builds most of the datasets using AWS Athena &
| S3; however, this is abstracted behind a single step function.
| There's no DAG of step functions; it's just a convenient
| wrapper for the Athena query execution.
|
| The system also explicitly understands dataset builds and
| validations as separate steps. The dashboard makes it easy to
| trace the DAG and see which datasets are blocking a dataset
| build.
|
| We're adding more integrations to it soon so that other ways
| of kicking off dataset builds and validations are available.
|
| If people are interested in this I can begin lobbying for
| open sourcing the system. My colleague wanted to open source
| it as well.
|
| If all else fails, I'll rebuild it from scratch because I
| don't like the existing solutions for managing datasets. We've
| been calling it a data-flow orchestration system or ETL
| orchestration system; not sure which would be most meaningful
| to people.
|
| I think the main caveat to this system is that I'm not sure
| how much use it'd be for streaming data pipelines, but it
| could manage the discretization of streaming into validated
| partitions wherever streamed data is sunk into. Our operating
| assumptions are that you want validated datasets to drive
| business decisions, not raw event data streamed in from
| Kafka. Making sure the right data is located in each daily
| (or hourly) partition is part of that validation.
| latchkey wrote:
| Why not just model the json as objects in (insert favorite
| language) and then use that code to generate the json?
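|
| A sketch of the idea in Python (the classes here are made up;
| the output follows the Amazon States Language shape):
|
|     import json
|     from dataclasses import dataclass
|     from typing import List, Optional
|
|     @dataclass
|     class TaskState:
|         name: str
|         resource: str  # e.g. a Lambda ARN
|         next_state: Optional[str] = None
|
|     def to_asl(states: List[TaskState]) -> str:
|         # Render a linear chain of tasks as a states-language doc.
|         body = {}
|         for s in states:
|             body[s.name] = {"Type": "Task", "Resource": s.resource}
|             if s.next_state:
|                 body[s.name]["Next"] = s.next_state
|             else:
|                 body[s.name]["End"] = True
|         return json.dumps({"StartAt": states[0].name,
|                            "States": body}, indent=2)
|
|     print(to_asl([TaskState("Extract", "arn:...", "Load"),
|                   TaskState("Load", "arn:...")]))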
| entropicdrifter wrote:
| Ah yes, a home-made framework to generate configurations
| for your framework that's supposed to make your life
| easier. That way you can maintain your code that maintains
| your configs that make it easier to run your code that you
| have to maintain!
| latchkey wrote:
| Actually, yes. It allows for easier unit and integration
| testing as well. The original complaint is that things
| were getting hard to read and they wished there was code
| for this. It seems logical to create a framework for the
| json configuration files so that they can be easily
| mocked and tested. As someone who greatly values spending
| time on automated testing, it seems weird to not think of
| it this way.
|
| Quick google shows that others have done things like this
| already...
|
| [1] https://noise.getoto.net/2021/10/14/using-jsonpath-
| effective...
|
| [2] https://aws.amazon.com/about-aws/whats-
| new/2022/01/aws-step-...
|
| [3] https://docs.aws.amazon.com/step-
| functions/latest/dg/sfn-loc...
| pharmakom wrote:
| This can be a much better approach than upgrading the DAG
| description language to a true programming language. It
| forces anything complex to happen at build time where it
| can do less damage. Plus, we can often use the same
| library to do static analysis on the output.
| savin-goyal wrote:
| Metaflow provides a similar concept to interface with
| Step Functions and Argo Workflows in Python -
| https://docs.metaflow.org/going-to-production-with-
| metaflow/...
| [deleted]
| TYPE_FASTER wrote:
| AWS offers a service for managed Airflow:
| https://aws.amazon.com/managed-workflows-for-apache-airflow/
|
| Makes me wonder if Amazon internally was using Step
| Functions, ran into issues trying to scale to larger graphs,
| realized multiple teams were using Airflow, and created the
| Managed Airflow service.
| nomilk wrote:
| > teams keep choosing it for building simple DAGs
|
| I am part of one such team. We were using Windows Task
| Scheduler on a Windows VM to run jobs, and we figured it would
| be a nice idea to (dramatically) modernise and move to
| airflow, but we grossly underestimated the complexity,
| learning curve, and surrounding tools it requires. In the end
| we (the data science team) didn't get a single production task
| up and running. The data engineers had much more success with
| it though, probably because they dedicated much more time to
| it.
|
| Will look forward to trying AWS Step Functions.
| commandlinefan wrote:
| I tried installing Airflow locally to just play around with
| it and make sense of what it's good for and finally gave up
| after a few days - the install alone is insanely complicated,
| with lots of tricky hidden dependencies.
| always_left wrote:
| Did you try installing with docker? You would just download
| docker, run `docker-compose up --build`, and you'll be good to
| go locally (usually).
| idleprocess wrote:
| I can second this. We were up-and-running with Docker on
| our dev machines in just a few minutes. A native
| installation involves substantially more setup (Python,
| databases, Redis and/or Rabbit, etc.). The published
| docker-compose file will handle all of that for you. We
| have a very small data engineering team and have been
| able to move very quickly with Docker and AWS ECS (for
| orchestrating containers in test and prod environments).
| anonymousDan wrote:
| Can anyone ELI5 the value proposition of airflow?
| lysecret wrote:
| Well written article. One question I always have when reading
| such an article: is it really worth it for these kinds of
| companies to run Airflow on Kubernetes? You could also run it,
| for example, on AWS Batch with Spot instances.
| beckingz wrote:
| Running Airflow on Kubernetes has been one of the most painful
| data engineering challenges I've worked on.
| ricklamers wrote:
| We kept hearing this from our users. We've just released our
| k8s operator based deployment of Orchest that should give you
| a good experience running an orchestration tool on k8s
| without much trouble. https://github.com/orchest/orchest
|
| (We extended Argo, works fantastically well by the way!)
| marcinzm wrote:
| How so? Did you have any existing Kubernetes knowledge? We
| found it fairly easy to deploy using the community Helm chart
| (official chart wasn't out yet).
| mrbungie wrote:
| Did you have any previous experience running workloads in k8s
| before?
|
| Running the Airflow Helm chart is pretty straightforward, even
| with more "complex" use cases like heterogeneous pods for
| different task sizes.
| kbd wrote:
| I'm bullish about Dagster nowadays. Though, I don't have a lot of
| experience with Airflow. Figured I'd ask if anyone has switched
| from Airflow to Dagster and has any comments?
| perfect_kiss wrote:
| I participated in migrating around 100 fairly complicated
| pipelines from Airflow to Dagster over six months in 2021. We
| used the k8s launcher, so this feedback does not apply to
| other launchers, e.g. Celery.
|
| Key takeaways, roughly:
|
| - Dagster's integration with k8s really shines as compared to
| Airflow, it is also based on extendable Python code so it is
| easy to add custom features to the k8s launcher if needed.
|
| - It is super easy to scale the UI/server component
| horizontally, and since DAGs were running as pods in k8s,
| there was no problem scaling those as well. The scheduling
| component is more complicated, e.g. builtin scheduling
| primitives like sensors are not easily integrated with
| state-of-the-art message queue systems. We ended up writing a
| custom scheduling component that read messages from Kafka and
| created DAG runs via the networked API. It was like 500 lines
| of Python including tests, and worked rock-solid.
|
| - The networked API is GraphQL while Airflow's is REST; both
| are really straightforward, however Dagster's felt better
| designed, maybe due to tighter governance by Dagster's authors
| over the design.
|
| - The DAG definition Python API, e.g. solid/pipeline, or
| op/graph in the newer Dagster API, is somewhat complicated
| compared to Airflow's operators, however it is easy to build a
| custom DSL on top of it. One would need a custom DSL for
| complicated logic in Airflow as well, and in the case of
| Dagster it felt easier to generate its primitives than to do
| never-ending operator combinations in Airflow.
|
| - Unit and integration testing are much easier in Dagster; the
| authors treat testing as a first-class citizen, so mocks are
| supported everywhere, and code tested with the local runner is
| guaranteed to execute the same way on the k8s launcher. We
| never had any problems with test environment drift.
|
| The biggest caveat was the full change of internal APIs in
| 0.13, which forced the team to execute a fairly complicated
| refactor due to deprecation of the features we were depending
| on, e.g. execution modes. Had we spent more time on the
| Elementl slack, it would have been easier to take fewer
| dependencies on those features ^__^
| doom2 wrote:
| At my previous employer, we were running self-hosted Airflow in
| AWS, which really was a nightmare. The engineer that set it up
| didn't account for any kind of scaling and all the code was a
| mess. We would also get issues like logs not syncing correctly
| in our environment or transient networking issues that somehow
| didn't fail the given Airflow task. Eventually, we did a dual
| migration: temporarily switching to AWS managed Airflow (their
| Amazon Managed Workflows for Apache Airflow product) while also
| rewriting the DAGs in Dagster.
|
| Dagster was a great solution for us. Their notion of software
| defined assets allowed us to track metadata of the Redshift and
| Snowflake tables we were working with. Working with re-runs and
| partitioned data was a breeze. It did take a while to onboard
| the whole team and get things working smoothly, which was a bit
| difficult because Dagster is still young and they were often
| making changes to how parts of the system worked (although
| nothing that was immediately backwards incompatible).
|
| We also enjoyed some of the out of the box features like
| resources and unit testing jobs. Overall, I think it made our
| team focus more on our data and what we wanted to do with it
| rather than feeling like we had to wrangle with Airflow just to
| get things running.
| kbd wrote:
| Thanks for your comment! Ditto: last time I ran Airflow
| locally it took like 5 Docker containers. Then I forgot about
| the project and for a while was furious at Docker for
| randomly taking 100% CPU. Then I realized it was because of
| the Airflow containers that would restart along with Docker.
| I didn't get much further with Airflow.
|
| Dagster, on the other hand, seems to let you scale from using
| it locally as a library all the way to running on ECS/K8s
| etc. Along with that there's unfortunately a ton of
| complexity in setting it up but that's not Dagster's fault
| and it seems like Dagster works once you get it set up. Agree
| about it being young and there being some rough spots but
| it's got lots of good ideas. We were nearly done setting it
| up but got pulled off onto more urgent things, so I haven't
| run it in production yet. I'm glad to hear it worked well for
| you!
| computershit wrote:
| Dagster is extremely nice to work with. I did a bakeoff of
| Prefect vs Dagster internally at my current employer, and while
| we ended up going with Prefect for reasons, I am still so
| impressed with the way Dagster approaches certain pain points
| in the orchestration of data pipelines and its solution for
| them.
| theptip wrote:
| > for reasons
|
| I'd love to hear more on this. I've not evaluated Prefect,
| and am currently keeping an eye on Dagster. What trade-offs
| does Prefect win?
| 64StarFox64 wrote:
| I did a baby bakeoff internally in my prior role ~18mo ago
| now. Prefect felt nicer to write code in but perhaps not as
| easy to find answers in the docs (though their Slack is
| phenomenal). Ended up going with Prefect so I could focus
| on biz/ETL logic with less boilerplate, but I'm sure
| Dagster is not a bad choice either. Curious to hear about
| parent's experience
| simo7 wrote:
| I think the main lesson should be not to use it, especially at
| scale.
| 0xbadcafebee wrote:
| If you have the headcount for people just to build/support
| Airflow, please do yourself a favor and give that money to
| Astronomer.io. Their offering is _stupid good_. There's 20
| different reasons why paying them is a much better idea than
| managing Airflow yourself (including using MWAA), and it's dirt
| cheap considering what you get.
| [deleted]
| pid-1 wrote:
| Last time I checked, they asked for a significant minimum $ + 1
| year commitment.
|
| I wish they had a "start small", self service, clear pricing
| option.
| emef wrote:
| We've also been running airflow for the past 2-3 years at a
| similar scale (~5000 dags, 100k+ task executions daily) for our
| data platform. We weren't aware of a great alternative when we
| started. Our DAGs are all config-driven which populate a few
| different templates (e.g. ingestion = ingest > validate > publish
| > scrub PII > publish) so we really don't need all the
| flexibility that airflow provides. We have had SO many headaches
| operating airflow over the years, and each time we invest in
| fixing the issue I feel more and more entrenched. We've hit
| scaling issues at the k8s level, scheduling overhead in airflow,
| random race conditions deep in the airflow code, etc. Considering
| we have a pretty simplified DAG structure, I wish we had gone
| with a simpler, more robust/scalable solution (even if just
| rolling our own scheduler) for our specific needs.
|
| Upgrades have been an absolute nightmare and so disruptive. The
| scalability improvements in airflow 2 were a boon for our
| runtimes since before we would often have 5-15 minutes of
| overhead between task scheduling, but man it was a bear of an
| upgrade. We've since tried multiple times to upgrade past the 2.0
| release and hit issues every time, so we are just done with it.
| We'll stay at 2.0 until we eventually move off airflow
| altogether.
|
| I stood up a prefect deployment for a hackathon and found that
| it solved a ton of the issues with airflow (sane deployment
| options, not the insane file-based polling that airflow does).
| We looked into it ~1 year ago or so; I haven't heard a lot
| about it lately, and I wonder if anyone has had success with
| it at scale.
| pweissbrod wrote:
| If your team is comfortable writing in pure python and you're
| familiar with the concept of a makefile, you might find Luigi
| a much lighter and less opinionated alternative for workflows.
|
| Luigi doesn't force you into using a central orchestrator for
| executing and tracking the workflows. Tracking and updating
| task state is left as open functions for the programmer to
| fill in.
|
| It's probably geared for more expert programmers who work
| close to the metal and don't care about GUIs as much as high
| degrees of control and flexibility.
|
| It's one of those frameworks where the code that is not
| written is sort of a killer feature in itself. But definitely
| not for everyone.
| teej wrote:
| It's worth noting that Luigi is no longer actively maintained
| and hasn't had a major release in a year.
| pyrophane wrote:
| Very similar experience to yours. Adopted Airflow about 3 years
| ago. Was aware of Prefect but it seemed a bit immature at the
| time. Checked back in on it recently and they were approaching
| alpha for what looked like a pretty substantial rewrite (now in
| beta). Maybe once the dust has settled from that I'll give it
| another look.
| throwusawayus wrote:
| creator of prefect was an early major airflow committer.
| anyone know what motivated the substantial rewrite of
| prefect? i had assumed original version of prefect was
| already supposed to fix some design issues in airflow?
| timost wrote:
| I think you mean prefect orion/v2[0]. I'm curious too.
|
| [0] https://www.prefect.io/orion/
| dopamean wrote:
| If you could go back and use something else instead what would
| you choose?
| emef wrote:
| It's a good question. I believe airflow was probably the
| right choice at the time we started. We were a small team,
| and deploying airflow was a major shortcut that more or less
| handled orchestration so we could focus on other problems.
| With the aid of hindsight, we would have been better off
| spinning off our own scheduler some time in the first year of
| the project. Like I mentioned in my OP, we have a set of
| well-defined workflows that are just templatized for
| different jobs. A custom-built orchestration system that
| could perform those steps in sequence and trigger downstream
| workflows would not be that complicated. But this is how
| software engineering goes, sometimes you take on tech debt
| and it can be hard to know when it's time to pay it off. We
| did eventually get to a stable steady state, but with lots of
| hair pulling along the way.
| hbarka wrote:
| dbt tool. getdbt.com
| mywittyname wrote:
| Can dbt run arbitrary code? If it can, it's not well
| advertised in the documentation. Every time I've looked
| into dbt, I found that it's mostly a scheduled SQL runner.
|
| The primary reason we run Airflow is because it can execute
| Python code natively, or other programs via Bash. It's very
| rare that a DAG I write is entirely SQL-based.
| hbarka wrote:
| You're right. I think the strength of dbt is in the T
| part of ELT. I wrote ELT to make a distinction in
| principle from the traditional ETL. (E)xtract and (L)oad
| is the data ingestion phase that would probably be better
| served by Dagster, where you could use Python.
|
| (T)ransform is decoupled and would be served by set-based
| operations managed by dbt.
| igrayson wrote:
| dbt has just opened a serious conversation about
| supporting Python models. I'm sure they'd value your
| viewpoint! https://github.com/dbt-labs/dbt-
| core/discussions/5261
| KptMarchewa wrote:
| Dbt is great, but solves only a small part of what Airflow
| does.
| digisign wrote:
| Is Airflow good for an ETL pipeline? Right now a client uses
| Jenkins, but it is quite clunky and difficult to automate,
| though they've managed to. Cloud is not an option.
| theptip wrote:
| Airflow is generally brought in when you have a DAG of jobs
| with many edges, and where you might want to re-run a sub-
| graph, or have sub-graphs run on different cadences.
|
| In a simplistic ETL/ELT pipeline you can model things as
| "Extract everything, then Load everything, then Transform
| everything", in which case you'll add a bunch of unnecessary
| complexity with Airflow.
|
| If you're looking for a framework to make the plumbing of ELT
| itself easier, but don't need sub-graph dependency modeling,
| Meltano is a good option to consider.
| skrtskrt wrote:
| Could anyone comment on Temporal vs Airflow?
|
| After having a lot of pain points with an (admittedly older and
| probably not best-practices) Airflow setup, I am now at a
| different job running similar types of workflows on Temporal -
| we're pretty happy with it so far, but haven't done anything
| crazy with it.
| matesz wrote:
| I know airbyte.io (elt platform) is built on top of Temporal,
| but I haven't used it.
| tomwheeler wrote:
| Yes, Airbyte is using Temporal. Here is a blog post they
| wrote a few weeks ago that goes into more detail about it:
| https://airbyte.com/blog/scale-workflow-orchestration-
| with-t...
| Serow225 wrote:
| I'd love to hear that too :)
| tomwheeler wrote:
| Hi, Tom from Temporal here. I don't have a lot of experience
| with Apache Airflow personally, but I was at Cloudera when it
| was added to our Data Engineering service, so I learned about
| it at the time. Here are a few things that come to mind:
|
| * Both Apache Airflow and Temporal are open source
|
| * Both create workflows from code, but the approach is
| different. With Airflow, you write some code and then
| generate a DAG that Airflow can execute. With Temporal, your
| code _is_ your workflow, which means you can use your
| standard tools for testing, debugging, and managing your
| code.
|
| * With Airflow, you must write Python code. Temporal has SDKs
| for several languages, including Go, Java, TypeScript, and
| PHP. The Python SDK is already in beta and there's work
| underway for a .NET SDK.
|
| * Airflow is pretty focused on the data pipeline use case,
| while Temporal is a more general solution for making code run
| reliably in an unreliable world. You can certainly run data
| pipeline workloads on Temporal, but those are a small
| fraction of what developers are doing with Temporal (more
| here: https://temporal.io/use-cases).
| claytonjy wrote:
| Do you see Temporal as being a super-set of DAG managers
| like Airflow/Dagster/Prefect, or do you see uses where
| those tools would be a better choice than Temporal?
| claytonjy wrote:
| I'm also curious about this. The folks I hear about Temporal
| from seem to be very disjoint from Airflow users, and
| Temporal's python client is still alpha-stage.
|
| It seems notable to me that the big Prefect rewrite mentioned
| elsewhere [0] leans into the same "workflow" terminology that
| Temporal uses. I have to wonder if Prefect saw Temporal as
| superseding the DAG tools in coming years and this is them
| trying to head that off.
|
| That post's discussion of DAG vs workflow also sounds a _lot_
| like why PyTorch was created and has seen so much success.
| Tensorflow was static graphs, pytorch gave us dynamism.
|
| [0] https://www.prefect.io/blog/announcing-prefect-orion/
| encoderer wrote:
| Is anybody out there doing anything interesting with Airflow
| monitoring?
|
| At my startup Cronitor we have an Airflow sdk[0] that makes it
| pretty easy to provision monitoring for each DAG, but essentially
| we are only monitoring that a DAG started on time and the total
| time taken. I keep thinking about how we could improve this and
| it would be great to hear about what's working well today for
| monitoring.
|
| [0] https://github.com/cronitorio/cronitor-airflow
| rozhok wrote:
| I'm working at https://databand.ai -- a full-fledged solution
| for Apache Airflow monitoring, data observability and lineage.
| We have airflow sync, integrations with
| Spark/Databricks/EMR/Dataproc/Snowflake, configurable alerts,
| dashboards, and much more. Check it out.
| AtlasBarfed wrote:
| ... the service seems to be centrally managed. A lot of the pain
| points are clearly "everyone running in the same instance" or
| kind of similar. Sure makes for big brag points in the numbers.
|
| Sounds like basic SaaS needs to be provided as a capability,
| while the teams spin up their instances and shard to their needs.
|
| One of the problems with enterprise workflows is putting
| everything together. Workflows are already cacophonous. A
| cacophony of cacophonies is madness.
| taude wrote:
| Tangential to this thread... what sites, sources, etc. do
| people who work on modern data pipelines (engineering and
| analysts) go to to follow the latest news, products,
| techniques, etc.? It's been hard to keep up without having
| Meetups and such the last couple years. I'm finding a lot of
| people's comments here pretty interesting, and they're showing
| me things I haven't heard of. Thanks.
| kderbe wrote:
| I follow the Analytics Engineering Roundup weekly email. It's
| published by dbt Labs but isn't overtly promotional.
|
| https://roundup.getdbt.com/
| taude wrote:
| Thanks. We're starting to use DBT, too. I know the forums
| over at DBT are pretty good, too.
| mcnnowak wrote:
| I'm also interested in this topic, but can't find anything
| other than "Top 10 things you should STOP doing as a data
| engineer" etc. content-mill, clickbait on Medium and other
| sites.
| taude wrote:
| Yes, this. I'd like to get less of the "Marketing sales
| stuff", and more in the trenches with the actual engineering
| teams.
| blakeburch wrote:
| I've had really great success from engaging with the Locally
| Optimistic Slack community.
|
| Also, Christophe Blefari has an excellent data newsletter.
| https://www.blef.fr/
|
| And Modern Data Stack has a newsletter, tool information, Q&A
| www.moderndatastack.xyz
| jonpon wrote:
| Data Twitter and Linkedin are great, there are a lot of people
| putting out some really good content. There are also a lot of
| substacks you can sign up for. Data Engineering Weekly is my
| fave
| dtjohnnyb wrote:
| Slack groups have filled in the meetup space in my life,
| mlops.community and locally optimistic are two of the best for
| what it sounds like you're looking for
| tinco wrote:
| What sort of workflows do you run in Apache Airflow? Are they
| automating interactions with partners/clients or internal
| communications? How can it become so scaled up that they (and
| many people in the comments here as well) have trouble
| managing the hardware? How can it become so complex that the
| workflows need to be expressed as DAGs? What's a workflow?
|
| I don't think I've ever worked anywhere that had automated
| workflows, though I've only worked for small startups so far.
| ldjkfkdsjnv wrote:
| Unless you have extremely complex dependency graphs, I really
| don't think airflow is worth it. It's very easy to end up
| essentially writing an "orchestrator" using airflow; it allows
| for very flexible low level operations. The added complexity
| has minimal benefit, and, like apache spark, what looks simple
| becomes hard to reason about in real world scenarios. You need
| to understand how it works under the hood, and get the best
| practices right.
|
| As mentioned elsewhere, AWS step functions are really the best in
| orchestration.
| arinlen wrote:
| > _As mentioned elsewhere, AWS step functions are really the
| best in orchestration._
|
| AWS Step Functions is a proprietary service provided
| exclusively by AWS, which reacts to events from AWS services
| and calls AWS Lambdas.
|
| Unless you're already neck-deep in AWS, and are already
| comfortable paying through the nose for trivial things you can
| run yourself for free, it's hardly appropriate to even bring up
| AWS Step Functions as a valid alternative. For instance,
| Shopify's articles explicitly mention they are running their
| services in Google Cloud. Would it be appropriate to tell them
| to just migrate their whole services to AWS just because you
| like AWS Step Functions?
| Jugurtha wrote:
| That was one of the reasons we do "bring your own compute"
| with https://iko.ai so people who already have a billing
| account on AWS, GCP, Azure, or DigitalOcean can just get the
| config for their Kubernetes clusters and link them to iko.ai,
| and their machine learning workloads will run on whichever
| cluster they select.
|
| If you get a good deal from one cloud provider, you can get
| started quickly.
|
| It's useful even for individuals such as students who get
| free credits from these providers: create a cluster and
| you're up and running in no time.
|
| Our rationale was that we didn't want to be tied to one
| cloud provider.
| parsnips wrote:
| https://github.com/checkr/states-language-cadence allows you
| to define workflows in states language over cadence.
| literallyWTF wrote:
| This is another symptom of a person who doesn't know what
| they're talking about really.
|
| It's like those stackoverflow answers that tell the user to
| stop using PHP and rewrite it in Python or something.
| riku_iki wrote:
| > already comfortable paying through the nose for trivial
| things you can run yourself for free
|
| But a fault tolerant workflow engine is not a trivial thing;
| it may cost you many engineer hours to build, monitor and
| maintain, so outsourcing it to someone else is a totally
| viable solution.
| arinlen wrote:
| > _But fault tolerant workflow engine is not trivial
| thing,_
|
| The complexity and risk of migrating cloud providers
| eclipses whatever problem you assign to "fault tolerant
| workflow engines".
|
| Any mention of AWS Step Functions makes absolutely no sense
| at all and reads at best like a non-sequitur.
| thorum wrote:
| I read it as a comment on the UX / developer experience,
| which can be superior with Step Functions vs the competition,
| regardless of whether Step Functions is an appropriate
| (or even physically possible) option for non-AWS
| projects.
| nojito wrote:
| I'll never understand why individuals always default to cloud
| offerings when they are extremely expensive compared to a
| dedicated tool.
| wussboy wrote:
| I don't need to manage the cloud offering and that management
| time is expensive. Your befuddlement at this simple economic
| calculation is, well, befuddling.
| mywittyname wrote:
| It's easy to get started and you don't need to worry about
| infra.
|
| I've been a one-man army at places because leveraging these
| cloud offerings allows me to crank out working software that
| scales to the moon without much thought.
|
| I'd rather pay AWS/GCP to handle infra, so that I can get
| 2-3x as many projects done.
| nojito wrote:
| None of the airflow problems in this thread are due to
| infrastructure, so how does using a cloud service solve
| anything?
| benjamoon wrote:
| It's easy to understand when you have lots of money, but no
| time. Cloud is simple and expensive, self managed is complex
| and cheap. Time's money and all that!
| nojito wrote:
| Except for this workflow.
|
| You won't get around any of the problems of airflow by
| moving to a cloud offering.
| 0xbadcafebee wrote:
| Why hire an expensive janitorial service to clean your
| office? Why hire a mechanic to fix your car?
| serial_dev wrote:
| It's shocking that some people cannot fathom that in
| certain scenarios cloud offerings make sense.
|
| They don't always make sense, in certain scenarios it is
| worth taking an open source, cloud independent tool, in
| some scenarios you can roll your own, but there are
| circumstances where it's a good choice using a tool your
| cloud provider gives you.
| literallyWTF wrote:
| Because they don't know what they're doing and aren't the
| ones paying the bill.
|
| "Oh I have to learn how to use and setup this tool? I think
| I'll just pay the equivalent salaries and be locked in..."
| waynesonfire wrote:
| You're going to pay the bill regardless, whether the
| employee is hired by your team or hired by the cloud
| vendor.
|
| I don't know how management works through this math, maybe
| managing people gets exhausting and they just want to out-
| source it so leadership doesn't have to deal with it and
| then they can just focus on the core product.
|
| And the above "I don't want to deal with it" reason isn't
| spoken of, the more more commonly touted benefit is cloud's
| "flexibility". Sure, but this is actually _really_
| expensive. Every cloud migration effort I've experienced is
| only just worthwhile to begin to talk about because the
| costs are based on long-term contracts of cloud resources,
| not the per-hour fees. Nice flexibility.
|
| With that said, the cloud may be a good place for
| prototyping, where the infrastructure isn't the core value
| add and is uncertain. A start-up is a prototype, and so
| here we are. But for an established company to migrate to
| the cloud and fire the staff that's maintaining the on-
| premise resources... I'm skeptical. More than likely, this
| leads to maintaining both cloud and on-premise resources,
| not firing anyone, and thus actually increasing costs for
| an uncomfortably long time.
|
| And for the folks on the ground, who don't pay the bills,
| the increase of accidental complexity is rather painful.
| saimiam wrote:
| I'm very much paying my own cloud bills but there is no
| chance I would be able to orchestrate some of the workflows
| I want to orchestrate if it were not for Step Functions.
|
| For a one person shop like me, AWS is a force multiplier.
| With it, I can do (say) 30% of what a dedicated engineer in
| a specific role could do. Without it, I'd be doing 0%.
|
| I really like this tradeoff for my particular situation.
| [deleted]
| taude wrote:
| Fast time to market with a fraction of the effort?
| pyrophane wrote:
| > As mentioned elsewhere, AWS step functions are really the
| best in orchestration.
|
| Why? Where else is this mentioned?
| tootie wrote:
| I read some of these massively complex data architecture
| posts and I almost always come away asking "What the hell is
| this for?" I know Shopify is a huge business, but I see this
| kind of engineering complexity and all I think is it has to
| cost tens of millions to build and operate, and how could
| they possibly be getting ROI? There are ten boxes on that
| diagram and none of them have a user interface for anyone
| except other developers.
| generalpf wrote:
| A lot of times this is used for data warehousing, so product
| managers and others can query the database of one app
| joined with another, especially in an environment with
| microservices. You might join a table containing orders with
| another table from a totally different DB, like
| payments, to find out which kinds of items are best to offer
| BNPL on or something.
|
| The author also mentions that it's used for machine learning
| models which will ultimately feed back into Shopify's front
| end, for instance.
| thefourthchime wrote:
| > AWS step functions are really the best in orchestration.
|
| At our company, AWS Step Functions is a disaster. You're
| effectively writing code in JSON/YAML. Anything beyond very
| simple steps becomes 2 pages of YAML that's very hard to read
| or write. There is no way to debug, polling is mostly
| unusable, and changes need to be deployed with CF, which can
| take forever, or worse, hang.
|
| It's one of the most annoying technologies I've used in my
| 20+ years of engineering.
| fmakunbound wrote:
| Definitely this.
|
| Our teams have many 1 or 2 step DAGs that are idempotent. They
| could have been lambdas, and they're already pulling from SQS.
| It could be just my misfortune, but in AWS, MWAA is kind of
| janky. It's difficult to track down problems in the logs (task
| failures look fine there) and the Airflow UI is randomly
| unavailable ("document returns empty", "connection reset" kind
| of things).
| mywittyname wrote:
| Lambdas have resource constraints that Airflow DAGs don't.
| Most notably, Airflow DAGs can run for any arbitrary length
| of time. And the local storage attached to the Airflow
| cluster is actual disk space, not just a fake in-memory
| disk, making it possible to process files larger than the
| amount of memory allocated to the DAG.
|
| There's certainly some functionality overlap, but I don't see
| Lambda and Airflow as competitors. Each has capabilities that
| the other doesn't.
| xtracto wrote:
| I remember reading that you can attach EFS to lambdas. That
| would solve some of the storage issues.
| byteflip wrote:
| So interesting that a lot of comments seem to be negative
| experiences. I haven't used Airflow at scale yet but would
| love to convert our extremely limited, internally built
| orchestrator + jobs over to Airflow. I think it would allow us
| TO scale, at least for some time. I think a lot of companies
| are still really behind the times. Our DAGs are fairly simple,
| and Airflow has been a major improvement in my testing. The UI
| is great for debugging jobs / monitoring feed health /
| backfilling. DAG writing has been a bit frustrating but is a
| much improved format over the internal systems we have. Am I
| just naive? Is everyone writing extremely complex graphs? Is
| this operational complexity due mostly to K8s (I've just been
| playing with Celery)? Anyone enjoying using Airflow?
| jonpon wrote:
| The problems in this article and in the comments are some of
| the stuff we have heard at Magniv in the past few months when
| talking to data practitioners. We are focused on solving some
| subset of these problems.
|
| Personally, I think Airflow is currently being un-bundled and
| will continue to be, with more task specific tools.
|
| At the very least, if un-bundling doesn't occur, Prefect and
| Dagster are working hard to solve lots of these issues with
| Airflow.
|
| Evolution of products and engineering practices is not linear
| and sometimes doesn't even make sense when looked at
| a posteriori (as much as I would like it to follow some
| logical process). It will be interesting to see how this space
| develops in the next year or so.
| jimmytucson wrote:
| I've used Airflow for a few years and here's what I don't like
| about it:
|
| - Configuration as code. Configuration should be a way to change
| an application's behavior _without_ changing the code. Make me
| write a workflow as JSON or XML. If I need a for-loop, I'll write
| my own script to generate the JSON.
|
| - It's complicated. You almost need a dedicated Airflow expert to
| handle minor version upgrades or figure out why a task isn't
| running when you think it should.
|
| - Operators often just add an API layer on top of existing
| ones. For example, to start a transaction on Spanner, Google has
| a Python SDK with methods to call their API. But with Airflow,
| you need to figure out what _Airflow_ operator and method wraps
| the Google SDK method you're trying to call. Sometimes the
| operator author makes "helpful" (opinionated) changes that
| refactor or rewrite the native API.
|
| I would love a framework that just orchestrates tasks (defined as
| a command + an image) according to a schedule, or based on the
| outcome of other tasks, and gives me a UI to view those outcomes
| and restart tasks, etc. And as configuration, not code!
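|
| (Re the for-loop point above: the generator script can be as
| dumb as this sketch, with a made-up workflow schema:)
|
|     import json
|
|     TABLES = ["orders", "payments", "users"]
|     workflow = {
|         "schedule": "@daily",
|         "tasks": [
|             {"name": f"sync_{t}",
|              "command": f"python sync.py --table {t}",
|              "depends_on": []}
|             for t in TABLES
|         ],
|     }
|     with open("workflow.json", "w") as f:
|         json.dump(workflow, f, indent=2)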
| atombender wrote:
| What you're asking for is basically Argo Workflows, I think.
|
| Not that I recommend it. It's quite lovely in principle, but
| really flawed in practice. It's YAML hell on top of Kubernetes
| hell (and I say that as someone who loves Kubernetes and uses
| it for everything every day).
|
| Having worked with some of these tools, what I've started to
| wish for is a system where pipelines are written in just plain
| code. I'd like to run and debug my pipeline as a normal,
| compiled program that I can run on my own machine using the
| tools I already use to build software, including things like
| debuggers and unit testing tools. Then, when I'm ready to put
| it into production, I want a super scalable scheduler to take
| my program and run it across dozens of autoscaling nodes in
| Kubernetes or whatever.
|
| The only thing I've come across that uses this model is
| Temporal, but it's got a rather different execution model than
| a straightforward pipeline scheduler.
| ricklamers wrote:
| The flexibility of code as configuration is indeed somewhat of
| a footgun at times. That's why with Orchest we went with a
| declarative JSON config approach.
|
| We take inspiration from the Kubeflow project and run tasks as
| containers. With a GUI for editing pipelines and managing
| scheduled runs we come pretty close to what you're asking for
| (bring an image and run a command). And it's OSS, of course.
|
| https://github.com/orchest/orchest
| KptMarchewa wrote:
| >If I need a for-loop, I'll write my own script to generate the
| JSON.
|
| That's how you end with extreme mess in logs, UI and metrics.
|
| > But with Airflow, you need to figure out what _Airflow_
| operator and method wraps the Google SDK method you're trying
| to call.
|
| Or you can use PythonOperator with hooks, that generally
| integrate external APIs with Airflow connection system.
|
| https://github.com/apache/airflow/blob/main/airflow/provider...
|
| I think the bigger problem with Airflow's Operator concept is
| N*M problem of integrating multiple systems. That's how you end
| with GoogleCloudStorageToS3Operator and stuff like that.
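|
| To illustrate the hook route, a sketch (assumes the Amazon
| provider package; the connection id, bucket and paths are
| made up):
|
|     from datetime import datetime
|     from airflow import DAG
|     from airflow.operators.python import PythonOperator
|     from airflow.providers.amazon.aws.hooks.s3 import S3Hook
|
|     def upload(ds, **_):
|         # The hook pulls credentials from Airflow's connection
|         # system instead of hardcoding them here.
|         S3Hook(aws_conn_id="aws_default").load_file(
|             filename=f"/tmp/export_{ds}.csv",
|             key=f"exports/{ds}.csv",
|             bucket_name="my-bucket",
|         )
|
|     with DAG("hook_example", start_date=datetime(2022, 1, 1),
|              schedule_interval="@daily", catchup=False) as dag:
|         PythonOperator(task_id="upload", python_callable=upload)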
| rubenfiszel wrote:
| If your flow is more linear looking than a complex DAG and you
| want a full featured web editor (with lsp), automatic
| dependency handling, and typescript (deno) and python support,
| I am building an OSS, self-hostable airflow/airplane
| alternative at: https://github.com/windmill-labs/windmill
|
| You write the modules as normal python/deno scripts, we infer
| the inputs by statically analyzing your script parameters, and
| we take care of the rest. You can also reuse modules made by
| the community (building the script hub atm).
| slig wrote:
| Thank you! Exactly what I was looking for.
| AtlasBarfed wrote:
| Isn't there an Uber workflow product? One that also scales on
| top of Cassandra?
| tomwheeler wrote:
| You're probably thinking of Temporal (https://temporal.io/),
| which is a fork of the Cadence project originally developed at
| Uber.
| 8589934591 wrote:
| I echo other comments. Running and managing Airflow beyond
| simple jobs is complicated. But if you are running and
| managing Airflow for simpler jobs, you might not need Airflow.
|
| One data center company that I know of uses airflow at scale with
| docker and k8s. They have a huge team of devops just to manage
| the orchestrator. They in turn have to fine tune the orchestrator
| to run smoothly and efficiently. Similar to what shopify has
| noted here, they have built on top of and extended airflow to
| take care of pain points like point 4. For companies like this it
| makes sense to run airflow.
|
| Another issue I see with companies/engineers who adopt airflow
| is that they use it as a substitute for a script rather than
| as an orchestrator. For example, say you want to download
| files from an API, upload to s3, load it into your warehouse
| (say snowflake) and do some transformations to get your final
| table - instead of writing separate scripts for each step of
| fetch/upload/ingest/transform and calling each step from the
| dag, they end up writing everything as a task in a dag. A huge
| disadvantage is there is a lot of code duplication. If you had
| a script as a CLI, all your dag/task has to do is call the
| script with the respective args (see the sketch below). I
| agree that airflow comes with a lot of convenience wrappers to
| create tasks for many things, but I feel this results in
| losing flexibility.
|
| This also results in them tying their workflow to airflow;
| for any change they might need, they have to modify their
| airflow code directly. If you want to modify how/what you
| upload to s3, you end up writing/modifying python functions in
| the respective dags' code. This removes the flexibility to
| modify/substitute any component of the workflow with something
| else, or even change the orchestrator from airflow to
| something else. Additionally, different teams might write
| workflows in different ways - standardization of practice is
| really hard. This in turn results in pouring more investment
| into maintaining and hiring "airflow data engineers".
| Companies fall into steep tech debt.
|
| Prefect/dagster are new orchestrators in town. I'm yet to try
| them out but I've heard mixed reviews about them.
|
| EDIT: Forgot about upgrades. A lot of upgrades are breaking
| changes, especially the recent change from 1 to 2. You end up
| spending a lot of time just trying to debug what went wrong.
| Just installing and running it is a pain.
| blakeburch wrote:
| Love your observation about tying the workflow to Airflow.
|
| One of my biggest annoyances in the orchestration space is that
| teams are mixing business logic with platform logic, while
| still touting "lack of vendor lock-in" because it's open
| source. At the point that you're importing Airflow specific
| operators into your script and changing the underlying code to
| make sure it works for the platform (XCom, task decorators,
| etc.), you are directly locking yourself in and making edits
| down the road even more difficult.
|
| While some of the other players do a better job, their method
| of "code as workflow" still results in the same problems, where
| workflows get built as a "mega-script" instead of as modular
| components.
|
| I'm a co-founder at Shipyard, a light-weight hosted
| orchestrator for data teams. One of our core principles is
| "Your code should run the same locally as it does on our
| platform". That means 0 changes to your code.
|
| You can define the workflow in a drag and drop editor or with
| YAML. Each task is its own independent script. At runtime, we
| automatically containerize each task and spin up ephemeral
| file storage for the workflow, letting you run scripts one
| after the other, each in their own virtual environment, while
| still sharing generated files as if you were running them on
| your local machine. In practice, that means that individual
| tasks can be updated (in app or through GitHub sync) without
| having to touch the entire workflow.
|
| I'm biased, but it seems crazy to me that so many engineers
| are willing to spend hours fighting the configuration of their
| orchestration platform rather than focusing on solving the
| problems at hand with code.
| rockostrich wrote:
| We've established a rule that all "custom" code (anything that
| isn't a preexisting operator in airflow) needs to be contained
| in a docker image and run through the k8s pod operator. What's
| resulted is that most folks do exactly what you said: they
| create a repo with a simple CLI that runs a script, and the
| only thing that gets put in our airflow repo is the dependency
| graph/configuration for the k8s jobs.
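|
| Each DAG entry then looks roughly like this (a sketch; the
| image, namespace and args are made up):
|
|     from datetime import datetime
|     from airflow import DAG
|     from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
|         KubernetesPodOperator,
|     )
|
|     with DAG("custom_job", start_date=datetime(2022, 1, 1),
|              schedule_interval="@daily", catchup=False) as dag:
|         run_job = KubernetesPodOperator(
|             task_id="run_job",
|             name="run-job",
|             namespace="default",
|             # The custom code lives in the image, not this repo;
|             # the DAG only passes CLI args.
|             image="registry.example.com/team/job:latest",
|             arguments=["--date", "{{ ds }}"],
|         )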
| claytonjy wrote:
| AFAICT this is the now-recommended way to use Airflow: as a
| k8s task orchestrator. Even the Astronomer team (original
| Airflow authors) will tell you to do it this way.
| idomi wrote:
| When it comes to scale and DS work, I'd use the ploomber open
| source project (https://github.com/ploomber/ploomber). It
| allows an easy transition between dev and production,
| incrementally building the DAG so you avoid expensive compute
| time and costs. It's easy to maintain and integrates
| seamlessly with Airflow, generating the DAGs for you.
| dekhn wrote:
| I tried to run airflow; I found pretty much everything about
| it to be wrong for my use case. Why can't I easily upload a
| workflow through the UI? Why doesn't it handle S3 file staging
| for me?
| jonpon wrote:
| Would love to hear more about your use case and your issues --
| you can sign up on our website (magniv) or send me an email jon
| at our domain.
| hatware wrote:
| It definitely takes some time getting used to the quirks of
| Airflow. I know it took 6 months of running it at my last gig
| to really understand what was happening underneath the UI.
|
| With great control comes great responsibility.
| dekhn wrote:
| actually I concluded it was just not that great a workflow
| engine. It's probably just intended for a different use case
| than mine.
| ashtonbaker wrote:
| we run airflow with ... considerably more dags than this. our
| main "lesson learned" is that airflow should not be used "at
| scale".
| blakeburch wrote:
| Lots of the comments here describe personal experiences of
| complexity and frustration with Airflow, but I'd venture to
| say that's true of most data orchestration tools. In fact,
| that sort of feedback is so consistent that I'm half tempted
| to start a podcast of "orchestration horror stories" (contact
| me if interested).
|
| What I've found while building out Shipyard, a hosted lightweight
| orchestration platform, is that teams want something that "just
| works". Servers that "just scale". Observability that doesn't
| require digging. Notifications and retries that work
| automatically. Workflows that don't mix business logic with
| platform logic. Code and workflows that sync with git. Deployment
| that only takes a few minutes.
|
| For the straightforward use cases, where you need to run tasks A
| -> G daily, with a bit of branching logic, Airflow is overkill.
| Yes, Airflow has a lot of great complex functionality that can
| help you down the road. But Airflow keeps getting suggested to
| everyone even if it's not best suited to their use case,
| resulting in lots of lost time and engineering overhead.
|
| While I definitely have bias, there are a lot of other high
| quality alternatives out there to explore nowadays!
___________________________________________________________________
(page generated 2022-05-23 23:00 UTC)