[HN Gopher] Airflow's Problem
       ___________________________________________________________________
        
       Airflow's Problem
        
       Author : cloakedarbiter
       Score  : 231 points
       Date   : 2022-08-02 12:10 UTC (10 hours ago)
        
 (HTM) web link (stkbailey.substack.com)
 (TXT) w3m dump (stkbailey.substack.com)
        
       | jxi wrote:
       | I've had good success with https://dagster.io, which is much more
       | opinionated about your pipelines, including properly typing
       | inputs and outputs.
        
       | calebm wrote:
       | > If it sounds like you could simply replace Airflow with
       | basically any other job execution engine, that's because you
       | could.
       | 
       | Has anyone tried Luigi for data engineering pipelines?
        
       | datascientist wrote:
       | Recent perspectives from the creators of Prefect, Dagster, Flyte,
       | and Orchest => https://gradientflow.com/summer-of-orchestation/
        
         | claytonjy wrote:
         | Seems to be missing Temporal/Cadence, which I'm very excited
         | about, but I've never heard of Flyte or Orchest.
        
       | ciguy wrote:
       | My current company has been having a lot of success with Dagster.
       | It seems to give a lot more flexibility than Airflow in terms of
       | defining the pipeline and where to run it. It's also a bit
       | friendlier when things fail and need to be backfilled or retried
       | IMO. Airflow feels like it's somewhat legacy at this point in
       | time. It served a need well but the needs have changed now.
        
       | ForHackernews wrote:
       | Airflow is super clunky, but it gets the job done (mostly).
       | 
       | I'm kind of a fan of Prefect as an alternative:
       | https://docs-v1.prefect.io/core/about_prefect/why-not-airflo...
        
       | didip wrote:
       | The problem with airflow is its scalability (or lack thereof) and
       | DAG dependency management.
       | 
       | Airflow successors must figure out how to distribute the cron and
       | all dependencies should be self contained in a Docker image.
        
       | pdinny wrote:
       | The post feels like a bait-and-switch in the sense that it
       | presents itself as about Airflow's shortcomings but focuses
       | mostly on problems that Airflow doesn't attempt to solve.
       | 
       | Airflow can certainly be frustrating and it doesn't solve _all_
       | workflow orchestration problems. Surely the same thing can be
       | said of many tools? This seems mostly like a mismatch of
       | expectations.
        
         | cturner wrote:
         | When we engage in complex work, it is important to keep our
         | options open so we can direct our best effort to the hardest
         | problems.
         | 
         | It is rarely clear what the hard problems will be when new to a
         | domain. Only as scale kicks in.
         | 
         | We are constantly pitched frameworks that sell themselves as a
         | good approach to a domain, but then obstruct engagement with
         | the hardest problems when it matters. The developer becomes
         | captive of the system that claimed it would steer them right.
         | 
         | This is particularly true of fields where the hard problems are
         | integration problems which, by their nature, cannot be
         | outsourced to frameworks.
        
         | MontyCarloHall wrote:
         | 100% agreed. By analogy, an article titled "pthreads' problem"
         | should be about shortcomings in the POSIX multithreading model,
         | not an article saying that the implementation of machine-level
         | parallelism is irrelevant because Kubernetes exists.
        
         | evrydayhustling wrote:
         | FWIW the author is pretty direct about this. After the cute
         | beginning, he basically says that his problem is with Airflow's
         | scope, not its execution.
        
           | pdinny wrote:
           | Hence the bait and switch. IMHO increasing the scope of
           | Airflow (or any tool) is a challenging proposition. Would you
           | rather use very few mega-tools with very broad scope (and
           | potentially more challenging domain to navigate) or a larger
           | number of more specialised tools that interoperate well
           | together?
           | 
           | Obviously there are trade-offs with either approach, but then
           | I'd argue that making Airflow solve more problems will
           | introduce more trade-offs too.
        
           | peteradio wrote:
           | Having a poor scope is a problem because people will just
           | choose not to use you.
        
       | glogla wrote:
       | This is why I thought the shift from "orchestrate jobs" to "keep
       | track of state of assets" that Dagster is trying to do is pretty
       | important. But it seems it might not be enough - it still keeps a
       | clunky (pythonic) interface and I don't know how well it does
       | multi-tenancy.
        
         | sethjr5rtfgh wrote:
         | Could you clarify what "keep track of state of assets" means?
        
           | geoffjentry wrote:
           | It's a mindset shift to a more declarative model. The idea
           | has also popped up in other niche orchestrators.
           | 
           | This is an oversimplification, but IMO the easiest way of
           | picturing it is this: instead of defining your graph as a
           | forward-moving thing, with the orchestrator telling things
           | when they can be run, you shift to defining your graph
           | _nodes_ to know their dependencies, and they let the
           | orchestrator know when they're runnable.
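           | 
           | A minimal sketch of that declarative framing in plain Python,
           | not any orchestrator's real API. The asset names and the
           | helpers runnable() and materialize_all() are hypothetical:
           | each asset only declares what it depends on, and the loop
           | asks which assets are currently runnable.

```python
from graphlib import TopologicalSorter

# Hypothetical asset registry: each asset names the assets it depends on.
ASSETS = {
    "raw_orders": [],
    "raw_users": [],
    "clean_orders": ["raw_orders"],
    "orders_by_user": ["clean_orders", "raw_users"],
}


def runnable(assets, materialized):
    """Assets not built yet whose dependencies are all materialized."""
    return sorted(
        name
        for name, deps in assets.items()
        if name not in materialized and all(d in materialized for d in deps)
    )


def materialize_all(assets):
    """Build everything in waves; each wave is whatever is runnable now."""
    TopologicalSorter(assets).prepare()  # raises CycleError on a cyclic graph
    done, waves = set(), []
    while len(done) < len(assets):
        wave = runnable(assets, done)
        if not wave:
            raise ValueError("some asset depends on an undeclared asset")
        waves.append(wave)
        done.update(wave)
    return waves


waves = materialize_all(ASSETS)
```

           | Nothing here states an execution order up front; the order
           | falls out of the declared dependencies.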
        
           | glogla wrote:
           | https://www.dagster.io/blog/software-defined-assets
           | 
           | Of course, nothing stops Airflow or other tools from thinking
           | this way as well.
        
       | foxbee wrote:
       | Like a lot of software, it's matured and along the way has put on
       | some weight. I still love it for certain use cases, but tools
       | like Dagster are piquing my interest.
        
       | htahir111 wrote:
       | Wouldn't it make sense to decouple the orchestration layer from
       | the authoring layer for the dags? That way you could solve the
       | authoring problems separately from the lower level orchestration
       | problems. We're trying this over at ZenML (https://zenml.io) but
       | have yet to get feedback
        
       | stkbailey wrote:
       | Author here - appreciate the comments and reads. To add a bit of
       | color -- I spent about a month looking into orchestrators to
       | migrate Whatnot's data platform onto earlier this year, and it
       | was a miserable experience. We were on AWS Managed Airflow, but
       | to stay on it and have a solid platform, I would have been
       | writing Github Actions for CI/CD, standing up ECR and IAM roles
       | with Terraform, setting up EKS to run Kubernetes jobs, managing
       | infra monitoring with Datadog, etc., etc.
       | 
       | In fact, I did end up doing all those things, but we opted for
       | Dagster Cloud, because of their focus on improving developer
       | efficiency. Their team provided pre-built Github actions for
       | CI/CD and recently introduced PR-specific branch deployments,
       | which has been amazing. They're moving towards serverless
       | execution, built-in ECR repositories, managed secrets. Prefect
       | and Astronomer I expect are moving in this direction, too, but I
       | liked the Dagster project's energy quite a bit.
       | 
       | As I've waded into the MLOps world as well, it just keeps looking
       | like every platform basically devolves into: an orchestrator
       | that provisions compute resources and logs metadata into an
       | opinionated data model. Catalog tools like Atlan are metadata
       | sinks that are trying to build out orchestration/workflow
       | capabilities. dbt Cloud of course is just an orchestrator for a
       | specific type of data product that is aiming to operationalize
       | metadata with its metrics layer.
       | 
       | Orchestration + a metadata data model is a common denominator
       | here, and I think the fact that Airflow is so inevitable has made
       | it really hard for people to imagine the category as anything
       | other than a scheduler, but perhaps some of these new companies
       | can break new ground.
        
         | portable_hotpot wrote:
         | In your investigation did you try out Flyte at all?
        
           | stkbailey wrote:
           | Nope -- just MWAA, Astronomer, Dagster Cloud, and Prefect
           | Cloud. In the past I used Argo Workflows pretty extensively
           | and have talked about its pros and cons here:
           | https://www.youtube.com/watch?v=-cyr_kL-9fc
        
         | awinder wrote:
         | "We were on AWS Managed Airflow, but to stay on it and have a
         | solid platform, I would have been writing Github Actions for
         | CI/CD, standing up ECR and IAM roles with Terraform, setting up
         | EKS to run Kubernetes jobs, managing infra monitoring with
         | Datadog, etc., etc."
         | 
         | DAGs can be published to S3 for cutting down on like half of
         | these dependencies. And the nice thing about MWAA is log &
         | stats publishing over cloudwatch, which should flow into any
         | existing amazon integrated tooling.
         | 
         | For our team setting up terraform for iam & mwaa, some deploy
         | pipelines to s3, and connecting some config bits to wire up
         | splunk logs / monitoring pieces was not that much work.
         | Initiating a separated vendor relationship & pricing out data
         | ingress/egress costs would blow that work out of the water but
         | maybe it's a difference in company size/placement.
        
         | jw887c wrote:
         | >We were on AWS Managed Airflow, but to stay on it and have a
         | solid platform, I would have been writing Github Actions for
         | CI/CD, standing up ECR and IAM roles with Terraform, setting up
         | EKS to run Kubernetes jobs, managing infra monitoring with
         | Datadog, etc., etc.
         | 
         | This sounds like an issue not with Airflow but with
         | integration.
        
           | stkbailey wrote:
           | Yep, that's what I tried to point out in the article.
        
         | theptip wrote:
         | Thanks for the experience report - I have Dagster and Prefect
         | on my shortlist to evaluate next time I need to build this, and
         | Dagster seems the most promising, so it's good to get another
         | datapoint.
         | 
         | One Q - it seems to me that another possible solve (and
         | probably how the big guys tend to do it) is to use a dataflow
         | engine like Spark/Flink. Did you compare a managed platform
         | like Google Dataproc? They also have serverless if you don't
         | want a heavy managed cluster, which might make this approach
         | more viable for non-huge companies that wouldn't utilize a min-
         | spec cluster. (When I last evaluated this they didn't have
         | serverless which was a dealbreaker for my small scale).
        
       | chazeon wrote:
       | I need to deal with a system with complex dependency
       | relationships. The system needs to stop executing when some step
       | fails. I used to look at Airflow; I remember at the time it was
       | the first result if you Googled anything about DAGs. But it has
       | the same problem as other data-engineer-centric systems: it is
       | somewhat over-engineered and relies too much on a central
       | server. Its heavy reliance on a Python runtime is also something
       | I don't like; it feels like everything else is a second-class
       | citizen. The nature of our job is that we have to deal with
       | complex legacy code: some Fortran programs, some that need a
       | conda environment, etc. Later I found that what I should look at
       | is just some makefile-like system, and I settled on Snakemake,
       | which has a nice DSL that forces you to be explicit about
       | inputs, outputs, etc. Probably Airflow is just not the right
       | tool for a one-man team.
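       | 
       | A rough sketch of the makefile-like idea in plain Python. This
       | is not Snakemake's DSL; the rule tuples and the helpers
       | is_stale(), run_rules() and demo() are made up for
       | illustration. Each rule names its inputs and output explicitly,
       | an output is rebuilt only when it is missing or older than an
       | input, and any exception stops the run. Rules are assumed to be
       | listed in dependency order.

```python
import os
import shutil
import tempfile


def is_stale(inputs, output):
    """An output is stale if it is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    built = os.path.getmtime(output)
    return any(os.path.getmtime(path) > built for path in inputs)


def run_rules(rules):
    """Run rules in the order given; an exception from any action stops the run."""
    ran = []
    for name, inputs, output, action in rules:
        missing = [path for path in inputs if not os.path.exists(path)]
        if missing:
            raise FileNotFoundError(f"rule {name!r} is missing inputs: {missing}")
        if is_stale(inputs, output):
            action(inputs, output)
            ran.append(name)
    return ran


def to_upper(inputs, output):
    """Example action: upper-case the first input into the output."""
    with open(inputs[0]) as src, open(output, "w") as dst:
        dst.write(src.read().upper())


def demo():
    """Build twice in a scratch directory; the second pass should do nothing."""
    workdir = tempfile.mkdtemp()
    try:
        raw = os.path.join(workdir, "raw.txt")
        upper = os.path.join(workdir, "upper.txt")
        final = os.path.join(workdir, "final.txt")
        with open(raw, "w") as handle:
            handle.write("fortran output\n")
        rules = [
            ("upper", [raw], upper, to_upper),
            ("final", [upper], final, lambda ins, out: shutil.copyfile(ins[0], out)),
        ]
        return run_rules(rules), run_rules(rules)
    finally:
        shutil.rmtree(workdir)
```

       | Calling demo() runs both rules on the first pass and skips
       | both on the second, because the outputs are already up to date.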
        
       | tchiotludo wrote:
       | All the issues described in this post led me to create Kestra
       | [0]. Airflow was a true revolution when it was open-sourced and
       | we need to thank it for its innovation. But I totally agree that
       | a large static DAG is not appropriate in today's data world with
       | data mesh and domain responsibility.
       | 
       | [0] https://github.com/kestra-io/kestra
        
       | dusted wrote:
        
         | barbecue_sauce wrote:
         | Is an Apache open source project considered a hipster-corp?
        
         | lopatin wrote:
         | So to confirm, you are annoyed that the name of this project is
         | a word.
        
           | dusted wrote:
           | Definitely. I'd like the name of this project to be changed
           | to be a name, not a common word. Thanks.
        
             | lopatin wrote:
             | I'd suggest to raise the issue in the devlist or on GitHub
             | Issues to get more visibility.
        
               | dusted wrote:
               | I'd be okay with the article headline on HN being "Apache
               | Airflow's Problem" so that I know it's about a piece of
               | apache software and not something interesting about
               | airflow.
        
               | tomwheeler wrote:
               | The Apache Software Foundation would also prefer this,
               | per their trademark policy:
               | https://www.apache.org/foundation/marks/faq/#guide
        
               | lopatin wrote:
               | I think context matters and the title "Airflow's Problem"
               | doesn't make much sense when talking about the physical
               | phenomenon of flowing air.
        
               | morelisp wrote:
               | Not really, the past two years have seen probably an order
               | of magnitude increase in interest and articles about
               | industrial-scale ventilation and air quality.
        
               | lopatin wrote:
               | You still wouldn't phrase it like "Airflow's Problem"
               | because the concept of airflow is incapable of having a
               | problem. It just exists. Some theories _about_ airflow
               | may have problems, but then you would specify that in the
               | title.
        
       | orthoxerox wrote:
       | A few years ago a new guy at our DWH team tried to sell Airflow
       | to the rest of the team. They invited me to listen to his talk as
       | well, and I was baffled why something so trivial as Airflow was
       | being sold as a critically important piece of infrastructure.
       | 
       | Why would I need a glorified server-side crontab if something
       | like MS DTS from 1998 could do the same, but better? Sure, Python
       | is probably better than whatever DTS generated, but the ops don't
       | care either way, since Airflow doesn't care what it's running.
       | 
       | Something as simple as "job A must run after job B and job C, but
       | if it doesn't start by 2am, wake up team X. If it doesn't finish
       | by 4am, wake up team Y" isn't Airflow's problem, it's your
       | problem.
       | 
       | "What's the overall trend for job D's finish time, what is the
       | main reason for that?" isn't Airflow's problem, it's your
       | problem. "What jobs are on the critical path for job E?" isn't
       | Airflow's problem, it's your problem.
       | 
       | "Job F failed for date T and then recursively restart everything
       | that uses its results for date T" isn't Airflow's problem, it's
       | your problem.
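       | 
       | For concreteness, two of the questions above ("restart
       | everything that uses F's results" and "what is on the critical
       | path for job E") are plain graph queries. The sketch below uses
       | only the Python standard library; the job graph, the run times
       | and the helper names rerun_plan() and critical_path() are
       | invented for illustration and are not Airflow features.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: job -> jobs that must finish before it starts.
DEPS = {
    "B": [],
    "C": [],
    "A": ["B", "C"],
    "F": ["A"],
    "E": ["F", "C"],
    "G": ["B"],
}
# Hypothetical run times in minutes.
MINUTES = {"A": 30, "B": 20, "C": 45, "E": 10, "F": 15, "G": 5}


def rerun_plan(deps, failed):
    """The failed job plus everything that transitively uses it, in run order."""
    order = list(TopologicalSorter(deps).static_order())
    affected = {failed}
    for job in order:  # dependencies always appear before their dependents
        if any(d in affected for d in deps.get(job, ())):
            affected.add(job)
    return [job for job in order if job in affected]


def critical_path(deps, minutes, target):
    """Longest chain of dependencies (by total minutes) that ends at target."""
    finish = {}  # job -> (earliest finish time, previous job on the longest chain)
    for job in TopologicalSorter(deps).static_order():
        start, prev = max(
            ((finish[d][0], d) for d in deps.get(job, ())),
            default=(0, None),
        )
        finish[job] = (start + minutes[job], prev)
    path, job = [], target
    while job is not None:
        path.append(job)
        job = finish[job][1]
    return path[::-1], finish[target][0]


plan = rerun_plan(DEPS, "F")
path, total_minutes = critical_path(DEPS, MINUTES, "E")
```

       | With this toy graph, a failure in F means rerunning F and then
       | E, and the chain that determines when E can finish runs
       | through C, A and F.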
        
         | coldtea wrote:
         | > _and I was baffled why something so trivial as Airflow was
         | being sold as a critically important piece of infrastructure._
         | 
         | https://news.ycombinator.com/item?id=9224
         | 
         | > _Something as simple as "job A must run after job B and job
         | C, but if it doesn't start by 2am, wake up team X. If it
         | doesn't finish by 4am, wake up team Y" isn't Airflow's problem,
         | it's your problem._
         | 
         | I guess that's one approach to job security. And why not make
         | data egress manual too? Why transfer data through the network,
         | when you can print them, mail the papers, and type them back
         | in? Data input is not the computer's problem, it's your
         | problem!
         | 
         | > _Something as simple as "job A must run after job B and job
         | C, but if it doesn't start by 2am, wake up team X. If it
         | doesn't finish by 4am, wake up team Y" isn't Airflow's problem,
         | it's your problem. "What's the overall trend for job D's finish
         | time, what is the main reason for that?" isn't Airflow's
         | problem, it's your problem. "What jobs are on the critical path
         | for job E?" isn't Airflow's problem, it's your problem. "Job F
         | failed for date T and then recursively restart everything that
         | uses its results for date T" isn't Airflow's problem, it's your
         | problem._
         | 
         | The whole idea of writing programs is making things
         | automatable. That is, making them the computer's problem, not
         | our problem. We get the higher level problem of writing the
         | automation once, and fixing any bugs in our code, then we get
         | to enjoy putting it to work for us...
        
         | robertlagrant wrote:
         | > isn't Airflow's problem, it's your problem
         | 
         | This is a baffling statement.
        
       | Hippocrates wrote:
       | I despise airflow and how cemented it is as data infrastructure.
       | It's such a useful and basic concept but a nightmare to manage, and
       | it works like junk. It's taken me 3 separate jobs over 7 years to
       | realize that it's probably not our fault. Everyone seems to
       | struggle with the same things: flaky scheduler that is slow to
       | run tasks, confusing and redundant sounding settings that apply
       | at up to three different levels (environment, job, task). It
       | invites less experienced users to write a sea of spaghetti code
       | in a monolithic DAGs repo. People wind up doing heavy data
       | munging in python operators, which clobbers scalability and
       | reliability. It also can't handle a large number of parallel
       | tasks or frequent runs. It seems to have miserable scalability
       | for the resources given, and bad controls for auto scaling. The
       | UI feels dated and unintuitive. XComs seem useful to everyone but
       | work like crap and are actually an anti-pattern.
       | 
       | I've also tried it on Cloud Composer (google managed) and
       | automated upgrades always trashed the cluster. It's not well
       | designed for GKE because it writes logs to files and requires
       | stateful sets. Testing the code is a huge burden due to the vast
       | environment and dependencies needed to make it work locally.
       | 
       | I'm eager to rid my life of it and test out temporal for some of
       | the high concurrency/frequency cases we have.
        
         | zibarn wrote:
         | Not experienced here but as a genuine interest can you tell
         | what problems airflow solves that can't be handled by celery
         | and rabbitmq?
        
           | code_biologist wrote:
           | An analogy is "can you tell what problems Django solves that
           | can't be handled by wsgi and psycopg?" Nothing fundamentally
           | different, but life is a whole lot easier with Django.
           | Honestly if you're doing data engineering and you haven't
           | spent time with a good DAG runner, you're doing yourself a
           | real disservice.
           | 
           | My sibling comment did a good job explaining, but the UI +
           | configurable storage + configurable triggers all out of the
           | box make life a lot easier.
        
           | Hippocrates wrote:
           | I have not used celery + rabbitmq but I assume that combo is
           | like sidekiq + redis, or any other job queue + worker system.
           | 
           | Airflow packages those things together and adds some
           | additional features:
           | 
           | - UI with Graph, gantt, logs and other views of the workflow
           | - Users and permissions
           | - Places to store config
           | - Mechanisms for passing small data between tasks
           | - Various "sensors" for triggering workflows
           | - Various operators that interact with common data-oriented
           |   systems (bigquery, snowflake, s3, you name it). These are
           |   basically libraries that expose a config-forward API.
           | 
           | Probably the main selling point is the pre-made operators,
           | but in short it is a complete solution with bells and
           | whistles that aligns itself with the data ecosystem.
        
         | bushbaba wrote:
         | The idea behind airflow is great. What sucks is people using it
         | to do heavy processing. Maybe with serverless/k8s airflow could
         | fan out the processing to a cluster to allow for flexibility.
         | But then, I guess you end up re-writing spark et al.
        
       | patwater10 wrote:
       | <shrug>
       | 
       | ETL seems just like one of those perennial challenges that resist
       | humanity's efforts to categorize the world into neat and tidy
       | boxes
        
       | markus_zhang wrote:
       | I don't really agree with decentralized ETL. You can't imagine
       | the mess people proudly wrote. And yeah I have been on the other
       | side of the trench too.
        
       | shmoogy wrote:
       | I've been using airflow for about 2 years now in production. It's
       | been mostly good - the few times things go wrong, it's a huge
       | pain in the ass to figure out why... but it's significantly
       | better than just straight cron on Linux. Airflow 2 has improved
       | on a lot of the speed and catch-up issues from Airflow 1.x.
       | 
       | I don't have time to investigate other solutions like dagster and
       | prefect and migrate jobs to it for testing.
        
         | glogla wrote:
         | If it is just you, you are fine, and I'm not sure other tools
         | would have that much benefit.
         | 
         | Trouble with Airflow starts when multiple teams and user types
         | start to share it.
        
           | shmoogy wrote:
           | I've definitely noticed more issues after adding users, but
           | it's more that they don't actually understand a lot of what
           | they're trying to do and cause problems when writing dags.
        
             | glogla wrote:
             | Yeah, Airflow isn't multi-tenant.
             | 
             | People can potentially overwrite each other's DAGs.
             | Credential management is complicated. Broken DAG can stop
             | whole Airflow. Slow DAG can impact performance of whole
             | Airflow. Getting DAGs to wait for each other (like one team
             | prepares data up to a point and then other team builds on
             | that) is kind of a nightmare. Sometimes people want
             | features from newer Airflow, but some other team built DAG
             | that isn't forward compatible. Etc etc.
             | 
             | But I'm not sure there actually is a better solution
             | elsewhere. At least I have not seen it yet, maybe Dagster
             | is on a good road.
             | 
             | But as I said, for centralized solutions it works really
             | well.
        
       | kumare3 wrote:
       | I remember having this feeling a few years ago. What I realized
       | is that airflow has taught us a few bad habits and also brought
       | ahead an interesting paradigm of the vertical workflow engine.
       | 
       | I agree Airflow is old and legacy, and ideally folks should not
       | use it; the reality is there are a lot of pipelines already built
       | with it, sadly. I think as a community we have to start moving
       | away from it for more complicated problems.
       | 
       | Disclaimer: I created Flyte.org and heavily believe in
       | decentralized development of DAGs and centralized management of
       | infrastructure
        
       | edumucelli wrote:
       | More recent tools such as Dagster and Prefect have much more to
       | offer. One simple example is communication between tasks. Airflow
       | has a clunky system for that called XCom. The actual author of
       | XCom says you probably should not use it due to the level of
       | hackery it has under the hood [1]:
       | 
       | On Dagster and Prefect you communicate between tasks as if you
       | were writing pure Python. On Airflow on the other hand ...
       | 
       | [1] https://www.youtube.com/watch?v=TlawR_gi8-Y&t=740s
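       | 
       | A toy contrast between the two styles, in plain Python with no
       | orchestrator involved. The side_channel dict is only a
       | stand-in for a key-value hand-off between tasks; it is not
       | Airflow's XCom API, and the function names are made up.

```python
# Style 1: tasks talk through a shared key-value side channel.
# This dict is a toy stand-in, not Airflow's XCom API.
side_channel = {}


def extract_push():
    """Produce rows and leave them under an agreed-upon key."""
    side_channel["rows"] = [3, 1, 2]


def total_pull():
    """Fetch rows by key; a mistyped key would only fail when this runs."""
    return sum(side_channel["rows"])


# Style 2: tasks are ordinary functions; data flows through arguments
# and return values, so the dependency is visible in the code itself.
def extract():
    return [3, 1, 2]


def total(rows):
    return sum(rows)


extract_push()
via_side_channel = total_pull()
via_return_values = total(extract())
```

       | Both styles compute the same total; the difference is whether
       | the hand-off between tasks is implicit (a shared key) or
       | explicit (a function argument).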
        
         | marcinzm wrote:
         | I'm not sure if linking to a talk from 2018 in 2022 for a
         | project that is being actively worked on (a bunch of Python
         | abstractions for xcom were added in 2.0) is fair.
        
           | byteflip wrote:
           | Yes, curious if the Taskflow API introduced as part of
           | Airflow 2.0 reduces this pain. It appears much easier/saner
           | than working with XCOMs directly - less coupling and removes
           | the need for lots of boilerplate code.
        
         | karog wrote:
         | There's also Flyte, which is natively data aware and schedules
         | tasks around data dependencies. The syntax is essentially pure
         | python too.
        
       | MontyCarloHall wrote:
       | Dismissing Airflow for not being Astronomer is like dismissing
       | Linux for not having the capabilities of a large-scale
       | hypervisor.
       | 
       | Replace "Airflow" with "Linux," "data engineers" with "systems
       | programmers," and "Astronomer" with your hypervisor of choice
       | (Xen/VMWare/etc.), and you can see how absurd the author's point
       | is:
       | 
       | > My problem is that ~Airflow~ Linux was not designed to address
       | [high-level systems architecture] problems. We don't need a
       | better [Linux], but we need a higher-level one: a system that
       | enables ~data engineers~ systems programmers to think at a
       | platform level.
       | 
       | > In fact, [Linux] is already displaced. [Linux] qua [Linux] is
       | already obsolete, and it happened right within the [Linux]
       | ecosystem. It's called ~Astronomer~ Xen/VMWare/etc.
       | 
       | > If it sounds like you could simply replace [Linux] with
       | basically any other ~job execution engine~ operating system,
       | that's because you could.
       | 
       | This is where the argument falls apart. Yes, for very large,
       | complex deployments, higher-level orchestration is important, but
       | the choice of low-level execution engine is also still hugely
       | relevant, just as the choice of guest OS is still hugely relevant
       | when discussing large deployments of VMs.
       | 
       | Furthermore, very few people actually need very large scale
       | deployments; user experience and capabilities at the low-level
       | are what most users actually care about.
        
         | orwin wrote:
         | Honestly, we have to set up airflow at my job for some datalog
         | collection and treatment. Which is fine, only i'm pretty sure
         | we had exactly the same issue at my old job that we fixed in
         | half a day, including testing and deployment, with a perl
         | script. And i think in this particular instance (gitlab logs)
         | it was treated with 90% Awk. Meanwhile my coworkers still have
         | issues after almost a week (not all of this is on airflow, but
         | still).
         | 
         | I'm not saying Airflow is bad (we did set up a lot of hadoop
         | clusters and other apache products at my old job, and our
         | clients used airflow a lot), but i think the evangelists are so
         | good they push airflow for everything, and this is bad. OP did
         | use airflow for something it was not really designed for, and
         | it sucked, but i do have this impression that tech writers and
         | apache evangelists deserve some of the blame.
        
         | mywittyname wrote:
         | Managed Airflow doesn't even solve any of the author's outlined
         | frustrations. It keeps the "obscene" syntax, it's still
         | stateless, it's not "decentralized" etc.
         | 
         | Honestly, the article is so disingenuous that it comes off like
         | a paid-for puff piece for Astronomer. It's the article-
         | equivalent of the late-night infomercial guy who rips open a
         | bag of potato chips like the hulk because he doesn't have this
         | special tool that's just four easy payments of $9.99.
        
           | glogla wrote:
           | FYI, the infomercials with the strange tools fixing strange
           | problems are usually focused on old or disabled people.
           | Opening a bag of chips with a ridiculous tool sounds stupid,
           | but it might help a stroke survivor or someone with one arm -
           | but the sellers don't want to show those struggles on the
           | screen to avoid humiliating people, so you see perfectly
           | healthy looking young people spilling things like they have
           | some neurodegenerative disorder or something. Because the
           | target audience might.
           | 
           | Not saying infomercial people are angels, of course, but I
           | wanted to share this somewhat nonobvious context.
           | 
           | (To stretch the metaphor, Airflow management system that
           | gives everyone their own Airflow might be ridiculous but make
           | sense for companies where cooperation is difficult :))
        
           | TTPrograms wrote:
           | The new TaskFlow API has been part of Airflow 2.0 since its
           | release in 2020: https://airflow.apache.org/docs/apache-
           | airflow/stable/tutori...
        
           | stkbailey wrote:
           | Interesting, Astronomer was actually my last choice for
           | orchestrator. We went with Dagster, but I didn't want to make
           | the takeaway "Dagster solves these problems", because it
           | doesn't directly. Astronomer was just the best foil for the
           | "meta-orchestrator" space that seems to be evolving, and
           | which _can_ address these problems.
        
         | hatware wrote:
         | Agreed. The author is blaming Airflow for what are ultimately
         | poor architecture decisions.
         | 
         | I will admit it's not easy to figure out best practices with
         | Airflow, but if you make bad decisions and your system doesn't
         | scale with the problem, you didn't understand the problem or
         | how to solve it in the first place. The tools you chose are
         | second to that.
        
           | peteradio wrote:
           | You may not know very precisely the time constants you are
           | dealing with in your problem until you give it a shot.
        
       | pharmakom wrote:
       | I don't understand the desire to describe DAGs in Python... it's
       | a fine scripting language but pretty horrible for this
       | declarative description stuff.
        
       | throwaway787544 wrote:
       | Snowflake is the future. We shouldn't be writing code at all, we
       | should be throwing data into a big hole in the ground and then
       | querying the hole, or attaching a query to the hole so as data
       | goes in the query does something with it, and chaining those
       | queries. The fact that anyone is writing anything more complex
       | than SQL to do this is a failure of imagination. Snowflake is
       | intended to remove all the unnecessary engineering from you just
       | putting data somewhere and doing something useful with it easily.
        
       | rldjbpin wrote:
       | Having been forced to work with an obsolete version of Airflow at
       | work, I can attest to how narrow-minded the project's focus was
       | when it was originally created. The scheduling quirks and UTC
       | defaults are enough to paint the picture here.
       | 
       | Not completely sure if most of the issues I've faced were
       | resolved in the future releases, but I don't fully agree with the
       | take of the article. Like go with the scheduler that works for
       | your current and potential future needs. The reason why we
       | continue to use Airflow despite the issues is because it works so
       | well with our workflows. This does mean that I would recommend it
       | to another team.
        
       | llambda wrote:
       | To address a point the author makes: I'm entirely unconvinced the
       | "shift left" mentality of data democracy (aka business operators
       | should write sql) is actually shifting left or a worthy path to
       | pursue for most businesses. More recently this 2010s fad seems to
       | be dying, and in its place we're seeing centralized data efforts
       | that produce data products.
       | 
       | One of the most significant pitfalls of data is failing to
       | interrogate the value it provides and assuming that if you give
       | everyone access all the time the magic will happen. The truth is
       | value does not simply materialize just as value does not
       | magically spring from computers by a human powering it on (okay
       | sure, you may have already automated the value but that's
       | actually the point I'm about to make). In both cases it requires
       | an experienced practitioner who collaborates with a larger team
       | to intersect their work with the business needs.
       | 
       | Data is tricky, all the more so because it's often seen as a
       | panacea by business leaders who aren't connected with the work of
       | extracting that value.
        
         | itsoktocry wrote:
         | > _that if you give everyone access all the time the magic will
         | happen_
         | 
         | There's much ongoing discussion about this in the data world,
         | often revolving around "self-service analytics".
         | 
         | Unless you're talking about "our analysts don't have to clean
         | data all the time", which, for a large enough organization
         | makes sense, "self-service" for non-technical folks is futile
         | and pointless. They need specific answers to specific
         | questions, not the ability to infinitely explore the data.
         | Organizations should _desire_ that kind of focus, not prevent
         | it.
        
           | datavirtue wrote:
           | The idea was that they were going to hire an army of data
           | scientists and become Google...magically.
           | 
           | Reality smacked that shit down hard. I left data engineering
           | because the projects were all over the place, wildly
           | undisciplined and unfocused.
           | 
           | You were lucky to have source control let alone an
           | understanding from the business that these projects were in
           | fact software development.
           | 
           | I switched back to software engineering because at least
           | there is a faint realization that we are...building software.
           | 
           | I might go back when the dust clears.
           | 
           | "Why do we need to hire programmers...I thought we needed
           | data engineers?"
           | 
           | "Because the data pipelines are all built with thousands of
           | lines of code. Java, python, Fortran, you name it...and your
           | job post only mentioned SQL and data modelling"
           | 
           | I could go on forever.
        
           | mason55 wrote:
           | This is the constant argument I have with people about data
           | products.
           | 
           | You don't need to expose more dimensions or get the users
           | more access to the raw data. You need to understand what
           | their business is and what their business problems are and
           | help them answer those specific questions quickly and
           | succinctly.
           | 
           | Yes, there are certainly times where people use huge amounts
           | of raw data to uncover the answer to a question they didn't
           | know they had. But it's rare, it's expensive to support, and
           | most businesses aren't going to be able to do anything with it
           | anyway (a whole org built to do X isn't suddenly going to
           | shift to do Y because you discovered some insight in a random
           | report).
        
         | abirch wrote:
         | I've seen data errors because of joins and aggregations. Data
         | democratization can be a net negative, especially if people
         | don't question the graphs they see.
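         | 
         | A minimal sketch of one way a join can silently inflate an
         | aggregate (the tables, columns, and numbers are made up for
         | illustration): joining a one-row-per-order table to a
         | many-rows-per-order table before summing counts some orders
         | more than once.

```python
import sqlite3

# In-memory database with hypothetical tables, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 shipped in two parcels, so it matches two shipment rows.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Ground truth: sum over the orders table alone.
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Joining before aggregating repeats order 1 once per shipment row.
inflated_revenue = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

# Collapsing shipments to one row per order first avoids the fan-out.
fixed_revenue = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN (SELECT order_id, COUNT(*) AS parcels
          FROM shipments GROUP BY order_id) s
      ON s.order_id = o.order_id
""").fetchone()[0]

print(true_revenue, inflated_revenue, fixed_revenue)
```

         | A graph built on the joined query would show 250 instead of
         | 150, and nothing in the chart itself hints at the error.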
        
         | mumblemumble wrote:
         | With all credit due to Google's excellent and under-appreciated
         | paper _Machine Learning: The High Interest Credit Card of
         | Technical Debt_ [1], I submit that Big Data is the high
         | interest home equity line of credit of business operations
         | debt.
         | 
         | It's not that big data tools aren't useful. It's that, when you
         | just start amassing huge piles of data without a clear up-front
         | plan for how it will be used, and assume that a whole bunch of
         | people who have never heard of sampling bias or multiple
         | comparisons bias or Coase's Law [2] can figure out what to do
         | with it later, you're setting yourself up for a Bad Time.
         | 
         | [1] https://research.google/pubs/pub43146/
         | 
         | [2] "If you torture the data long enough, it will confess."
        
           | htrp wrote:
           | > I submit that Big Data is the high interest home equity
           | line of credit of business operations debt.
           | 
           | I like this but it's kinda like the payday loan of business
           | operations.
        
           | abirch wrote:
           | I'd say that Big Data is the Collateralized Debt Obligations
           | of business operations. It looks fabulous from afar but it
           | can blow things up quickly if there's no understanding of the
           | internals.
        
           | systemvoltage wrote:
           | Yet, we abide by data-oriented conclusions outside of
           | software engineering all the time. From academic papers to
           | FDA to crime statistics.
        
             | mumblemumble wrote:
             | I won't say any of those are perfect. But there's at least
             | a little more effort toward responsible data analysis in
             | academia. The FDA brings an interesting example to mind.
             | Take a look at how, on paper, drugs suddenly magically
             | became less effective when the FDA started requiring
             | clinical trial pre-registration in 2007.
             | 
             | It's also worth noting that, over the past few decades,
             | most academic fields have been getting increasingly
             | skeptical of the value of correlative research on pre-
             | existing data sets. Even among people who have been
             | extensively trained in how to do it properly. And yet, the
             | vast majority of big data business plans I've seen in
             | practice boil down to "collect a huge data set and then let
             | people do correlative research on it."
        
               | systemvoltage wrote:
               | Agreed, I want more scrutiny than some entity flashing
               | "Here is the data". It can easily be exploited behind the
               | veneer of data-based-credibility.
        
       | jon_adler wrote:
       | Is there any love for the Argo [1] project suite (Workflows,
       | Events, CD) for this type of use case? I haven't tried it out
       | myself yet however it does look interesting.
       | 
       | [1] https://argoproj.github.io
        
         | ricklamers wrote:
         | Argo is pretty amazing if you want to take advantage of the
         | work Kubernetes has done to scale resources efficiently across a
         | cluster of compute nodes.
         | 
         | If you're looking for something that's a bit more high level
         | and friendly to expose directly to your data team (data
         | scientists/data engineers/data analysts) you can check out
         | https://github.com/orchest/orchest
         | 
         | You can think of it as a browser UI/workbench for Argo
         | scheduled pipelines. Disclaimer: author of the project
        
         | ForHackernews wrote:
         | I've never used their workflows thing, but having been forced
         | to live with ArgoCD it sounds horrifying.
         | 
         | Argo is another over-engineered "CNCF" thing trying to ride the
         | Kubernetes hype train. It's all "eventually consistent", which
         | makes it extraordinarily difficult to see when any particular
         | thing actually happened. Is my code deployed? Who knows, Argo
         | is "syncing".
         | 
         | Check out these great docs: https://argoproj.github.io/argo-
         | workflows/rest-api/
         | 
         | > API reference docs :
         | 
         | > Latest docs (maybe incorrect)
         | 
         | > Interactively in the Argo Server
         | UI.<https://localhost:2746/apidocs> (>= v2.10)
         | 
         | Yes, that is a localhost URL on their website.
        
           | robertlagrant wrote:
           | How do you know if anything is deployed if it hasn't come
           | back and confirmed it's deployed? Manual only?
        
       | saltmeister wrote:
        
       | biellls wrote:
       | I have my own opinion on Airflow's pain points and created
       | Typhoon Orchestrator (https://github.com/typhoon-data-
       | org/typhoon-orchestrator) to solve them. It doesn't have many
       | stars yet but I've used it to create some pipelines for medium
       | sized companies in a few days, and they've been running for over
       | a year without issues.
       | 
       | In particular I transpile to Airflow code (can also deploy to
       | Lambda) because I think it's still the most robust and well
       | supported "runtime", I just don't think the developer experience
       | is that good.
        
       | jdoliner wrote:
       | I was at Airbnb when we open-sourced Airflow, it was a great
       | solution to the problems we had at the time. It's amazing how
       | many more use cases people have found for it since then. At the
       | time it was pretty focused on solving our problem of
       | orchestrating a largely static DAG of SQL jobs. It could do other
       | stuff even then, but that was mostly what we were using it for.
       | Airflow has become a victim of its success as it's expanded to
       | meet every problem which could ever be considered a data
       | workflow. The flaws and horror stories in the post and comments
       | here definitely resonate with me. Around the time Airflow was
       | open-sourced I started working on a data-centric approach to
       | workflow management called Pachyderm[0]. By data-centric I mean
       | that it's focused around the data itself, and its storage,
       | versioning, orchestration and lineage. This leads to a system
       | that feels radically different from a job focused system like
       | Airflow. In a data-centric system your spaghetti nest of DAGs is
       | greatly simplified as the data itself is used to describe most of
       | the complexity. The benefit is that data is a lot simpler to
       | reason about, it's not a living thing that needs to run in a
       | certain way, it just exists, and because it's versioned you have
       | strong guarantees about how it can change.
       | 
       | [0] https://github.com/pachyderm/pachyderm
        
         | carlsborg wrote:
         | Cool. Are there any published benchmarks on how the data
         | versioning engine scales?
        
       | FridgeSeal wrote:
       | > Shift 1: "We know the lineage" to "We know what in god's name
       | is happening"
       | 
       | Bro I can't even get my company to the _first_ part, and we're
       | collectively already having issues with the second? What is
       | everyone else's read on this situation in general? Do you all
       | have row and table level lineages for your data? For pipelines
       | that people are actively using? Every company I've ever been in
       | can hardly figure out where finance gets last year's "magical
       | excel sheet", let alone be close to a spot where they're actively
       | using data lineage tools.
       | 
       | I also don't like Airflow, but for somewhat different reasons.
       | 
       | I think it couples orchestration and transformation too tightly,
       | I don't understand the desire to integrate everything with your
       | actual runtime Python code - I think it's markedly the wrong
       | level of abstraction/integration and limits your engineering
       | capacity. There's undoubtedly some good engineering, it's come a
       | long way, and it's mighty popular, but every time I look at a
       | repo that uses it, the only read I get is "cross-cutting-chaos".
        
         | OrangeMonkey wrote:
         | In some fields it's more important than others.
         | 
         | In life sciences research to support synthetic control arms,
         | the FDA is caring more about the lineage/manipulation of the
         | data than the data science models used to predict X/Y/Z.
         | 
         | IE - what was the data originally, what did it end up as prior
         | to ingestion into AIML, why was it changed, what steps were
         | involved, etc.
         | 
         | There are not a ton of good out of the box solutions for data
         | lineage and it's driving me nuts.
         | 
         | We have Apache NIFI which promises data lineage out of the box
         | and _appears_ to deliver. I've never implemented it though.
         | 
         | We have pachyderm which has some support here but I don't know
         | about it.
         | 
         | Besides that it appears roll-your-own.
         | 
         | I kind of wish there was an accepted best practice for data
         | lineage but it's - surprisingly - the wild west. And it's
         | completely 100% required for industry use.
        
           | tomrod wrote:
           | DBT does pretty well?
        
       | dominotw wrote:
       | Just put everything in a warehouse( via fivetran or some such
       | thing) and just use DBT.
       | 
       | Use airflow as cron runner for dbt.
       | 
       | If you don't need realtime metrics, this formula works way better
       | than convoluted airflow dags.
        
         | ricklamers wrote:
         | We have had a great experience scheduling Meltano/dbt inside
         | Orchest for our Metabase dashboards. As a pattern, combining
         | these declarative/configuration CLI tools with a flexible
         | orchestration layer (Orchest can run any containerized task, it
         | will containerize transparently for you) really shines.
        
       | windows_sucks wrote:
       | The problem I've had with Airflow is that it tries to do way too
       | much: UI/logging/config management
       | 
       | I've really enjoyed using taskflow
       | (https://github.com/taskflow/taskflow) it allows us to employ our
       | existing logging and deployment paradigms.
        
       | llbeansandrice wrote:
       | I'm kinda meh on this article, but it did lead me to this
       | goldmine[1]. We don't use XCOMs at all so a lot of these aren't
       | applicable but other parts absolutely are. We run Airflow at a
       | pretty massive scale and not all of these boundaries were
       | enforced so now it's a huge mess.
       | 
       | [1] https://towardsdatascience.com/apache-airflow-
       | in-2022-10-rul...
        
       | blakeburch wrote:
       | I really agree with Shift 2 ("We unblock analysts" to "We enable
       | everyone"). The problem is that Airflow (and most other OSS
       | orchestrators) are overkill for the majority of data
       | practitioners. They lock workflow development into Python,
       | forcing you to mix platform logic with executional business
       | logic. The complexity to get started building workflows is too
       | high, infrastructure challenges always crop up, and the system
       | itself is a black box for anyone non-technical.
       | 
       | > The tool data engineers need to be effective in this new world
       | does not run scripts, it organizes systems.
       | 
       | 100%. You'll still need to run independent scripts, but today's
       | data challenges focus on "how do I connect the stages of data
       | operations together". Teams need to figure out how to connect data
       | ingestion -> data transformation -> data visualization -> alerting
       | and reporting -> ML model deployment -> metadata + catalogs -> data
       | augmentation -> API actions.
       | The larger goal of orchestration is to prevent downstream
       | processes from running if the data being processed upstream
       | fails. Each stage could be performed with a series of scripts, a
       | SaaS tool, or a mix. Each team is responsible for their own
       | stages, but they need to know how their work connects to the
       | larger picture so when something goes wrong, there's ownership
       | and clarity that drives a quick resolution. Unfortunately, this
       | still doesn't exist in most organizations because the current
       | tooling isn't solving the orchestration and visualization of
       | connected systems super effectively. It's instead enabling one-
       | off, disconnected data processes.
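       | 
       | As a rough sketch of that goal (not how Airflow, Shipyard, or
       | any particular tool implements it; the stage names and the
       | run_pipeline helper are hypothetical), a runner can walk the
       | stages in dependency order and skip anything downstream of a
       | failure:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(dependencies, tasks):
    """Run stages in dependency order; skip anything downstream of a failure.

    dependencies maps each stage to the set of stages it depends on.
    tasks maps each stage name to a zero-argument callable.
    """
    status = {}
    for stage in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(stage, set())
        if any(status[dep] != "success" for dep in upstream):
            # An upstream stage failed or was skipped, so do not run this one.
            status[stage] = "skipped"
            continue
        try:
            tasks[stage]()
            status[stage] = "success"
        except Exception:
            status[stage] = "failed"
    return status


def failing_transform():
    raise RuntimeError("transformation failed")


# Hypothetical stages loosely following the chain described above.
dependencies = {
    "transform": {"ingest"},
    "dashboard": {"transform"},
    "alerting": {"dashboard"},
    "catalog": {"ingest"},
}
tasks = {
    "ingest": lambda: None,
    "transform": failing_transform,
    "dashboard": lambda: None,
    "alerting": lambda: None,
    "catalog": lambda: None,
}

result = run_pipeline(dependencies, tasks)
for stage, state in sorted(result.items()):
    print(stage, state)
```

       | Here the failed transform stops the dashboard and alerting
       | stages, while the catalog stage, which only depends on
       | ingestion, still runs. The status map is also the kind of
       | shared view that tells each team where a break originated.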
       | 
       | Disclaimer: I built Shipyard (www.shipyardapp.com) to address
       | many of these concerns of simplifying the ability to connect data
       | tools and quickly automate and action on data.
        
       | rotten wrote:
       | I've recently had the sales team from Magniv.io pestering me to
       | try it as an alternative to Airflow with a "shift left"
       | perspective for automating jobs. I wasn't convinced enough by the
       | value prop to dive deeper. I think it was a language problem - I
       | was just having trouble understanding and relating to the
       | problems they solve and then figuring out whether or not I have
       | those problems too.
       | 
       | I'm running into the same issue with this guy's post, although a
       | little less so. The question he seems to ask is "With a complex
       | pattern of data flows, if something breaks, how do you recover?"
       | His argument is that Airflow does not offer enough visibility
       | into the full data trace nor enough tools to apply recovery rules
       | for repairing broken bits.
       | 
       | I think I agree, but Prometheus doesn't really solve that. Nor
       | necessarily does better management of the automated job queue
       | backlog and job retries.
       | 
       | He also complains about some syntax and design choices that
       | predate MyPy and Pydantic and modern Async Python coding. Those
       | seem fairly easy things to drag Airflow forward with in future
       | releases.
        
       | zentrus wrote:
       | Airflow helped my team out a lot a couple years ago mainly for
       | the simplicity of the topdown UI-based view of a complicated ETL
       | AND the ability to retry parts of the ETL.
       | 
       | We had lots of lessons learned. For instance, why does
       | PythonOperator even exist? It takes a callable and thus you're
       | likely not going to see good coding patterns emerge for something
       | that needs to be 1000+ LoC. Instead, we just subclassed
       | BaseOperator and used tried-and-true OO principles.
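       | 
       | A hedged sketch of the subclassing approach. The BaseOperator
       | class below is only a stand-in so the example runs without
       | Airflow installed (in Airflow you would subclass the real
       | BaseOperator and override its execute(context) method), and
       | CleanOrdersOperator with its methods is hypothetical:

```python
class BaseOperator:
    """Stand-in for Airflow's BaseOperator so this sketch runs on its own."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class CleanOrdersOperator(BaseOperator):
    """Hypothetical operator: each ETL step is a small, separately testable method."""

    def __init__(self, task_id, rows, min_amount):
        super().__init__(task_id)
        self.rows = rows
        self.min_amount = min_amount

    def extract(self):
        # A real operator would read from a source system here.
        return list(self.rows)

    def transform(self, rows):
        # Keep only rows at or above the configured threshold.
        return [row for row in rows if row["amount"] >= self.min_amount]

    def load(self, rows):
        # A real operator would write to a warehouse; here we just count rows.
        return len(rows)

    def execute(self, context):
        # The entry point stays short because the logic lives in the methods.
        return self.load(self.transform(self.extract()))


op = CleanOrdersOperator(
    "clean_orders", [{"amount": 5}, {"amount": 50}], min_amount=10
)
loaded = op.execute(context={})
print(loaded)
```

       | The point of the structure is that extract, transform, and
       | load can each be unit tested without a scheduler, which is
       | harder when everything lives in one large callable.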
        
       | geertj wrote:
       | We tried to set up Airflow in our team in the past. The big
       | problem we encountered is that its unit of management (I believe
       | it's called a "job" but I'm rusty on this) is too low level. Our
       | pipeline processes a lot of data and we have millions of jobs per
       | day. Once Airflow has a (planned or unplanned) outage, tens of
       | thousands of jobs start piling up, and it never recovers from
       | that.
       | 
       | In the end we replaced our data orchestration with a stateless
       | lambda that for a configured time interval 1/ looks at what
       | output data is missing, 2/ cross-references that with running
       | jobs (in AWS Batch), and 3/ submits jobs for missing data that has
       | no job. Jobs themselves are essentially stateless. They are never
       | restarted and we don't even look at their status. If one fails we
       | notice because there will be a hole in the output and we
       | therefore submit a new one. Some safety precautions are added to
       | prevent a job from repeatedly failing, but that's the exception.
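       | 
       | A minimal sketch of such a reconciliation pass, under
       | assumptions: hourly output partitions, plain sets standing in
       | for the listing of existing output and of running AWS Batch
       | jobs, and a simple attempt cap standing in for the unspecified
       | safety precautions. All names are hypothetical and nothing
       | here calls AWS:

```python
from datetime import datetime, timedelta, timezone


def expected_partitions(start, end, step):
    """Enumerate the output partitions that should exist in [start, end)."""
    current = start
    while current < end:
        yield current
        current += step


def plan_submissions(expected, existing_outputs, running_jobs, attempts,
                     max_attempts=3):
    """Return partitions with no output, no running job, and attempts left."""
    to_submit = []
    for partition in expected:
        if partition in existing_outputs:
            continue  # output already exists, nothing to do
        if partition in running_jobs:
            continue  # a job is already working on this hole
        if attempts.get(partition, 0) >= max_attempts:
            continue  # assumed safety cap: leave it for manual investigation
        to_submit.append(partition)
    return to_submit


# Example: six hourly partitions, two done, one in flight, one over the cap.
start = datetime(2022, 8, 2, 0, tzinfo=timezone.utc)
hours = list(
    expected_partitions(start, start + timedelta(hours=6), timedelta(hours=1))
)
existing = {hours[0], hours[1]}
running = {hours[2]}
attempts = {hours[3]: 3}

todo = plan_submissions(hours, existing, running, attempts)
print([p.strftime("%H:%M") for p in todo])
```

       | Because the plan is recomputed from the observed outputs on
       | every pass, a failed job needs no status tracking: its
       | partition simply shows up as missing again next time.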
       | 
       | Maybe Airflow has moved on from when we last tried it. But this
       | was our experience.
        
         | peteradio wrote:
         | What do you do for repeated failures? Does it get flagged for a
         | manual debug or does it kick into a different mode of
         | automation?
        
           | geertj wrote:
           | We notice repeated failures because we have metrics on our
           | "up to dateness", and those metrics will stall. We also send
           | logs to CloudWatch logs and alarm on a certain threshold of
           | errors. Once an alarm fires, we investigate manually and see
           | why the job is failing. This happens occasionally but not too
           | much. While we are investigating, we are spinning up repeat
           | jobs with some frequency, but this hasn't proved to be a
           | problem.
        
         | hatware wrote:
         | > The big problem we encountered is that its unit of management
         | (I believe it's called a "job" but I'm rusty on this) is too
         | low level. Our pipeline processes a lot of data and we have
         | millions of jobs per day. Once Airflow has a (planned or
         | unplanned) outage, tens of thousands of jobs start piling up, and
         | it never recovers from that.
         | 
         | That sounds more like an architecture-at-scale problem than
         | something that is Airflow's 'fault.' Airflow may never have
         | been the right tool for the job but it's getting all the blame.
        
       ___________________________________________________________________
       (page generated 2022-08-02 23:01 UTC)