[HN Gopher] Airflow's Problem
___________________________________________________________________
Airflow's Problem
Author : cloakedarbiter
Score : 231 points
Date : 2022-08-02 12:10 UTC (10 hours ago)
(HTM) web link (stkbailey.substack.com)
(TXT) w3m dump (stkbailey.substack.com)
| jxi wrote:
| I've had good success with https://dagster.io, which is much more
| opinionated about your pipelines, including properly typing
| inputs and outputs.
| calebm wrote:
| > If it sounds like you could simply replace Airflow with
| basically any other job execution engine, that's because you
| could.
|
| Has anyone tried Luigi for data engineering pipelines?
| datascientist wrote:
| Recent perspectives from the creators of Prefect, Dagster, Flyte,
| and Orchest => https://gradientflow.com/summer-of-orchestation/
| claytonjy wrote:
| Seems to be missing Temporal/Cadence, which I'm very excited
| about, but I've never heard of Flyte or Orchest.
| ciguy wrote:
| My current company has been having a lot of success with Dagster.
| It seems to give a lot more flexibility than Airflow in terms of
| defining the pipeline and where to run it. It's also a bit
| friendlier when things fail and need to be backfilled or retried
| IMO. Airflow feels like it's somewhat legacy at this point in
| time. It served a need well but the needs have changed now.
| ForHackernews wrote:
| Airflow is super clunky, but it gets the job done (mostly).
|
| I'm kind of a fan of Prefect as an alternative:
| https://docs-v1.prefect.io/core/about_prefect/why-not-airflo...
| didip wrote:
| The problem with airflow is its scalability (or lack thereof) and
| DAG dependency management.
|
| Airflow successors must figure out how to distribute the cron and
| all dependencies should be self contained in a Docker image.
| pdinny wrote:
| The post feels like a bait-and-switch in the sense that it
| presents itself as about Airflow's shortcomings but focuses
| mostly on problems that Airflow doesn't attempt to solve.
|
| Airflow can certainly be frustrating and it doesn't solve _all_
| workflow orchestration problems. Surely the same thing can be
| said of many tools? This seems mostly like a mismatch of
| expectations.
| cturner wrote:
| When we engage in complex work, it is important to keep our
| options open so we can direct our best effort to the hardest
| problems.
|
| It is rarely clear what the hard problems will be when new to a
| domain. Only as scale kicks in.
|
| We are constantly pitched frameworks that sell themselves as a
| good approach to a domain, but then obstruct engagement with
| the hardest problems when it matters. The developer becomes
| captive of the system that claimed it would steer them right.
|
| This is particularly true of fields where the hard problems are
| integration problems which, by their nature, cannot be
| outsourced to frameworks.
| MontyCarloHall wrote:
| 100% agreed. By analogy, an article titled "pthreads' problem"
| should be about shortcomings in the POSIX multithreading model,
| not an article saying that the implementation of machine-level
| parallelism is irrelevant because Kubernetes exists.
| evrydayhustling wrote:
| FWIW the author is pretty direct about this. After the cute
| beginning, he basically says that his problem is with Airflow's
| scope, not its execution.
| pdinny wrote:
| Hence the bait and switch. IMHO increasing the scope of
| Airflow (or any tool) is a challenging proposition. Would you
| rather use very few mega-tools with very broad scope (and
| potentially more challenging domain to navigate) or fewer
| more specialised tools that interoperate well together?
|
| Obviously there are trade-offs with either approach, but then
| I'd argue that making Airflow solve more problems will
| introduce more trade-offs too.
| peteradio wrote:
| Having a poor scope is a problem because people will just
| choose not to use you.
| glogla wrote:
| This is why I thought the shift from "orchestrate jobs" to "keep
| track of state of assets" that Dagster is trying to do is pretty
| important. But it seems it might not be enough - it still keeps a
| clunky (pythonic) interface and I don't know how well it does
| multi-tenancy.
| sethjr5rtfgh wrote:
| Could you clarify what "keep track of state of assets" means?
| geoffjentry wrote:
| It's a mindset shift to a more declarative model. The idea
| has also popped up in other niche orchestrators.
|
| This is an oversimplification, but IMO the easiest way of
| picturing it is: instead of defining your graph as a
| forward-moving thing, with the orchestrator telling things when
| they can run, you define your graph _nodes_ so that they know
| their dependencies, and they let the orchestrator know when
| they're runnable.
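|
| A minimal sketch of that declarative idea in plain Python (this is
| not Dagster's API, and the asset names are made up): each node only
| names what it depends on, and the run order is derived from those
| declarations with the standard library's graphlib.

```python
from graphlib import TopologicalSorter

# Each asset declares only its upstream dependencies (hypothetical names).
ASSETS = {
    "raw_orders": set(),
    "raw_users": set(),
    "clean_orders": {"raw_orders"},
    "orders_by_user": {"clean_orders", "raw_users"},
}


def materialization_batches(assets):
    """Yield sorted batches of assets whose dependencies are all satisfied."""
    sorter = TopologicalSorter(assets)
    sorter.prepare()  # raises CycleError if the declarations contain a cycle
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes that are runnable right now
        yield ready
        sorter.done(*ready)  # marking them done unlocks their dependents


batches = list(materialization_batches(ASSETS))
print(batches)
```

| Nothing here says "run A, then B"; the order falls out of what each
| node says it needs, which is roughly the mindset shift described
| above.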
| glogla wrote:
| https://www.dagster.io/blog/software-defined-assets
|
| Of course, nothing stops Airflow or other tools from thinking
| this way as well.
| foxbee wrote:
| Like a lot of software, it's matured and along the way has put on
| some weight. I still love it for certain use cases, but tools
| like Dagster are piquing my interest.
| htahir111 wrote:
| Wouldn't it make sense to decouple the orchestration layer from
| the authoring layer for the dags? That way you could solve the
| authoring problems separately from the lower level orchestration
| problems. We're trying this over at ZenML (https://zenml.io) but
| have yet to get feedback
| stkbailey wrote:
| Author here - appreciate the comments and reads. To add a bit of
| color -- I spent about a month looking into orchestrators to
| migrate Whatnot's data platform onto earlier this year, and it
| was a miserable experience. We were on AWS Managed Airflow, but
| to stay on it and have a solid platform, I would have been
| writing Github Actions for CI/CD, standing up ECR and IAM roles
| with Terraform, setting up EKS to run Kubernetes jobs, managing
| infra monitoring with Datadog, etc., etc.
|
| In fact, I did end up doing all those things, but we opted for
| Dagster Cloud, because of their focus on improving developer
| efficiency. Their team provided pre-built Github actions for
| CI/CD and recently introduced PR-specific branch deployments,
| which has been amazing. They're moving towards serverless
| execution, built-in ECR repositories, managed secrets. Prefect
| and Astronomer I expect are moving in this direction, too, but I
| liked the Dagster project's energy quite a bit.
|
| As I've waded into the MLOps world as well, it just keeps looking
| like every platform basically devolves into : an orchestrator
| that provisions compute resources and logs metadata into an
| opinionated data model. Catalog tools like Atlan are metadata
| sinks that are trying to build out orchestration/workflow
| capabilities. dbt Cloud of course is just an orchestrator for a
| specific type of data product that is aiming to operationalize
| metadata with its metrics layer.
|
| Orchestration + a metadata data model is a common denominator
| here, and I think the fact that Airflow is so inevitable has made
| it really hard for people to imagine the category as anything
| other than a scheduler, but perhaps some of these new companies
| can break new ground.
| portable_hotpot wrote:
| In your investigation did you try out Flyte at all?
| stkbailey wrote:
| Nope -- just MWAA, Astronomer, Dagster Cloud, and Prefect
| Cloud. In the past I used Argo Workflows pretty extensively
| and have talked about its pros and cons here:
| https://www.youtube.com/watch?v=-cyr_kL-9fc
| awinder wrote:
| "We were on AWS Managed Airflow, but to stay on it and have a
| solid platform, I would have been writing Github Actions for
| CI/CD, standing up ECR and IAM roles with Terraform, setting up
| EKS to run Kubernetes jobs, managing infra monitoring with
| Datadog, etc., etc."
|
| DAGs can be published to S3 for cutting down on like half of
| these dependencies. And the nice thing about MWAA is log &
| stats publishing over cloudwatch, which should flow into any
| existing amazon integrated tooling.
|
| For our team setting up terraform for iam & mwaa, some deploy
| pipelines to s3, and connecting some config bits to wire up
| splunk logs / monitoring pieces was not that much work.
| Initiating a separated vendor relationship & pricing out data
| ingress/egress costs would blow that work out of the water but
| maybe it's a difference in company size/placement.
| jw887c wrote:
| >We were on AWS Managed Airflow, but to stay on it and have a
| solid platform, I would have been writing Github Actions for
| CI/CD, standing up ECR and IAM roles with Terraform, setting up
| EKS to run Kubernetes jobs, managing infra monitoring with
| Datadog, etc., etc.
|
| This sounds like an issue not with Airflow but with
| integration.
| stkbailey wrote:
| Yep, that's what I tried to point out in the article.
| theptip wrote:
| Thanks for the experience report - I have Dagster and Prefect
| on my shortlist to evaluate next time I need to build this, and
| Dagster seems the most promising, so it's good to get another
| datapoint.
|
| One Q - it seems to me that another possible solve (and
| probably how the big guys tend to do it) is to use a dataflow
| engine like Spark/Flink. Did you compare a managed platform
| like Google Dataproc? They also have serverless if you don't
| want a heavy managed cluster, which might make this approach
| more viable for non-huge companies that wouldn't utilize a min-
| spec cluster. (When I last evaluated this they didn't have
| serverless which was a dealbreaker for my small scale).
| chazeon wrote:
| I need to deal with a system with complex dependency relationships.
| The system needs to stop executing when some step fails. I used to
| look at Airflow. I remember at the time it was the first result
| if you Googled DAG something. But it has the same problem as other
| data-engineer-centric systems: it was somewhat over-engineered and
| it relies too much on a central server. Its heavy reliance on a
| Python runtime is also something I don't like; it feels like
| everything else is a second-class citizen. The nature of our job
| is that we have to deal with complex legacy code: some Fortran
| programs, some that need a conda environment, etc. Later I found
| that what I should look at is just some makefile-like system, and I
| settled on Snakemake, which has a nice DSL that forces you to be
| explicit about inputs, outputs, etc. Probably Airflow is just not
| the right tool for a one-man team.
| tchiotludo wrote:
| All the issues described in this post lead me to create Kestra
| [0] . Airflow was a true revolution when it was open-sourced, and
| we need to thank it for its innovation. But I totally agree that a
| large static DAG is not appropriate in today's data world, with data
| mesh and domain responsibility.
|
| [0] https://github.com/kestra-io/kestra
| dusted wrote:
| barbecue_sauce wrote:
| Is an Apache open source project considered a hipster-corp?
| lopatin wrote:
| So to confirm, you are annoyed that the name of this project is
| a word.
| dusted wrote:
| Definitely. I'd like the name of this project to be changed
| to be a name, not a common word. Thanks.
| lopatin wrote:
| I'd suggest to raise the issue in the devlist or on GitHub
| Issues to get more visibility.
| dusted wrote:
| I'd be okay with the article headline on HN being "Apache
| Airflow's Problem" so that I know it's about a piece of
| apache software and not something interesting about
| airflow.
| tomwheeler wrote:
| The Apache Software Foundation would also prefer this,
| per their trademark policy:
| https://www.apache.org/foundation/marks/faq/#guide
| lopatin wrote:
| I think context matters and the title "Airflow's Problem"
| doesn't make much sense when talking about the physical
| phenomenon of flowing air.
| morelisp wrote:
| Not really, the past two years have seen probably an order
| of magnitude increase in interest and articles about
| industrial-scale ventilation and air quality.
| lopatin wrote:
| You still wouldn't phrase it like "Airflow's Problem"
| because the concept of airflow is incapable of having a
| problem. It just exists. Some theories _about_ airflow
| may have problems, but then you would specify that in the
| title.
| orthoxerox wrote:
| A few years ago a new guy at our DWH team tried to sell Airflow
| to the rest of the team. They invited me to listen to his talk as
| well, and I was baffled why something so trivial as Airflow was
| being sold as a critically important piece of infrastructure.
|
| Why would I need a glorified server-side crontab if something
| like MS DTS from 1998 could do the same, but better? Sure, Python
| is probably better than whatever DTS generated, but the ops don't
| care either way, since Airflow doesn't care what it's running.
|
| Something as simple as "job A must run after job B and job C, but
| if it doesn't start by 2am, wake up team X. If it doesn't finish
| by 4am, wake up team Y" isn't Airflow's problem, it's your
| problem.
|
| "What's the overall trend for job D's finish time, what is the
| main reason for that?" isn't Airflow's problem, it's your
| problem. "What jobs are on the critical path for job E?" isn't
| Airflow's problem, it's your problem.
|
| "Job F failed for date T and then recursively restart everything
| that uses its results for date T" isn't Airflow's problem, it's
| your problem.
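|
| For reference, that last case is mostly a graph traversal. A rough
| sketch in plain Python (not Airflow's API; the job names and the
| dependency map are hypothetical) of finding everything that has to
| be restarted for date T once job F fails:

```python
from collections import deque

# Hypothetical map: job -> jobs that consume its results.
DOWNSTREAM = {
    "F": ["G", "H"],
    "G": ["I"],
    "H": ["I"],
    "I": [],
    "J": [],
}


def jobs_to_restart(failed_job, downstream):
    """Return the failed job plus all transitive consumers, in BFS order."""
    seen = {failed_job}
    order = [failed_job]
    queue = deque([failed_job])
    while queue:
        job = queue.popleft()
        for consumer in downstream.get(job, []):
            if consumer not in seen:  # visit each job once, even in diamonds
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


restart = jobs_to_restart("F", DOWNSTREAM)
print(restart)
```

| This only identifies the affected set; actually re-running the jobs
| in a safe order would still need a topological sort over that set.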
| coldtea wrote:
| > _and I was baffled why something so trivial as Airflow was
| being sold as a critically important piece of infrastructure._
|
| https://news.ycombinator.com/item?id=9224
|
| > _Something as simple as "job A must run after job B and job
| C, but if it doesn't start by 2am, wake up team X. If it
| doesn't finish by 4am, wake up team Y" isn't Airflow's problem,
| it's your problem._
|
| I guess that's one approach to job security. And why not make
| data egress manual too? Why transfer data through the network,
| when you can print them, mail the papers, and type them back
| in? Data input is not the computer's problem, it's your
| problem!
|
| > _Something as simple as "job A must run after job B and job
| C, but if it doesn't start by 2am, wake up team X. If it
| doesn't finish by 4am, wake up team Y" isn't Airflow's problem,
| it's your problem. "What's the overall trend for job D's finish
| time, what is the main reason for that?" isn't Airflow's
| problem, it's your problem. "What jobs are on the critical path
| for job E?" isn't Airflow's problem, it's your problem. "Job F
| failed for date T and then recursively restart everything that
| uses its results for date T" isn't Airflow's problem, it's your
| problem._
|
| The whole idea of writing programs is making things
| automatable. That is, making them the computer's problem, not
| our problem. We get the higher level problem of writing the
| automation once, and fixing any bugs in our code, then we get
| to enjoy putting it to work for us...
| robertlagrant wrote:
| > isn't Airflow's problem, it's your problem
|
| This is a baffling statement.
| Hippocrates wrote:
| I despise airflow and how cemented it is as data infrastructure.
| It's such a useful and basic concept but a nightmare to manage, and
| it works like junk. It's taken me 3 separate jobs over 7 years to
| realize that it's probably not our fault. Everyone seems to
| struggle with the same things: flaky scheduler that is slow to
| run tasks, confusing and redundant sounding settings that apply
| at up to three different levels (environment, job, task). It
| invites less experienced users to write a sea of spaghetti code
| in a monolithic DAGs repo. People wind up doing heavy data
| munging in python operators, which clobbers scalability and
| reliability. It also can't handle a large number of parallel
| tasks or frequent runs. It seems to have miserable scalability
| for the resources given, and bad controls for auto scaling. The
| UI feels dated and unintuitive. XComs seem useful to everyone but
| work like crap and are actually an anti-pattern.
|
| I've also tried it on Cloud Composer (google managed) and
| automated upgrades always trashed the cluster. It's not well
| designed for GKE because it writes logs to files and requires
| stateful sets. Testing the code is a huge burden due to the vast
| environment and dependencies needed to make it work locally.
|
| I'm eager to rid my life of it and test out temporal for some of
| the high concurrency/frequency cases we have.
| zibarn wrote:
| Not experienced here but as a genuine interest can you tell
| what problems airflow solves that can't be handled by celery
| and rabbitmq?
| code_biologist wrote:
| An analogy is "can you tell what problems Django solves that
| can't be handled by wsgi and psycopg?" Nothing fundamentally
| different, but life is a whole lot easier with Django.
| Honestly if you're doing data engineering and you haven't
| spent time with a good DAG runner, you're doing yourself a
| real disservice.
|
| My sibling comment did a good job explaining, but the UI +
| configurable storage + configurable triggers all out of the
| box make life a lot easier.
| Hippocrates wrote:
| I have not used celery + rabbitmq but I assume that combo is
| like sidekiq + redis, or any other job queue + worker system.
|
| Airflow packages those things together and adds some
| additional features:
|
| - UI with Graph, gantt, logs and other views of the workflow
| - Users and permissions
| - Places to store config
| - Mechanisms for passing small data between tasks
| - Various "sensors" for triggering workflows
| - Various operators that interact with common data-oriented
|   systems (bigquery, snowflake, s3, you name it). These are
|   basically libraries that expose a config-forward API.
|
| Probably the main selling point is the pre-made operators,
| but in short it is a complete solution with bells and
| whistles that aligns itself with the data ecosystem.
| bushbaba wrote:
| The idea behind airflow is great. What sucks is people using it
| to do heavy processing. Maybe with serverless/k8s airflow could
| fan out the processing to a cluster to allow for flexibility.
| But then, I guess you end up re-writing spark et al.
| patwater10 wrote:
| <shrug>
|
| ETL seems just like one of those perennial challenges that resist
| humanity's efforts to categorize the world into neat and tidy
| boxes
| markus_zhang wrote:
| I don't really agree with decentralized ETL. You can't imagine
| the mess people proudly wrote. And yeah I have been on the other
| side of the trench too.
| shmoogy wrote:
| I've been using airflow for about 2 years now in production. It's
| been mostly good - the few times things go wrong, it's a huge
| pain in the ass to figure out why... but it's significantly
| better than just straight cron on Linux. Airflow 2 has fixed a
| lot of the speed and catch-up issues from Airflow 1.x.
|
| I don't have time to investigate other solutions like dagster and
| prefect and migrate jobs to it for testing.
| glogla wrote:
| If it is just you, you are fine, and I'm not sure other tools
| would have that much benefit.
|
| Trouble with Airflow starts when multiple teams and user types
| start to share it.
| shmoogy wrote:
| I've definitely noticed more issues after adding users, but
| it's more that they don't actually understand a lot of what
| they're trying to do and cause problems when writing dags.
| glogla wrote:
| Yeah, Airflow isn't multi-tenant.
|
| People can potentially overwrite each other's DAGs.
| Credential management is complicated. Broken DAG can stop
| whole Airflow. Slow DAG can impact performance of whole
| Airflow. Getting DAGs to wait for each other (like one team
| prepares data up to a point and then other team builds on
| that) is kind of a nightmare. Sometimes people want
| features from newer Airflow, but some other team built DAG
| that isn't forward compatible. Etc etc.
|
| But I'm not sure there actually is a better solution
| elsewhere. At least I have not seen it yet, maybe Dagster
| is on a good road.
|
| But as I said, for centralized solutions it works really
| well.
| kumare3 wrote:
| I remember having this feeling a few years ago. What I realized
| is that airflow has taught us a few bad habits and also brought
| ahead an interesting paradigm of the vertical workflow engine.
|
| I agree airflow is old, legacy and ideally folks should not use
| it; the reality is there are a lot of pipelines already built with it
| - sadly. I think as a community we have to start moving away from
| it for more complicated problems.
|
| Disclaimer: I created Flyte.org and heavily believe in
| decentralized development of DAGs and centralized management of
| infrastructure
| edumucelli wrote:
| More recent tools such as Dagster and Prefect have much more to
| offer. One simple example is communication between tasks. Airflow
| has a clunky system for that called XCom. The actual author of
| XCom says you probably should not use it due to the level of
| hackery it has under the hood [1]:
|
| On Dagster and Prefect you communicate between tasks as if you
| were writing pure Python. On Airflow on the other hand ...
|
| [1] https://www.youtube.com/watch?v=TlawR_gi8-Y&t=740s
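|
| A toy illustration of the difference, in plain Python with made-up
| function names (this is neither Airflow's XCom API nor Dagster's or
| Prefect's): the first pair of tasks passes data through a shared
| key-value store, the second simply through return values and
| arguments.

```python
# Style 1: tasks exchange data via a side-channel store (XCom-like in spirit).
store = {}


def extract_push():
    store["rows"] = [1, 2, 3]  # producer must know the key


def total_pull():
    return sum(store["rows"])  # consumer must know the same key


# Style 2: tasks are ordinary functions; data flows through return values.
def extract():
    return [1, 2, 3]


def total(rows):
    return sum(rows)


extract_push()
via_store = total_pull()
via_return = total(extract())
print(via_store, via_return)
```

| Both styles compute the same result; the second just keeps the
| coupling between tasks visible in the function signatures.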
| marcinzm wrote:
| I'm not sure if linking to a talk from 2018 in 2022 for a
| project that is being actively worked on (a bunch of Python
| abstractions for xcom were added in 2.0) is fair.
| byteflip wrote:
| Yes, curious if the Taskflow API introduced as part of
| Airflow 2.0 reduces this pain. It appears much easier/saner
| than working with XCOMs directly - less coupling and removes
| the need for lots of boilerplate code.
| karog wrote:
| There's also Flyte, which is natively data aware and schedules
| tasks around data dependencies. The syntax is essentially pure
| python too.
| MontyCarloHall wrote:
| Dismissing Airflow for not being Astronomer is like dismissing
| Linux for not having the capabilities of a large-scale
| hypervisor.
|
| Replace "Airflow" with "Linux," "data engineers" with "systems
| programmers," and "Astronomer" with your hypervisor of choice
| (Xen/VMWare/etc.), and you can see how absurd the author's point
| is:
|
|   My problem is that ~Airflow~ Linux was not designed to address
|   [high-level systems architecture] problems. We don't need a
|   better [Linux], but we need a higher-level one: a system that
|   enables ~data engineers~ systems programmers to think at a
|   platform level.
|
|   In fact, [Linux] is already displaced. [Linux] qua [Linux] is
|   already obsolete, and it happened right within the [Linux]
|   ecosystem. It's called ~Astronomer~ Xen/VMWare/etc.
|
|   If it sounds like you could simply replace [Linux] with basically
|   any other ~job execution engine~ operating system, that's because
|   you could.
|
| This is where the argument falls apart. Yes, for very large,
| complex deployments, higher-level orchestration is important, but
| the choice of low-level execution engine is also still hugely
| relevant, just as the choice of guest OS is still hugely relevant
| when discussing large deployments of VMs.
|
| Furthermore, very few people actually need very large scale
| deployments; user experience and capabilities at the low-level
| are what most users actually care about.
| orwin wrote:
| Honestly, we have to set up airflow at my job for some datalog
| collection and treatment. Which is fine, only i'm pretty sure
| we had exactly the same issue at my old job that we fixed in
| half a day, including testing and deployment, with a perl
| script. And i think in this particular instance (gitlab logs)
| it was treated with 90% Awk. Meanwhile my coworkers still have
| issues after almost a week (not all of this is on airflow, but
| still).
|
| I'm not saying Airflow is bad (we did set up a lot of hadoop
| clusters and other apache products at my old job, and our
| clients used airflow a lot), but i think the evangelists are so
| good they push airflow for everything, and this is bad. OP did
| use airflow for something it was not really designed for, and
| it sucked, but i do have this impression that tech writers and
| apache evangelists deserve some of the blame.
| mywittyname wrote:
| Managed Airflow doesn't even solve any of the author's outlined
| frustrations. It keeps the "obscene" syntax, it's still
| stateless, it's not "decentralized" etc.
|
| Honestly, the article is so disingenuous that it comes off like
| a paid-for puff piece for Astronomer. It's the article-
| equivalent of the late-night infomercial guy who rips open a
| bag of potato chips like the hulk because he doesn't have this
| special tool that's just four easy payments of $9.99.
| glogla wrote:
| FYI, the infomercials with the strange tools fixing strange
| problems are usually focused on old or disabled people.
| Opening a bag of chips with a ridiculous tool sounds stupid, but
| it might help a stroke survivor or someone with one arm - but
| the sellers don't want to show those struggles on the screen
| to avoid humiliating people, so you see perfectly healthy
| looking young people spilling things like they have some
| neurodegenerative disorder or something. Because the target
| audience might.
|
| Not saying infomercials people are angels, of course, but I
| wanted to share this somewhat nonobvious context.
|
| (To stretch the metaphor, Airflow management system that
| gives everyone their own Airflow might be ridiculous but make
| sense for companies where cooperation is difficult :))
| TTPrograms wrote:
| The new TaskFlow API has been part of Airflow 2.0 since its
| release in 2020: https://airflow.apache.org/docs/apache-
| airflow/stable/tutori...
| stkbailey wrote:
| Interesting, Astronomer was actually my last choice for
| orchestrator. We went with Dagster, but I didn't want to make
| the takeaway "Dagster solves these problems", because it
| doesn't directly. Astronomer was just the best foil for the
| "meta-orchestrator" space that seems to be evolving, and
| which _can_ address these problems.
| hatware wrote:
| Agreed. The author is blaming Airflow for what are ultimately
| poor architecture decisions.
|
| I will admit it's not easy to figure out best practices with
| Airflow, but if you make bad decisions and your system doesn't
| scale with the problem, you didn't understand the problem or
| how to solve it in the first place. The tools you chose are
| second to that.
| peteradio wrote:
| You may not know very precisely the time constants you are
| dealing with in your problem until you give it a shot.
| pharmakom wrote:
| I don't understand the desire to describe DAGs in Python... it's
| a fine scripting language but pretty horrible for this
| declarative description stuff.
| throwaway787544 wrote:
| Snowflake is the future. We shouldn't be writing code at all, we
| should be throwing data into a big hole in the ground and then
| querying the hole, or attaching a query to the hole so as data
| goes in the query does something with it, and chaining those
| queries. The fact that anyone is writing anything more complex
| than SQL to do this is a failure of imagination. Snowflake is
| intended to remove all the unnecessary engineering from you just
| putting data somewhere and doing something useful with it easily.
| rldjbpin wrote:
| Having been forced to work with an obsolete version of Airflow at
| work, I can attest to how narrow-minded the project's focus was
| when it was originally created. The scheduling quirks and UTC
| defaults are enough to paint the picture here.
|
| Not completely sure if most of the issues I've faced were
| resolved in the future releases, but I don't fully agree with the
| take of the article. Like go with the scheduler that works for
| your current and potential future needs. The reason why we
| continue to use Airflow despite the issues is because it works so
| well with our workflows. This does mean that I would recommend it
| to another team.
| llambda wrote:
| To address a point the author makes: I'm entirely unconvinced the
| "shift left" mentality of data democracy (aka business operators
| should write sql) is actually shifting left or a worthy path to
| pursue for most businesses. More recently this 2010s fad seems to
| be dying, and in its place we're seeing centralized data efforts
| that produce data products.
|
| One of the most significant pitfalls of data is failing to
| interrogate the value it provides and assuming that if you give
| everyone access all the time the magic will happen. The truth is
| value does not simply materialize just as value does not
| magically spring from computers by a human powering it on (okay
| sure, you may have already automated the value but that's
| actually the point I'm about to make). In both cases it requires
| an experienced practitioner who collaborates with a larger team
| to intersect their work with the business needs.
|
| Data is tricky, all the more so because it's often seen as a
| panacea by business leaders who aren't connected with the work of
| extracting that value.
| itsoktocry wrote:
| > _that if you give everyone access all the time the magic will
| happen_
|
| There's much ongoing discussion about this in the data world,
| often revolving around "self-service analytics".
|
| Unless you're talking about "our analysts don't have to clean
| data all the time", which, for a large enough organization
| makes sense, "self-service" for non-technical folks is futile
| and pointless. They need specific answers to specific
| questions, not the ability to infinitely explore the data.
| Organizations should _desire_ that kind of focus, not prevent
| it.
| datavirtue wrote:
| The idea was that they were going to hire an army of data
| scientists and become google...magically.
|
| Reality smacked that shit down hard. I left data engineering
| because the projects were all over the place, wildly
| undisciplined and unfocused.
|
| You were lucky to have source control let alone an
| understanding from the business that these projects were in
| fact software development.
|
| I switched back to software engineering because at least
| there is a faint realization that we are...building software.
|
| I might go back when the dust clears.
|
| "Why do we need to hire programmers...I thought we needed
| data engineers?"
|
| "Because the data pipelines are all built with thousands of
| lines of code. Java, python, Fortran, you name it...and your
| job post only mentioned SQL and data modelling"
|
| I could go on forever.
| mason55 wrote:
| This is the constant argument I have with people about data
| products.
|
| You don't need to expose more dimensions or get the users
| more access to the raw data. You need to understand what
| their business is and what their business problems are and
| help them answer those specific questions quickly and
| succinctly.
|
| Yes, there are certainly times where people use huge amounts
| of raw data to uncover the answer to a question they didn't
| know they had. But it's rare, it's expensive to support, and
| most businesses aren't going to be able to do anything with it
| anyway (a whole org built to do X isn't suddenly going to
| shift to do Y because you discovered some insight in a random
| report).
| abirch wrote:
| I've seen data errors because of joins and aggregations. Data
| democratization can be a net negative, especially if people
| don't question the graphs they see.
| mumblemumble wrote:
| With all credit due to Google's excellent and under-appreciated
| paper _Machine Learning: The High Interest Credit Card of
| Technical Debt_ [1], I submit that Big Data is the high
| interest home equity line of credit of business operations
| debt.
|
| It's not that big data tools aren't useful. It's that, when you
| just start amassing huge piles of data without a clear up-front
| plan for how it will be used, and assume that a whole bunch of
| people who have never heard of sampling bias or multiple
| comparisons bias or Coase's Law [2] can figure out what to do
| with it later, you're setting yourself up for a Bad Time.
|
| [1] https://research.google/pubs/pub43146/
|
| [2] "If you torture the data long enough, it will confess."
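|
| A rough illustration of the multiple comparisons problem mentioned
| above, assuming independent tests at a 0.05 significance level (the
| function name is made up for this sketch): the chance of at least
| one false positive grows quickly with the number of comparisons.
|
```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


# The more correlations you go looking for, the more likely one "confesses".
for n in (1, 20, 100):
    print(n, round(familywise_error_rate(n), 3))
```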
| htrp wrote:
| > I submit that Big Data is the high interest home equity
| line of credit of business operations debt.
|
| I like this but it's kinda like the payday loan of business
| operations.
| abirch wrote:
| I'd say that Big Data is the Collateralized Debt Obligations
| of business operations. It looks fabulous from afar but it
| can blow things up quickly if there's no understanding of the
| internals.
| systemvoltage wrote:
| Yet, we abide by data-oriented conclusions outside of
| software engineering all the time. From academic papers to
| FDA to crime statistics.
| mumblemumble wrote:
| I won't say any of those are perfect. But there's at least
| a little more effort toward responsible data analysis in
| academia. The FDA brings an interesting example to mind.
| Take a look at how, on paper, drugs suddenly magically
| became less effective when the FDA started requiring
| clinical trial pre-registration in 2007.
|
| It's also worth noting that, over the past few decades,
| most academic fields have been getting increasingly
| skeptical of the value of correlative research on pre-
| existing data sets. Even among people who have been
| extensively trained in how to do it properly. And yet, the
| vast majority of big data business plans I've seen in
| practice boil down to "collect a huge data set and then let
| people do correlative research on it."
| systemvoltage wrote:
| Agreed, I want more scrutiny than some entity flashing
| "Here is the data". It can easily be exploited behind the
| veneer of data-based-credibility.
| jon_adler wrote:
| Is there any love for the Argo [1] project suite (Workflows,
| Events, CD) for this type of use case? I haven't tried it out
| myself yet; however, it does look interesting.
|
| [1] https://argoproj.github.io
| ricklamers wrote:
| Argo is pretty amazing if you want to take advantage of the
| work Kubernetes has done to scale resources efficiently across a
| cluster of compute nodes.
|
| If you're looking for something that's a bit more high level
| and friendly to expose directly to your data team (data
| scientists/data engineers/data analysts) you can check out
| https://github.com/orchest/orchest
|
| You can think of it as a browser UI/workbench for Argo
| scheduled pipelines. Disclaimer: author of the project
| ForHackernews wrote:
| I've never used their workflows thing, but having been forced
| to live with ArgoCD it sounds horrifying.
|
| Argo is another over-engineered "CNCF" thing trying to ride the
| Kubernetes hype train. It's all "eventually consistent", which
| makes it extraordinarily difficult to see when any particular
| thing actually happened. Is my code deployed? Who knows, Argo
| is "syncing".
|
| Check out these great docs:
| https://argoproj.github.io/argo-workflows/rest-api/
|
| > API reference docs :
|
| > Latest docs (maybe incorrect)
|
| > Interactively in the Argo Server
| UI.<https://localhost:2746/apidocs> (>= v2.10)
|
| Yes, that is a localhost URL on their website.
| robertlagrant wrote:
| How do you know if anything is deployed if it hasn't come
| back and confirmed it's deployed? Manual only?
| saltmeister wrote:
| biellls wrote:
| I have my own opinion on Airflow's pain points and created
| Typhoon Orchestrator
| (https://github.com/typhoon-data-org/typhoon-orchestrator)
| to solve them. It doesn't have many
| stars yet but I've used it to create some pipelines for medium
| sized companies in a few days, and they've been running for over
| a year without issues.
|
| In particular I transpile to Airflow code (can also deploy to
| Lambda) because I think it's still the most robust and well
| supported "runtime", I just don't think the developer experience
| is that good.
| jdoliner wrote:
| I was at Airbnb when we open-sourced Airflow; it was a great
| solution to the problems we had at the time. It's amazing how
| many more use cases people have found for it since then. At the
| time it was pretty focused on solving our problem of
| orchestrating a largely static DAG of SQL jobs. It could do other
| stuff even then, but that was mostly what we were using it for.
| Airflow has become a victim of its success as it's expanded to
| meet every problem which could ever be considered a data
| workflow. The flaws and horror stories in the post and comments
| here definitely resonate with me. Around the time Airflow was
| open-sourced I started working on a data-centric approach to
| workflow management called Pachyderm[0]. By data-centric I mean
| that it's focused around the data itself, and its storage,
| versioning, orchestration and lineage. This leads to a system
| that feels radically different from a job focused system like
| Airflow. In a data-centric system your spaghetti nest of DAGs is
| greatly simplified as the data itself is used to describe most of
| the complexity. The benefit is that data is a lot simpler to
| reason about, it's not a living thing that needs to run in a
| certain way, it just exists, and because it's versioned you have
| strong guarantees about how it can change.
|
| [0] https://github.com/pachyderm/pachyderm
| carlsborg wrote:
| Cool. Are there any published benchmarks on how the data
| versioning engine scales?
| FridgeSeal wrote:
| > Shift 1: "We know the lineage" to "We know what in god's name
| is happening"
|
| Bro I can't even get my company to the _first_ part, and we're
| collectively already having issues with the second? What is
| everyone else's read on this situation in general? Do you all
| have row and table level lineages for your data? For pipelines
| that people are actively using? Every company I've ever been in
| can hardly figure out where finance gets last year's "magical
| excel sheet", let alone be close to a spot where they're actively
| using data lineage tools.
|
| I also don't like Airflow, but for somewhat different reasons.
|
| I think it couples orchestration and transformation too tightly,
| I don't understand the desire to integrate everything with your
| actual runtime Python code - I think it's markedly the wrong
| level of abstraction/integration and limits your engineering
| capacity. There's undoubtedly some good engineering, it's come a
| long way, and it's mighty popular, but every time I look at a
| repo that uses it, the only read I get is "cross-cutting-chaos".
| OrangeMonkey wrote:
| In some fields it's more important than others.
|
| In life sciences research to support synthetic control arms,
| the FDA is caring more about the lineage/manipulation of the
| data than the data science models used to predict X/Y/Z.
|
| IE - what was the data originally, what did it end up as prior
| to ingestion into AIML, why was it changed, what steps were
| involved, etc.
|
| There are not a ton of good out of the box solutions for data
| lineage and it's driving me nuts.
|
| We have Apache NiFi which promises data lineage out of the box
| and _appears_ to deliver. I've never implemented it though.
|
| We have Pachyderm which has some support here but I don't know
| about it.
|
| Besides that it appears roll-your-own.
|
| I kind of wish there was an accepted best practice for data
| lineage but it's, surprisingly, the wild west. And it's
| completely 100% required for industry use.
| tomrod wrote:
| DBT does pretty well?
| dominotw wrote:
| Just put everything in a warehouse (via Fivetran or some such
| thing) and just use dbt.
|
| Use Airflow as a cron runner for dbt.
|
| If you don't need realtime metrics, this formula works way better
| than convoluted airflow dags.
| ricklamers wrote:
| We have had a great experience scheduling Meltano/dbt inside
| Orchest for our Metabase dashboards. As a pattern, combining
| these declarative/configuration CLI tools with a flexible
| orchestration layer (Orchest can run any containerized task, it
| will containerize transparently for you) really shines.
| windows_sucks wrote:
| The problem I've had with Airflow is that it tries to do way too
| much: UI/logging/config management
|
| I've really enjoyed using taskflow
| (https://github.com/taskflow/taskflow); it allows us to employ our
| existing logging and deployment paradigms.
| llbeansandrice wrote:
| I'm kinda meh on this article, but it did lead me to this
| goldmine[1]. We don't use XComs at all so a lot of these aren't
| applicable but other parts absolutely are. We run Airflow at a
| pretty massive scale and not all of these boundaries were
| enforced so now it's a huge mess.
|
| [1] https://towardsdatascience.com/apache-airflow-
| in-2022-10-rul...
| blakeburch wrote:
| I really agree with Shift 2 ("We unblock analysts" to "We enable
| everyone"). The problem is that Airflow (and most other OSS
| orchestrators) are overkill for the majority of data
| practitioners. They lock workflow development into Python,
| forcing you to mix platform logic with executional business
| logic. The complexity to get started building workflows is too
| high, infrastructure challenges always crop up, and the system
| itself is a black box for anyone non-technical.
|
| > The tool data engineers need to be effective in this new world
| does not run scripts, it organizes systems.
|
| 100%. You'll still need to run independent scripts, but today's
| data challenges focus on "how do I connect the stages of data
| operations together". Teams need to figure out how to connect
| data ingestion -> data transformation -> data visualization ->
| alerting and reporting -> ML model deployment -> metadata +
| catalogs -> data augmentation -> API actions.
|
| The larger goal of orchestration is to prevent downstream
| processes from running if the data being processed upstream
| fails. Each stage could be performed with a series of scripts, a
| SaaS tool, or a mix. Each team is responsible for their own
| stages, but they need to know how their work connects to the
| larger picture so when something goes wrong, there's ownership
| and clarity that drives a quick resolution. Unfortunately, this
| still doesn't exist in most organizations because the current
| tooling isn't solving the orchestration and visualization of
| connected systems super effectively. It's instead enabling one-
| off, disconnected data processes.
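|
| The "don't run downstream if upstream fails" idea can be sketched
| in a few lines. This is an illustration only, not how any
| particular orchestrator works; `run_pipeline` and the stage names
| are hypothetical, and every stage is assumed to appear as a key in
| the `upstream` mapping.
|
```python
from graphlib import TopologicalSorter


def run_pipeline(stages, upstream):
    """Run stages in dependency order; skip a stage if any upstream did not succeed.

    stages:   dict of stage name -> zero-argument callable
    upstream: dict of stage name -> set of stage names it depends on
    """
    status = {}
    for name in TopologicalSorter(upstream).static_order():
        if any(status[dep] != "ok" for dep in upstream.get(name, ())):
            status[name] = "skipped"  # an upstream stage failed or was skipped
            continue
        try:
            stages[name]()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
    return status


def broken_transform():
    raise ValueError("bad upstream data")


stages = {
    "ingest": lambda: None,
    "transform": broken_transform,
    "dashboard": lambda: None,
    "alerting": lambda: None,
}
upstream = {
    "ingest": set(),
    "transform": {"ingest"},
    "dashboard": {"transform"},
    "alerting": {"ingest"},
}
result = run_pipeline(stages, upstream)
print(result)
```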
|
| Disclaimer: I built Shipyard (www.shipyardapp.com) to address
| many of these concerns of simplifying the ability to connect data
| tools and quickly automate and action on data.
| rotten wrote:
| I've recently had the sales team from Magniv.io pestering me to
| try it as an alternative to Airflow with a "shift left"
| perspective for automating jobs. I wasn't convinced enough by the
| value prop to dive deeper. I think it was a language problem - I
| was just having trouble understanding and relating to the
| problems they solve and then figuring out whether or not I have
| those problems too.
|
| I'm running into the same issue with this guy's post, although a
| little less so. The question he seems to ask is "With a complex
| pattern of data flows, if something breaks, how do you recover?"
| His argument is that Airflow does not offer enough visibility
| into the full data trace nor enough tools to apply recovery rules
| for repairing broken bits.
|
| I think I agree, but Prometheus doesn't really solve that. Nor
| necessarily does better management of the automated job queue
| backlog and job retries.
|
| He also complains about some syntax and design choices that
| predate MyPy and Pydantic and modern Async Python coding. Those
| seem fairly easy things to drag Airflow forward with in future
| releases.
| zentrus wrote:
| Airflow helped my team out a lot a couple years ago mainly for
| the simplicity of the top-down UI-based view of a complicated ETL
| AND the ability to retry parts of the ETL.
|
| We had lots of lessons learned. For instance, why does
| PythonOperator even exist? It takes a callable and thus you're
| likely not going to see good coding patterns emerge for something
| that needs to be 1000+ LoC. Instead, we just subclassed
| BaseOperator and used tried-and-true OO principles.
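|
| A rough sketch of the subclassing approach, not the poster's
| actual code. `LoadCustomersOperator` and its methods are
| hypothetical, and the fallback class is only a stand-in so the
| example runs without Airflow installed.
|
```python
try:
    from airflow.models.baseoperator import BaseOperator
except ImportError:
    # Stand-in (an assumption for this sketch) so it runs without Airflow.
    class BaseOperator:
        def __init__(self, task_id, **kwargs):
            self.task_id = task_id


class LoadCustomersOperator(BaseOperator):
    """Hypothetical ETL step split into small, individually testable methods
    rather than one large callable handed to PythonOperator."""

    def __init__(self, source_rows, **kwargs):
        super().__init__(**kwargs)
        self.source_rows = source_rows

    def extract(self):
        return list(self.source_rows)

    def transform(self, rows):
        # Drop blank rows and normalize the rest.
        return [row.strip().lower() for row in rows if row.strip()]

    def execute(self, context):
        # Airflow calls execute() when the task runs.
        return self.transform(self.extract())


op = LoadCustomersOperator(task_id="load_customers", source_rows=[" Alice ", "", "BOB"])
print(op.execute(context={}))
```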
| geertj wrote:
| We tried to set up Airflow in our team in the past. The big
| problem we encountered is that its unit of management (I believe
| it's called a "job" but I'm rusty on this) is too low level. Our
| pipeline processes a lot of data and we have millions of jobs per
| day. Once Airflow has a (planned or unplanned) outage, 10s of
| thousands of jobs start piling up, and it never recovers from
| that.
|
| In the end we replaced our data orchestration with a stateless
| lambda that for a configured time interval 1/ looks at what
| output data is missing, 2/ cross-references that with running
| jobs (in AWS Batch), and 3/ submits jobs for missing data that has
| no job. Jobs themselves are essentially stateless. They are never
| restarted and we don't even look at their status. If one fails we
| notice because there will be a hole in the output and we
| therefore submit a new one. Some safety precautions are added to
| prevent a job from repeatedly failing, but that's the exception.
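|
| The three steps above can be sketched as a single stateless pass.
| This is a guess at the shape of the logic, not the actual lambda;
| `reconcile`, its parameters, and the hourly keys are hypothetical,
| and `submit` stands in for whatever wraps the batch submission.
|
```python
def reconcile(expected_outputs, existing_outputs, running_jobs, submit):
    """Submit a job for every missing output that has no job in flight.

    expected_outputs: set of output keys that should exist for the time window
    existing_outputs: set of output keys already present in storage
    running_jobs:     set of output keys that already have a running job
    submit:           callable taking one output key
    """
    missing = expected_outputs - existing_outputs  # 1/ what output is missing
    to_submit = missing - running_jobs             # 2/ minus jobs already running
    for key in sorted(to_submit):                  # 3/ submit jobs for the rest
        submit(key)
    return to_submit


submitted = []
result = reconcile(
    expected_outputs={"2022-08-02T00", "2022-08-02T01", "2022-08-02T02"},
    existing_outputs={"2022-08-02T00"},
    running_jobs={"2022-08-02T01"},
    submit=submitted.append,
)
print(submitted)
```

| Because nothing is remembered between passes, a failed job simply
| shows up as a missing output on the next pass and gets resubmitted.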
|
| Maybe Airflow has moved on from when we last tried it. But this
| was our experience.
| peteradio wrote:
| What do you do for repeated failures? Does it get flagged for a
| manual debug or does it kick into a different mode of
| automation?
| geertj wrote:
| We notice repeated failures because we have metrics on our
| "up to dateness", and those metrics will stall. We also send
| logs to CloudWatch Logs and alarm on a certain threshold of
| errors. Once an alarm fires, we investigate manually and see
| why the job is failing. This happens occasionally but not too
| much. While we are investigating, we are spinning up repeat
| jobs with some frequency, but this hasn't proved to be a
| problem.
| hatware wrote:
| > The big problem we encountered is that its unit of management
| (I believe it's called a "job" but I'm rusty on this) is too
| low level. Our pipeline processes a lot of data and we have
| millions of jobs per day. Once Airflow has a (planned or
| unplanned) outage, 10s of thousands of jobs start piling up, and
| it never recovers from that.
|
| That sounds more like an architecture-at-scale problem than
| something that is Airflow's 'fault.' Airflow may never have
| been the right tool for the job but it's getting all the blame.
___________________________________________________________________
(page generated 2022-08-02 23:01 UTC)