[HN Gopher] New Apache Airflow Operators for Google Generative AI
___________________________________________________________________
New Apache Airflow Operators for Google Generative AI
Author : seeyam
Score : 32 points
Date : 2024-08-12 13:46 UTC (9 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| ssahoo wrote:
| So they added three brand-new Airflow operators to Cloud Composer for
| interacting with Vertex AI's generative models.
| annexrichmond wrote:
| I can't imagine why people use Airflow these days. Its DAG DSL
| means you have to fit your biz logic to their paradigm. I expect
| a framework to fit nicely with existing biz logic. And that means
| you're essentially stuck. It has no interoperability; hence the
| need to build custom operators for everything.
|
| It doesn't scale for teams because the package is incredibly
| bloated, so once you need to run multiple images via K8s
| operators, you lose out on a lot of other Airflow functionality
| because it all assumes you have your biz logic embedded within
| DAGs, but at the same time I don't know why anyone would develop
| this way.
| ssahoo wrote:
| What alternative do you recommend?
| annexrichmond wrote:
| I've done a lot of evaluation of such frameworks, and I hope to
| publish more on it. It really depends on your requirements. I
| would look at Prefect, Flyte, Dagster, and Temporal ahead of
| Airflow, though.
| whalesalad wrote:
| We've been running Dagster for ~6 months on ECS and it has
| been rock solid. (knocks on wood)
| 0cf8612b2e1e wrote:
| Airflow is kind of the default platform. If you do not have
| a go-to alternative which is superior in multiple dimensions
| today, I think that says the problem itself is hard.
|
| Might as well go with the devil everyone knows.
| 8organicbits wrote:
| Where do you think you'd publish it? I'm interested to read
| the details.
| computershit wrote:
| Prefect has more polish and is easier to get started with than any
| of the existing options. We've been running their self-hosted
| for over three years and it basically stays out of the way.
|
| We looked at Dagster as well as Airflow. I really, really
| liked Dagster but the BI team didn't.
|
| I cannot imagine using Airflow for anything meaningful and
| respecting myself at the end of a work day. The local
| development experience was abysmal. Deployments sucked.
|
| That being said, if you're not using anything except maybe
| cron right now, and if you don't care about the solution
| being a proper data pipelines orchestration platform
| trademark symbol, I'd recommend starting with Windmill.
| nyrikki wrote:
| The DAG constraints are a feature to prevent architectural
| erosion.
|
| Note that any feedforward neural network (e.g. anything using
| attention) is also a DAG.
|
| You can always encapsulate more complex logic in a single
| subroutine or even choose a hex or onion pattern for biz logic.
|
| But what do you suggest besides DAGs + saga patterns that
| doesn't result in a ball of mud over time for distributed
| systems?
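
The DAG constraint discussed above can be illustrated with Python's standard-library graphlib module. This is a minimal sketch, not how Airflow itself validates DAGs; the task names and the `execution_order` helper are hypothetical. It shows that an acyclic dependency graph always yields a run order, while a graph with a back edge is rejected outright.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(graph):
    """Return one valid run order, or raise CycleError if the graph is not a DAG."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(pipeline)

# Adding a back edge (extract now depends on load) makes the graph cyclic.
cyclic = {**pipeline, "extract": {"load"}}
try:
    execution_order(cyclic)
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

Rejecting the cycle up front is the sense in which the constraint acts as a guard rail: a dependency loop fails at definition time rather than surfacing as a pipeline that never finishes.
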
| annexrichmond wrote:
| DAGs are fine. Their DSL is not because it's abstracting the
| wrong things in the wrong place. It's a global file with
| static definitions; why do you hardcode KubernetesOperator
| when maybe you don't want a KubernetesOperator in a test env?
| There is also no type safety between tasks/operators. And
| it's an extremely dependency heavy package with no
| client/server isolation so bundling Airflow for multiple
| teams is just not viable.
| lyu07282 wrote:
| > It's a global file with static definitions
|
| Why do you think you define DAGs in Python? The point is to
| be dynamic, exactly to do things like switching between
| operator types based on things like the environment. Sorry
| you really don't seem to know a lot about airflow for your
| strong opinions against it, I'm out, no offense intended at
| all.
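
The idea of switching task types by environment can be sketched in plain Python. The classes below are stand-ins and are NOT Airflow's operator classes; `DEPLOY_ENV`, `make_task`, and the image name are assumptions made up for illustration. The point is only that a DAG file is ordinary Python, so the choice can be made when the file is evaluated.

```python
import os
from dataclasses import dataclass


# Stand-ins for real operator classes; these are not part of Airflow.
@dataclass
class LocalTask:
    task_id: str
    command: str


@dataclass
class PodTask:
    task_id: str
    command: str
    image: str


def make_task(task_id, command, env=None):
    """Pick a task type from the environment at definition time."""
    # DEPLOY_ENV is a hypothetical variable name, defaulting to "dev".
    env = env or os.environ.get("DEPLOY_ENV", "dev")
    if env == "prod":
        return PodTask(task_id, command, image="example/etl:latest")
    return LocalTask(task_id, command)


prod_task = make_task("nightly_etl", "python etl.py", env="prod")
test_task = make_task("nightly_etl", "python etl.py", env="test")
```

Whether a small factory like this counts as "building your own abstractions" is exactly the disagreement in this subthread; the sketch takes no side on that.
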
| annexrichmond wrote:
| I'm well aware that's "possible", but if you have to
| build your own abstractions and CI/CD to make it usable
| this way, it doesn't seem very well designed.
| lyu07282 wrote:
| > hence the need to build custom operators for everything
|
| But isn't that the point? I never got the impression that you
| were supposed to build everything with the built-in operators,
| they are just the "batteries included" part you can wrap or
| extend. I really don't understand the criticism.
|
| My criticism was always xcom, but that's a moot point now that
| we have TaskFlow. Airflow is awesome and very flexible; you just
| have to adapt to how it's supposed to be used instead of
| fighting against it, I find.
| annexrichmond wrote:
| That might be fine for small teams, but any 100+ person
| company would already have their own abstractions and don't
| need to reinvent those wheels just for Airflow. But even if
| you're smaller I still think you're setting up for failure if
| all your biz logic is within Airflow unless you are careful
| about making it reusable/shareable for other contexts. E.g.,
| what if you want to convert some scheduled pipeline to some
| event-driven architecture with other systems? That means
| needing to refactor everything out. It's not interoperable or
| modular and that's why it should be avoided.
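
One way to read the "reusable/shareable" point is to keep the business logic in a plain function and treat the trigger as a thin wrapper. The sketch below assumes nothing about Airflow; every name in it (`deduplicate_orders`, `scheduled_entrypoint`, `event_entrypoint`, the record shape) is hypothetical.

```python
def deduplicate_orders(orders):
    """Hypothetical business logic: keep the first record per order id."""
    seen = set()
    unique = []
    for order in orders:
        if order["id"] not in seen:
            seen.add(order["id"])
            unique.append(order)
    return unique


def scheduled_entrypoint(load_batch):
    """Called by a scheduler: load a whole batch, then apply the logic."""
    return deduplicate_orders(load_batch())


def event_entrypoint(event):
    """Called per message by an event consumer: same logic, different trigger."""
    return deduplicate_orders(event["orders"])


batch = [{"id": 1}, {"id": 2}, {"id": 1}]
from_schedule = scheduled_entrypoint(lambda: batch)
from_event = event_entrypoint({"orders": batch})
```

With this split, moving from a scheduled run to an event-driven one changes only the wrapper, which is the kind of refactoring cost the comment is worried about.
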
| nooorofe wrote:
| > if you want to convert some scheduled pipeline to some
| event-driven architecture
|
| Airflow has sensors and triggers.
| https://airflow.apache.org/docs/apache-airflow/stable/author...
|
| But at its core it is built around the data pipeline concept; an
| event-driven pipeline will be much more fragile. Airflow
| intentionally doesn't manage business logic, it works with
| "tasks".
| annexrichmond wrote:
| Yes, but that means you are forced to build EDA on top of
| Airflow, which may not be ideal for many cases. You are
| stuck managing your pools/workers within Airflow's
| paradigm, which means all workload must (a) be written in
| Python and (b) have Airflow installed on the venv (very
| heavy pkg) and (c) be k8s pod or Celery (unless you write
| your own).
| nyrikki wrote:
| Only because you have chosen to introduce configuration
| and maintenance complexity by using airflow as enterprise
| wide middleware.
|
| In a modern event-based SOA, products like airflow are a
| sometimes food while pub/sub is the default.
|
| Perhaps a search for images of the Zachman Framework
| would help conceptualize how you are tightly coupling to
| the implementation.
|
| But also research SOA 2.0, or event-based SOA; the
| Enterprise Service Bus concept of the original SOA is as
| dead as CORBA.
|
| ETA: the minimal package load for airflow isn't bad, are
| you installing all of the plugins and their dependencies?
| annexrichmond wrote:
| We only use KubernetesOperators, but this has many
| downsides, and it's very clearly an afterthought in the
| Airflow project. It creates confusion because users of
| Airflow expect features A, B, and C, and when using
| KubernetesOperators they aren't functional because your
| biz logic is separated. E.g., if your biz logic knows what
| S3 it talks to in an external task, how can Airflow? So
| now its Dataset feature is useless.
|
| There are a number of blog posts echoing a similar
| critique[1].
|
| Using KubernetesOperators creates a lot of wrong
| abstractions, impedes testability, and makes Airflow as a
| whole a pretty overkill system just to monitor external
| tasks. At that point, you should have just had your
| orchestration in client code to begin with, and many
| other frameworks made this correct division between
| client and server. That would also make it easier to
| support multiple languages.
|
| According to their README:
| https://github.com/apache/airflow#approach-to-dependencies-o...
|
| > Airflow has a lot of dependencies - direct and transitive
|
| > The important dependencies are: SQLAlchemy, Alembic, Flask,
| werkzeug, celery, kubernetes
|
| Why should biz logic that just needs to run Spark and
| interact with S3 now need to run a web server?
|
| [1] Anecdotes from various posts:
| - https://medium.com/bluecore-engineering/were-all-using-airfl...
| - https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-co...
| - https://dagster.io/blog/dagster-airflow
|
| > Airflow, in its design, made the incorrect abstraction
| by having Operators actually implement functional work
| instead of spinning up developer work.
|
| > By simply moving to using a Kubernetes Operator,
| Airflow developers can develop more quickly, debug more
| confidently, and not worry about conflicting package
| requirements.
|
| > Airflow lacks proper library isolation. It becomes hard
| or impossible to do if any team requires a specific
| library version for a given workflow
|
| > There is no way to separate DAGs to development,
| staging, and production using out-of-the-box Airflow
| features. That makes Airflow harder to use for mission-
| critical applications that require proper testing and the
| ability to roll back
|
| > Data pipelines written for Airflow are typically bound
| to a particular environment. To avoid dependency hell,
| most guides recommend defining Airflow tasks with
| operators like the KubernetesPodOperator, which dictates
| that the task gets executed in Kubernetes. When a DAG is
| written in this way, it's nigh-impossible to run it
| locally or as part of CI. And it requires opting out of
| all of the integrations that come out-of-the-box with
| Airflow.
| adammarples wrote:
| If you need to convert a scheduled pipeline into some event
| driven architecture then yes, it will need a rewrite. Is
| there any case in which this wouldn't be true? What does it
| mean to be "interoperable"? Airflow DAGs can be triggered
| by events or they can trigger events if needs be. I admit
| it is not designed to do event streaming though.
| nyrikki wrote:
| It sounds like you are tightly coupling to implementation
| and also expecting this to be the rightfully maligned ESB.
|
| Airflow, AWS Step Functions, etc. are complex event
| processors and shouldn't typically be used as widely as you
| are suggesting.
|
| It is a common pitfall to accidentally build distributed
| monoliths, and in fact often requires active architectural
| governance to avoid it.
|
| Airflow isn't perfect, but I highly recommend considering a
| review on why modernization efforts fail.
|
| The what and the how shouldn't be tightly coupled,
| especially across capabilities.
|
| But yes. Airflow makes a poor ESB/DTC, but that is not the
| target for the project.
| politelemon wrote:
| > I expect a framework to fit nicely with existing biz logic.
|
| I wouldn't, that's just describing a custom codebase. Airflow
| comes with bells and whistles that can be taken advantage of if
| you fit your execution in their DAG model, that's all. Those
| bells and whistles can be an excellent pattern to work with.
| You can go all in on operators, or you can just have an
| operator call out to an external task that does all the work
| and only rely on Airflow for its retry/alerting mechanism while
| keeping the external task in language of your choice.
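
The "thin operator around an external task" pattern can be sketched with the standard library alone. This is an illustration of the idea, not Airflow's retry or alerting mechanism; `run_with_retries` and its parameters are hypothetical, and the external task here is just a Python one-liner standing in for a program in any language.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, retries=3, delay_seconds=0.0, on_failure=print):
    """Run an external command, retrying while it exits with a non-zero code."""
    for attempt in range(1, retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # on_failure stands in for an alerting hook (email, chat message, ...).
        on_failure(f"attempt {attempt}/{retries} failed with exit code {result.returncode}")
        if attempt < retries:
            time.sleep(delay_seconds)
    raise RuntimeError(f"command failed after {retries} attempts: {cmd}")


# The external task could be any executable; a Python one-liner keeps this runnable.
output = run_with_retries([sys.executable, "-c", "print('done')"])
```

The orchestrator only sees an exit code and some output, so the external task stays free of any orchestrator dependency. The trade-off raised in the reply below still applies: features that need to see inside the task are not available in this setup.
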
| annexrichmond wrote:
| Airflow owning the scheduling, retry, task
| branching/dependencies is fine. But a few issues: to take
| advantage of Airflow's core features (XCOMs, task mapping,
| etc) your biz logic/tasks must be in the same venv, which is
| not sustainable as teams grow. Once you have external tasks,
| you now have a more complex system to operate and test, and
| your Airflow installation is now very overkill (you need a
| pool of workers just to monitor external... workers?).
___________________________________________________________________
(page generated 2024-08-12 23:01 UTC)