[HN Gopher] New Apache Airflow Operators for Google Generative AI
       ___________________________________________________________________
        
       New Apache Airflow Operators for Google Generative AI
        
       Author : seeyam
       Score  : 32 points
       Date   : 2024-08-12 13:46 UTC (9 hours ago)
        
 (HTM) web link (cloud.google.com)
 (TXT) w3m dump (cloud.google.com)
        
       | ssahoo wrote:
        | So they added three brand-new Airflow operators for
        | interacting with Vertex AI's generative models to their
        | Cloud Composer offering.
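
  For the curious, a DAG using one of the new operators might look
  roughly like this. This is a sketch based on my reading of the
  announcement; the module path, class name, and parameters are my
  best reading of the apache-airflow-providers-google docs, so treat
  them as assumptions and verify before use:

```python
from datetime import datetime

from airflow import DAG
# Module path and operator name are assumptions based on the
# announcement; check the provider docs before relying on them.
from airflow.providers.google.cloud.operators.vertex_ai.generative_model import (
    GenerativeModelGenerateContentOperator,
)

with DAG(
    dag_id="vertex_genai_demo",       # hypothetical DAG id
    start_date=datetime(2024, 8, 1),
    schedule=None,
) as dag:
    generate = GenerativeModelGenerateContentOperator(
        task_id="generate_content",
        project_id="my-gcp-project",        # hypothetical project
        location="us-central1",
        pretrained_model="gemini-1.5-pro",  # model name illustrative
        contents=["Summarize yesterday's pipeline failures."],
    )
```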
        
       | annexrichmond wrote:
        | I can't imagine why people use Airflow these days. Its DAG DSL
        | means you have to fit your biz logic to its paradigm, whereas
        | I expect a framework to fit nicely around existing biz logic.
        | That means you're essentially stuck: it has no
        | interoperability, hence the need to build custom operators for
        | everything.
       | 
        | It doesn't scale for teams because the package is incredibly
        | bloated, so once you need to run multiple images via K8s
        | operators you lose out on a lot of other Airflow
        | functionality: it all assumes your biz logic is embedded
        | within DAGs, and I don't know why anyone would develop that
        | way.
        
         | ssahoo wrote:
         | What alternative do you recommend?
        
           | annexrichmond wrote:
            | I've done a lot of evaluation of such frameworks, and I
            | hope to publish more on it. It really depends on your
            | requirements, but I would look at Prefect, Flyte, Dagster,
            | and Temporal ahead of Airflow.
        
             | whalesalad wrote:
             | We've been running Dagster for ~6 months on ECS and it has
             | been rock solid. (knocks on wood)
        
             | 0cf8612b2e1e wrote:
              | Airflow is kind of the default platform. If you do not
              | have a go-to alternative which is superior in multiple
              | dimensions today, I think that says the problem itself
              | is hard.
             | 
             | Might as well go with the devil everyone knows.
        
             | 8organicbits wrote:
             | Where do you think you'd publish it? I'm interested to read
             | the details.
        
           | computershit wrote:
            | Prefect has more polish and is easier to get started with
            | than any of the existing options. We've been running their
            | self-hosted version for over three years and it basically
            | stays out of the way.
           | 
           | We looked at Dagster as well as Airflow. I really, really
           | liked Dagster but the BI team didn't.
           | 
           | I cannot imagine using Airflow for anything meaningful and
           | respecting myself at the end of a work day. The local
           | development experience was abysmal. Deployments sucked.
           | 
            | That being said, if you're not using anything except maybe
            | cron right now, and if you don't care about the solution
            | being a proper Data Pipelines Orchestration Platform(tm),
            | I'd recommend starting with Windmill.
        
         | nyrikki wrote:
         | The DAG constraints are a feature to prevent architectural
         | erosion.
         | 
         | Note that any feedforward neural network (e.g. anything using
         | attention) is also a DAG.
         | 
         | You can always encapsulate more complex logic in a single
         | subroutine or even choose a hex or onion pattern for biz logic.
         | 
         | But what do you suggest besides DAGs + saga patterns that
         | doesn't result in a ball of mud over time for distributed
         | systems?
        
           | annexrichmond wrote:
            | DAGs are fine. Their DSL is not, because it abstracts the
            | wrong things in the wrong place. It's a global file with
            | static definitions; why hardcode a KubernetesOperator when
            | you may not want a KubernetesOperator in a test env? There
            | is also no type safety between tasks/operators. And it's
            | an extremely dependency-heavy package with no
            | client/server isolation, so bundling Airflow for multiple
            | teams is just not viable.
        
             | lyu07282 wrote:
             | > It's a global file with static definitions
             | 
              | Why do you think you define DAGs in Python? The point is
              | to be dynamic, exactly so you can do things like switch
              | between operator types based on the environment. Sorry,
              | you really don't seem to know a lot about Airflow given
              | your strong opinions against it. I'm out; no offense
              | intended at all.
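
  To make the dynamic-definition point concrete, here's a minimal
  sketch of the pattern being described. The operator classes are
  stand-ins so the example is self-contained (the real ones would be
  imported from Airflow provider packages), and `AIRFLOW_ENV` is a
  hypothetical variable name:

```python
import os

class KubernetesPodOperator:
    """Stand-in for the real provider class (runs the task in a k8s pod)."""
    def __init__(self, task_id):
        self.task_id = task_id

class PythonOperator:
    """Stand-in for the lightweight in-process operator."""
    def __init__(self, task_id):
        self.task_id = task_id

def make_task(task_id):
    """Pick the operator class at DAG-parse time based on the
    environment, since DAG files are ordinary Python executed by the
    scheduler."""
    if os.environ.get("AIRFLOW_ENV", "test") == "prod":
        return KubernetesPodOperator(task_id=task_id)
    return PythonOperator(task_id=task_id)
```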
        
               | annexrichmond wrote:
               | I'm well aware that's "possible", but if you have to
               | build your own abstractions and CI/CD to make it usable
               | this way, it doesn't seem very well designed.
        
         | lyu07282 wrote:
         | > hence the need to build custom operators for everything
         | 
         | But isn't that the point? I never got the impression that you
         | were supposed to build everything with the built-in operators,
         | they are just the "batteries included" part you can wrap or
         | extend. I really don't understand the criticism.
         | 
          | My criticism was always xcom, but that's a moot point now
          | that we have TaskFlow. Airflow is awesome and very flexible;
          | you just have to adapt to how it's supposed to be used
          | instead of fighting against it, I find.
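
  For readers who haven't seen it: TaskFlow replaces explicit
  xcom_push/xcom_pull with plain return values and arguments. A rough
  sketch of the shape, using a stand-in for the real
  `airflow.decorators.task` decorator so it runs anywhere:

```python
# Stand-in for Airflow's TaskFlow @task decorator; the real one lives
# in airflow.decorators and wraps the function in an operator.
def task(fn):
    fn.is_airflow_task = True
    return fn

@task
def extract():
    return {"rows": [1, 2, 3]}   # return value replaces xcom_push

@task
def transform(payload):
    return sum(payload["rows"])  # upstream value arrives as an argument

@task
def load(total):
    return f"loaded total={total}"

# With real Airflow, calling these inside a @dag-decorated function
# wires the dependencies and moves data via XCom behind the scenes.
result = load(transform(extract()))
```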
        
           | annexrichmond wrote:
            | That might be fine for small teams, but any 100+ person
            | company would already have its own abstractions and
            | doesn't need to reinvent those wheels just for Airflow.
            | But even if you're smaller, I still think you're setting
            | yourself up for failure if all your biz logic is within
            | Airflow, unless you are careful about making it
            | reusable/shareable for other contexts. E.g., what if you
            | want to convert some scheduled pipeline to an event-driven
            | architecture with other systems? That means refactoring
            | everything out. It's not interoperable or modular, and
            | that's why it should be avoided.
        
             | nooorofe wrote:
             | > if you want to convert some scheduled pipeline to some
             | event-driven architecture
             | 
             | Airflow has sensors and triggers.
             | https://airflow.apache.org/docs/apache-
             | airflow/stable/author...
             | 
              | But at its core it is built around the data-pipeline
              | concept; an event-driven pipeline will be much more
              | fragile. Airflow intentionally doesn't manage business
              | logic; it works with "tasks".
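
  As a sketch of what a sensor looks like: a sensor is just an
  operator whose `poke()` is re-invoked on an interval until it
  returns True. The base class below is a stand-in for Airflow's
  `BaseSensorOperator` so the example is self-contained, and the
  event-set plumbing is hypothetical:

```python
class BaseSensorOperator:
    """Stand-in for airflow.sensors.base.BaseSensorOperator, whose
    poke() is called repeatedly until it returns True."""
    def __init__(self, task_id):
        self.task_id = task_id

class FileDroppedSensor(BaseSensorOperator):
    """Hypothetical sensor: succeed once an upstream system has
    signalled that its upload finished."""
    def __init__(self, task_id, seen_events):
        super().__init__(task_id=task_id)
        self.seen_events = seen_events

    def poke(self, context):
        # False -> reschedule and try again later;
        # True  -> let downstream tasks run.
        return "upload_complete" in self.seen_events
```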
        
               | annexrichmond wrote:
                | Yes, but that means you are forced to build EDA on top
                | of Airflow, which may not be ideal in many cases. You
                | are stuck managing your pools/workers within Airflow's
                | paradigm, which means every workload must (a) be
                | written in Python, (b) have Airflow installed in the
                | venv (a very heavy pkg), and (c) run as a k8s pod or
                | via Celery (unless you write your own).
        
               | nyrikki wrote:
                | Only because you have chosen to introduce
                | configuration and maintenance complexity by using
                | Airflow as enterprise-wide middleware.
                | 
                | In a modern event-based SOA, products like Airflow are
                | a sometimes food while pub/sub is the default.
                | 
                | Perhaps a search for images of the Zachman Framework
                | would help conceptualize how you are tightly coupling
                | to the implementation.
                | 
                | But also research SOA 2.0, or event-based SOA; the
                | Enterprise Service Bus concept of the original SOA is
                | as dead as CORBA.
                | 
                | ETA: the minimal package load for Airflow isn't bad.
                | Are you installing all of the plugins and their
                | dependencies?
        
               | annexrichmond wrote:
                | We only use KubernetesOperators, but this has many
                | downsides, and it's very clearly an afterthought in
                | the Airflow project. It creates confusion because
                | users of Airflow expect features A, B, and C, and when
                | using KubernetesOperators those aren't functional
                | because your biz logic is separated. E.g., if only
                | your biz logic knows which S3 bucket an external task
                | talks to, how can Airflow? So now its Dataset feature
                | is useless.
               | 
               | There are a number of blog posts echoing a similar
               | critique[1].
               | 
               | Using KubernetesOperators creates a lot of wrong
               | abstractions, impedes testability, and makes Airflow as a
               | whole a pretty overkill system just to monitor external
               | tasks. At that point, you should have just had your
               | orchestration in client code to begin with, and many
               | other frameworks made this correct division between
               | client and server. That would also make it easier to
               | support multiple languages.
               | 
               | According to their README:
               | https://github.com/apache/airflow#approach-to-
               | dependencies-o...
               | 
                | > Airflow has a lot of dependencies - direct and
                | transitive
                | 
                | > The important dependencies are: SQLAlchemy, Alembic,
                | Flask, werkzeug, celery, kubernetes
               | 
               | Why should biz logic that just needs to run Spark and
               | interact with S3 now need to run a web server?
               | 
               | [1] Anecdotes from various posts -
               | https://medium.com/bluecore-engineering/were-all-using-
               | airfl... - https://eng.lyft.com/orchestrating-data-
               | pipelines-at-lyft-co... -
               | https://dagster.io/blog/dagster-airflow
               | 
               | > Airflow, in its design, made the incorrect abstraction
               | by having Operators actually implement functional work
               | instead of spinning up developer work.
               | 
               | > By simply moving to using a Kubernetes Operator,
               | Airflow developers can develop more quickly, debug more
               | confidently, and not worry about conflicting package
               | requirements.
               | 
               | > Airflow lacks proper library isolation. It becomes hard
               | or impossible to do if any team requires a specific
               | library version for a given workflow
               | 
               | > There is no way to separate DAGs to development,
               | staging, and production using out-of-the-box Airflow
               | features. That makes Airflow harder to use for mission-
               | critical applications that require proper testing and the
               | ability to roll back
               | 
               | > Data pipelines written for Airflow are typically bound
               | to a particular environment. To avoid dependency hell,
               | most guides recommend defining Airflow tasks with
               | operators like the KubernetesPodOperator, which dictates
               | that the task gets executed in Kubernetes. When a DAG is
               | written in this way, it's nigh-impossible to run it
               | locally or as part of CI. And it requires opting out of
               | all of the integrations that come out-of-the-box with
               | Airflow.
        
             | adammarples wrote:
              | If you need to convert a scheduled pipeline into some
              | event-driven architecture then yes, it will need a
              | rewrite. Is there any case in which this wouldn't be
              | true? What does it mean to be "interoperable"? Airflow
              | DAGs can be triggered by events, or they can trigger
              | events if need be. I admit it is not designed to do
              | event streaming, though.
        
             | nyrikki wrote:
              | It sounds like you are tightly coupling to the
              | implementation and also expecting this to be the
              | rightfully maligned ESB.
              | 
              | Airflow, AWS Step Functions, etc. are complex event
              | processors and shouldn't typically be used as widely as
              | you are suggesting.
              | 
              | It is a common pitfall to accidentally build distributed
              | monoliths, and avoiding it in fact often requires active
              | architectural governance.
              | 
              | Airflow isn't perfect, but I highly recommend reviewing
              | why modernization efforts fail.
             | 
             | The what and the how shouldn't be tightly coupled,
             | especially across capabilities.
             | 
             | But yes. Airflow makes a poor ESB/DTC, but that is not the
             | target for the project.
        
         | politelemon wrote:
         | > I expect a framework to fit nicely with existing biz logic.
         | 
          | I wouldn't; that's just describing a custom codebase.
          | Airflow comes with bells and whistles that can be taken
          | advantage of if you fit your execution into their DAG
          | model, that's all. Those bells and whistles can be an
          | excellent pattern to work with. You can go all in on
          | operators, or you can just have an operator call out to an
          | external task that does all the work and rely on Airflow
          | only for its retry/alerting mechanism while keeping the
          | external task in the language of your choice.
        
           | annexrichmond wrote:
            | Airflow owning the scheduling, retries, and task
            | branching/dependencies is fine. But a few issues: to take
            | advantage of Airflow's core features (XComs, task mapping,
            | etc.) your biz logic/tasks must be in the same venv, which
            | is not sustainable as teams grow. And once you have
            | external tasks, you have a more complex system to operate
            | and test, and your Airflow installation is now very
            | overkill (you need a pool of workers just to monitor
            | external... workers?).
        
       ___________________________________________________________________
       (page generated 2024-08-12 23:01 UTC)