[HN Gopher] ETL Pipelines with Airflow: The Good, the Bad and th...
       ___________________________________________________________________
        
       ETL Pipelines with Airflow: The Good, the Bad and the Ugly
        
       Author : Arimbr
       Score  : 112 points
       Date   : 2021-10-08 14:33 UTC (8 hours ago)
        
 (HTM) web link (airbyte.io)
 (TXT) w3m dump (airbyte.io)
        
       | ltbarcly3 wrote:
        | This is just advertising copy. It isn't giving unbiased advice;
        | there is an obvious conflict of interest here.
       | 
       | I was hoping to learn some things to help me avoid common airflow
       | problems, but it only talks about database sync jobs for the most
       | part and jumps into pitching Airbyte over and over, the last 2/3
       | of the article being a sales pitch, right out of a marketing
       | class.
       | 
       | Hi to the author and other Airbyte employees pushing this to the
       | frontpage! I hope you didn't have to get up too early to
       | coordinate your voting. Make sure to give this comment a downvote
       | so we know you are out there!
        
       | mkw5053 wrote:
       | Having always been on AWS and using Glue Spark Jobs for my jobs,
       | I've never felt any benefit of using Airflow for orchestration
       | over Glue Workflows. I can understand some people not wanting to
        | deal with vendor lock-in. I'm curious what others' opinions are.
        
         | higeorge13 wrote:
          | For Glue jobs you probably don't need anything more. However,
          | Step Functions is the norm for creating workflows in AWS
          | nowadays.
        
       | zepmck wrote:
       | If you have GPUs, NVTabular outperforms most of the frameworks
       | out there: https://github.com/NVIDIA/NVTabular
        
       | fredliu wrote:
        | I've never used Airflow, but I used Step Functions in AWS to
        | pretty much achieve the same things this article describes. I
        | wonder if anybody has used both and what the pros and cons are
        | between them, besides the obvious one that Step Functions is
        | native to AWS and so works better within the AWS ecosystem,
        | while Airflow is open source and service/provider agnostic?
        
         | higeorge13 wrote:
         | A couple points from my end:
         | 
          | - I miss the Airflow UI to monitor the workflow execution,
          | clear failed tasks, its pre-built notifications, emails and so
          | on.
          | 
          | - Passing state between step functions is a bit tricky tbh. I
          | found Airflow's way a little more straightforward.
          | 
          | - Step Functions seem to scale better, especially when you aim
          | to use them for dynamic workflows. The recent update to Step
          | Functions where you can literally use any AWS service API
          | (compared to a couple weeks ago where you were restricted to
          | around 10, and before that only Lambdas) minimises its
          | weaknesses imo.
        
           | fredliu wrote:
           | Ya, the new Step Function service integration feature is
           | awesome, really seals the deal for Step Function being _the_
           | orchestration tool if you are in AWS.
        
         | rockostrich wrote:
         | Yea they look pretty similar. Not sure what the configuration
         | language around step function is, but Airflow is nice because
         | it's just python. There are a few gotchas you encounter,
         | especially around templating, when you first get started but
         | besides that it's pretty low overhead to start doing some
         | pretty complex things since it's just writing python code.
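          | 
          | For illustration, a minimal sketch of what that looks like
          | (dag/task/file names are made up; module paths assume Airflow
          | 2.x). The usual templating gotcha is that Jinja like {{ ds }}
          | only renders on arguments an operator lists in
          | template_fields, such as bash_command here:
          | 
          |     from datetime import datetime
          | 
          |     from airflow import DAG
          |     from airflow.operators.bash import BashOperator
          | 
          |     # A small daily DAG with one templated task.
          |     with DAG(
          |         dag_id="example_etl",
          |         start_date=datetime(2021, 1, 1),
          |         schedule_interval="@daily",
          |         catchup=False,
          |     ) as dag:
          |         extract = BashOperator(
          |             task_id="extract",
          |             # {{ ds }} is rendered because bash_command is a
          |             # templated field; on a non-templated argument the
          |             # same string would stay literal.
          |             bash_command="python extract.py --date {{ ds }}",
          |         )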
        
           | fredliu wrote:
           | Step Function uses JSON to describe the DAG, or if you use
           | CDK you can define the DAG using any supported language
           | (typescript, python, etc.).
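            | 
            | If it helps, a rough sketch of the CDK (Python, v1) flavour
            | (stack and state names are made up; CDK synthesizes the
            | Amazon States Language JSON from this):
            | 
            |     from aws_cdk import core
            |     from aws_cdk import aws_stepfunctions as sfn
            | 
            |     class EtlStack(core.Stack):
            |         def __init__(self, scope, construct_id, **kwargs):
            |             super().__init__(scope, construct_id, **kwargs)
            | 
            |             # Pass states standing in for real tasks.
            |             extract = sfn.Pass(self, "Extract")
            |             load = sfn.Pass(self, "Load")
            | 
            |             # .next() chains states into the workflow graph.
            |             sfn.StateMachine(
            |                 self,
            |                 "EtlStateMachine",
            |                 definition=extract.next(load),
            |             )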
        
       | willvarfar wrote:
       | Very solid article. Even with where it's published, it's jolly
       | sensible.
       | 
       | I would like to jump in and say use Beam instead of DBT, but tbh
       | that's bad advice. What the world needs is something open source
       | with the incremental model of beam, a fast incremental backend
       | (thinking htap storage that mixes columns and rows automagically)
       | and the ease and maintainability of DBT. There is just this
       | massive hole. If some combination of tools could fill it, that
       | would be the new LAMP stack for data.
        
         | chrisjc wrote:
         | Are you perhaps talking about something like
         | https://materialize.com/ ? (btw, dbt now has some materialize
         | compatibility)
         | 
         | Maybe Pravega and Beam working together?
         | https://pravega.io/docs/v0.6.0/key-features/
         | 
         | Another option is something like Snowflake with tasks and
         | streams. https://docs.snowflake.com/en/user-guide/tasks-
         | intro.html
         | 
         | Or Snowflake with change streams, dbt and scheduler in
         | combination with lambda views.
         | https://discourse.getdbt.com/t/how-to-create-near-real-time-...
        
         | wara23arish wrote:
         | Are you talking about Apache Beam? I happened to land a job
          | where I learned how to create Dataflow pipelines using Apache
          | Beam (Java) on GCP.
          | 
          | I'm a little worried that I might be investing my time in a
          | skill/tool that isn't that much in demand (instead of common
          | backend/frontend development).
        
           | joeswartz wrote:
           | Apache Beam is a quite nice and flexible tool. If you prefer
           | Scala, then you can try Scio:
           | https://github.com/spotify/scio.
        
         | Arimbr wrote:
         | Oh, you should check Materialize. I feel Materialize is like
         | dbt but with an ingestion layer and real-time materialized
         | views.
         | 
          | To deliver that you need to centralize data in their
          | Materialize database, which is my main caveat. With dbt you
          | can use any data warehouse.
        
       | 41b696ef1113 wrote:
       | Has anyone kicked the tires on Airflow, Prefect, and Dagster and
       | care to give their thoughts? My initial foray into Airflow 1 met
       | a lot of complexity that both Prefect and Dagster claim to
       | minimize.
        
       | yamrzou wrote:
       | I have worked with Airflow during the past three years, but
       | recently we adopted Dagster and I have been using it for the past
       | 3 months. I have found it quite joyful to use and the experience
        | has been very positive. Its main advantages compared to Airflow
        | (IMO):
        | 
        |     - A great UI
        |     - It forces you to clearly define inputs, outputs and types.
        |     - Separation of concerns: between configuration and data,
        |       between processing and IO, and between code and deployment
        |       options.
        |     - It allows you to define flexible DAGs which you can
        |       configure at runtime, which makes it easy to run locally
        |       or in k8s, or to switch the storage backend depending on
        |       the environment.
       | 
       | This blog post by the founder outlines the differences between
       | the two in much more detail: https://dagster.io/blog/dagster-
       | airflow
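        | 
        | For a taste, a minimal sketch with Dagster's op/job APIs (the
        | op names and data here are made up):
        | 
        |     from dagster import job, op
        | 
        |     @op
        |     def extract() -> list:
        |         # Outputs and their types are explicit and tracked.
        |         return [1, 2, 3]
        | 
        |     @op
        |     def load(rows: list) -> None:
        |         print(f"loaded {len(rows)} rows")
        | 
        |     @job
        |     def etl():
        |         load(extract())
        | 
        | Running it locally is just etl.execute_in_process(), and the
        | same job can be configured differently for k8s or another
        | storage backend.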
        
         | StreamBright wrote:
          | I was just looking for a replacement for Airflow. This is
          | really helpful.
        
       | swyx wrote:
       | > The main issue with Airflow transfer operators is that if you
       | want to support transfers from M sources to N destinations, the
       | community would need to code N x M Airflow operators.
       | 
       | I'm biased but this is a nonissue with workflow-as-code solutions
       | like temporal.io (which Airbyte uses). N activities pulling data
       | from sources, M activities sending data to destinations, write
       | whatever translation layers you want in your workflows.
       | 
       | links to examples https://temporal.io/usecases#Pipelines and our
       | community meetup where Airbyte spoke about their needs
       | https://www.youtube.com/watch?v=K25Bt5asd8I
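        | 
        | For a flavour of the shape, a rough sketch with Temporal's
        | Python SDK (activity names and timeouts are made up, and the
        | real connector logic is elided):
        | 
        |     from datetime import timedelta
        | 
        |     from temporalio import activity, workflow
        | 
        |     @activity.defn
        |     async def pull_from_source(table: str) -> list:
        |         ...  # one of N source activities
        | 
        |     @activity.defn
        |     async def push_to_destination(rows: list) -> None:
        |         ...  # one of M destination activities
        | 
        |     @workflow.defn
        |     class TransferWorkflow:
        |         @workflow.run
        |         async def run(self, table: str) -> None:
        |             rows = await workflow.execute_activity(
        |                 pull_from_source, table,
        |                 start_to_close_timeout=timedelta(minutes=10),
        |             )
        |             await workflow.execute_activity(
        |                 push_to_destination, rows,
        |                 start_to_close_timeout=timedelta(minutes=10),
        |             )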
        
       | Arimbr wrote:
       | [author of the article] My main concern about using Airflow for
       | the EL parts is that sources and destinations are highly coupled
       | with Airflow transfer operators (e.g.
       | PostgresToBigQueryOperator). The community needs to provide M * N
       | operators to cover all possible transfers. Other open-source
        | projects like Airbyte decouple sources from destinations, so the
       | community only needs to contribute 2 * (M + N) connectors.
       | 
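        | To make that concrete, an illustrative sketch (not Airbyte's
        | actual interfaces): every source emits a common record format
        | and every destination consumes it, so one generic sync composes
        | any pair.
        | 
        |     from typing import Dict, Iterable, Protocol
        | 
        |     Record = Dict[str, object]
        | 
        |     class Source(Protocol):
        |         def read(self) -> Iterable[Record]: ...
        | 
        |     class Destination(Protocol):
        |         def write(self, records: Iterable[Record]) -> None: ...
        | 
        |     def sync(source: Source, destination: Destination) -> None:
        |         # One generic transfer covers any source paired with any
        |         # destination, so connectors are written per source and
        |         # per destination rather than per pair.
        |         destination.write(source.read())
        | 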
       | Another concern about using Airflow for the T part is that you
       | need to code the dependencies between models both in your SQL
       | files and your Airflow DAG. Other open-source projects like dbt
       | create a DAG from the model dependencies in the SQL files.
       | 
       | So I advocate for integrating Airflow scheduler with Airbyte and
       | dbt.
       | 
        | Curious to know how others use Airflow for ETL/ELT pipelines.
        
         | contravariant wrote:
          | I mostly don't bother writing separate Operators. The only
          | parts I write are the so-called Hooks (which are basically just
          | Airflow's way of defining a standard way of grabbing
          | credentials and instantiating a session object).
         | 
         | After that you just write a short python function that grabs
         | the data from one hook and pushes it to another. Which is
         | basically the (M + N) solution you mention (I think the factor
         | 2 is unnecessary if you've already split sources and sinks).
         | 
         | This approach works with anything you can connect to python.
         | Though for particularly large datasets you want to be careful
         | that you don't accidentally store all data in memory at once.
          | And sure, you can sometimes specialize an operation for a
          | particular use case (e.g. if, in your example, you can instruct
          | BigQuery to connect to the Postgres application natively), but
          | usually it works just fine to use a Python script in between.
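          | 
          | For what it's worth, a rough sketch of that pattern
          | (connection ids, table and bucket are placeholders; provider
          | import paths are Airflow 2.x):
          | 
          |     from airflow.providers.amazon.aws.hooks.s3 import S3Hook
          |     from airflow.providers.postgres.hooks.postgres import (
          |         PostgresHook,
          |     )
          | 
          |     def postgres_to_s3(table: str, bucket: str, key: str) -> None:
          |         # Hooks only handle credentials and sessions; the glue
          |         # logic is a plain function called from a PythonOperator.
          |         src = PostgresHook(postgres_conn_id="my_postgres")
          |         dst = S3Hook(aws_conn_id="my_aws")
          | 
          |         # Fine for modest tables; for large ones, stream or chunk
          |         # instead of holding everything in memory at once.
          |         df = src.get_pandas_df(f"SELECT * FROM {table}")
          |         dst.load_string(
          |             df.to_csv(index=False),
          |             key=key,
          |             bucket_name=bucket,
          |             replace=True,
          |         )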
        
         | verdverm wrote:
         | Is there really a 1-1 mapping between SQL and T?
         | 
         | What about use cases where the data lives in an object store?
         | How does dbt deal with that?
        
           | Arimbr wrote:
            | My understanding of dbt is that it builds a DAG based on the
            | interdependencies between models. The interdependencies are
            | parsed from 'ref' functions in the SQL files. The thing with
            | dbt is
           | that you transform the data within a single data warehouse.
           | 
           | So, you would normally first load all data to the data
           | warehouse. Then dependencies between SQL models are easier to
           | map.
        
           | ibains wrote:
            | We work with many businesses that are larger (Fortune 500),
            | where the T per pipeline is, say, 60 steps with 1200 columns
            | at 10TB scale and uses multiple things not in SQL. They look
            | up object stores, look up web services, and use RocksDB, and
            | partitioning is important. At scale, cost becomes critical -
            | some are even moving to their own Spark on Kubernetes. ML is
            | done on data after ETL into the data lake.
           | 
           | None of them can use DBT for core ETL, but DBT might be good
           | later for views, some dimensional modeling. They have done a
           | good job here.
           | 
           | Think of it as the modern small-scale data stack.
        
             | verdverm wrote:
             | Have you explored Cuelang for T?
        
         | rockostrich wrote:
         | Is that really that bad of a problem? In the worst case, you
         | use the underlying hooks to create a custom operator and in the
         | worst worst case you do something like have a
         | Spark/Dataflow/k8s pod/whatever other single threaded or
         | distributed task runner act as that custom operator.
         | 
          | I'm running into the "incremental load" problem now trying to
          | batch intervals of Avro files into BigQuery, but Airflow
          | doesn't seem to complain much about 10 minute intervals (less
          | than that and I'd be a bit worried). One workaround we're
          | considering, to avoid having 144 dag runs/dag/day for 10
          | minute intervals, is to just define it as a daily dag and
          | periodically clear it throughout the day.
         | 
         | I have heard of newer companies who don't need to deal with
         | legacy ETL going the ELT route which is interesting to me. I'm
         | curious who owns the transformation in that case. Do you just
         | hire more data scientists or maybe a bunch of junior
         | devs/analysts/DSs to own data cleaning and transformation? The
         | way our transformation works is that the engineering team that
         | owns the application data owns the transformation so they can
         | document exactly what the data going into the BI tool is. It
         | adds overhead on the engineering side, but not nearly as much
         | as throwing the data in a lake and hoping someone downstream
         | knows what to do with it.
        
           | Grimm1 wrote:
           | In a previous job we/I trained the analysts to be more
           | technical and write the T part of ELT in DBT. They
           | effectively became what is known as "Analytics Engineers" so
           | they owned the T and then wrote their analysis on top of the
           | models they had created.
        
             | rockostrich wrote:
             | That works for ELT, especially if you have documentation
             | around the raw data being loaded in but sounds like it adds
             | a bit of overhead to the analysts' jobs which may or may
             | not be more than just having the engineering team own it
             | and document it well (something they already have to do for
             | the analysts to write transformation code). I'm curious how
             | you handle the upstream data schema changing. Loading in
             | raw data means handling compatibility in another place
             | outside the application.
        
               | KptMarchewa wrote:
               | In the end, analysts are cheaper than engineers.
        
               | rockostrich wrote:
               | Not if it's just a part of those engineers' jobs. They're
               | already familiar with the underlying application data so
               | owning the transformation is just understanding what the
               | data needs to look like and documenting it. They're going
               | to need to document the raw data anyway to avoid those
               | analysts asking them a million questions. Might as well
               | avoid hiring analysts who can also learn the
               | transformation bit and just give them good data.
        
           | Arimbr wrote:
           | Really good points! I don't think that Airflow is necessarily
           | a problem if your data engineering team knows how to best use
           | Airflow Operators, Hooks and DAGs for incremental loads. But
           | because Airflow is not an opinionated ETL/ELT tool, most
           | often I see a lot of custom code that could be improved...
           | 
            | You know there is this "data mesh" hype now. I think the
            | idea behind it is to empower the data consumers within the
            | company (data analysts), who know the data best, to create
            | and maintain the models. That's easier said than done, and
            | it most often turns into a worse situation than when only
            | data engineers can model data... I've only heard of Zalando
            | having successfully distributed data ownership within the
            | company.
        
             | rockostrich wrote:
             | Yea, I wasn't familiar with Airbyte before writing that
             | comment so now I'm seeing the value in it. We have tons of
             | teams asking "how do I get this data into BigQuery" and the
             | answer is usually "use this airflow operator to dump it
             | into GCS and then use this airflow operator to load it into
             | BigQuery" which isn't super useful for a non-technical
             | person or even really any technical person not familiar
             | with Airflow.
             | 
             | A mesh is certainly something in-between a lake and a
             | warehouse... Something super simple that I've gotten good
             | feedback on so far from DSs is just documenting the
             | transformed data in place. It was really difficult to do
             | this in our old ETL stack (data pulled from HBase,
             | transformed to parquet + Hive in HDFS) but we've moved a
             | lot of it over to Avro files loaded into BigQuery where we
             | can just put decorators on our Scala transformation code
             | that's writing the Avro files and that updates the schema
             | with descriptions in BigQuery. Gives a nice bit of
             | ownership to the engineering team and lets the DS using the
             | data be a lot more autonomous. That boundary has to exist
             | somewhere (or I guess in many places for a "mesh") so
             | having it distinctly at data getting loaded in feels right
             | to me.
        
               | Arimbr wrote:
               | Nice work there! I also think that the next challenge for
               | data teams is all this data documentation and discovery
               | work.
               | 
               | I still think that Airflow is great for power data
               | engineers. Airbyte and dbt are positioned to empower data
               | analysts (or lazy data engineers like me) to own the
               | ELTs.
        
               | rockostrich wrote:
               | Agreed. I see a lot of folks coming up with one off
               | solutions for pulling data out of 3rd party sources like
               | Kustomer or Lever. Giving a centralized UI for setting
               | that up would be a great service.
               | 
               | Seems like I have a fun weekend project.
        
         | smarx007 wrote:
         | I am curious if Airflow is the most appropriate tool for such
         | tasks. I would imagine Apache Camel to be a good starting point
         | for building a library of connectors and then plugging them
         | together using EIP [1] patterns?
         | 
         | [1]: https://martinfowler.com/books/eip.html
        
         | DevKoala wrote:
         | My team is currently evaluating Debezium for a PG to S3 CDC
         | solution, and is very advanced with the PoC.
         | 
          | What would be the argument for trying Airbyte? It's the first
          | time we've heard about it.
        
           | Arimbr wrote:
            | Airbyte CDC is based on Debezium, but Airbyte abstracts it
            | away and makes it easier to CDC from Postgres, MySQL or MSSQL
            | to any supported destination (including S3). Here is the doc
            | for CDC: https://docs.airbyte.io/understanding-airbyte/cdc
           | 
           | I guess one benefit is that you can use Airbyte for all your
           | data syncs, CDC and non-CDC. You can give it a try with your
           | own data, and see if it's easier for your team. You can run
           | Airbyte locally with Docker Compose:
           | https://docs.airbyte.io/quickstart/deploy-airbyte
        
             | gunnarmorling wrote:
             | Can you elaborate on how Airbyte makes things easier for a
             | user? Would love to pick up any potential improvements in
             | Debezium itself, so that all its users get to benefit from
             | them, rather than only users of a specific integrator like
             | Airbyte.
             | 
             | Disclaimer: I work on Debezium
        
         | pid-1 wrote:
         | 1 - Hide your ETL logic in REST APIs and use common
         | microservice patterns for monitoring, logging, etc...
         | 
         | 2 - Use PythonOperator to call your APIs.
         | 
         | It's working really well for us.
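          | 
          | A stripped-down sketch of the shape (URL, payload and ids are
          | placeholders):
          | 
          |     from datetime import datetime
          | 
          |     import requests
          |     from airflow import DAG
          |     from airflow.operators.python import PythonOperator
          | 
          |     def trigger_transform(**context):
          |         # The ETL logic lives behind the API; the DAG only
          |         # orchestrates and retries.
          |         resp = requests.post(
          |             "https://etl-api.internal/jobs/daily-transform",
          |             json={"run_date": context["ds"]},
          |             timeout=60,
          |         )
          |         resp.raise_for_status()
          | 
          |     with DAG(
          |         dag_id="call_etl_api",
          |         start_date=datetime(2021, 1, 1),
          |         schedule_interval="@daily",
          |         catchup=False,
          |     ) as dag:
          |         PythonOperator(
          |             task_id="run_transform",
          |             python_callable=trigger_transform,
          |             retries=3,
          |         )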
        
       | agustif wrote:
       | I once made a simple ETL concept library for a coding challenge,
       | I actually never published it, and it's not like performant or
       | anything, but maybe I should open source it?
        
       | SPascareli13 wrote:
        | Something I've been thinking about is: like the article says,
        | SQL is a very good way to transform data, and it recommends dbt
        | for it, but how do you test this transformation?
        | 
        | I know dbt has tests, but on a superficial look they seem like
        | pretty trivial stuff, like "check if this field is null" and
        | things like that. What about tests where I set up scenarios and
        | see if the end result of my transformation is what I expect? Are
        | there any good tools for this?
        
         | msinxi wrote:
          | Check out Great Expectations (https://greatexpectations.io/).
          | You basically run assertions on your data. It auto-generates
          | documentation for you, and you can also store the results of
          | your validations in a flat file or database. I think there are
          | dbt modules that try to mimic Great Expectations' validations,
          | but I like the docs that come with Great Expectations.
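          | 
          | A quick sketch with the pandas API (the data here is invented;
          | in practice you'd point it at the output of your dbt model):
          | 
          |     import great_expectations as ge
          |     import pandas as pd
          | 
          |     # Pretend this is the transformed table you want to check.
          |     df = ge.from_pandas(
          |         pd.DataFrame({"order_id": [1, 2, 3],
          |                       "amount": [10.0, 5.5, 7.25]})
          |     )
          | 
          |     # Assertions about the result, not just null checks.
          |     df.expect_column_values_to_be_unique("order_id")
          |     df.expect_column_values_to_be_between("amount", min_value=0)
          | 
          |     result = df.validate()
          |     print(result.success)
          | 
          | For scenario-style tests you can build a small input fixture,
          | run the transformation against it, and assert on the output
          | the same way.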
        
       | TrealTwan wrote:
       | The article says that "SQL is taking over Python to transform and
       | analyze data in the modern data stack". Are other people starting
        | to notice this as ELT becomes more popular than traditional ETL?
       | 
        | Haven't used Airflow before, but I use Azure Data Factory in my
        | org to load the raw data into the data warehouse and then
        | transform it into data models using SQL.
        
         | CRConrad wrote:
          | > Are other people starting to notice this as ELT becomes more
          | popular than traditional ETL?
          | 
          | I invented ELT.
          | 
          | No, duh, of course not... _But:_ I and my then-colleagues,
          | other DW consultants, started noticing some time in the early-
          | to-mid-00s that what we were doing didn't actually follow the
          | order of the "ETL" acronym, and occasionally commented to each
          | other along the lines of "shouldn't this be called 'ELT' to be
          | more descriptive?". So I don't quite get the _"starting to
          | notice this as ELT becomes more popular"_ bit -- this feels
          | quite old to me. Before most of the meteoric rise of Python.
        
       | psychometry wrote:
       | If your ETL code is written in Python, just use Prefect and thank
       | me later: https://docs.prefect.io/
        
         | yamrzou wrote:
         | Could you elaborate?
        
       | 0x500x79 wrote:
       | Some people have noted that this is a very Airbyte specific
       | article, but I think that the lessons learned are still
       | important.
       | 
       | I have managed Airflow as a managed service for a company that
       | has thousands of DAGs and one of our keys to success was
       | splitting the compute and scheduling concepts into different
       | components. We standardized on where our compute ran (Databricks,
       | Spark, Lambdas, or K8s jobs) and had Airflow purely as a
       | scheduler/orchestration tool.
       | 
        | Scaling your Airflow worker nodes to handle big-data-scale
        | transformations/extractions is a pain, especially when attempting
        | to support customers who want to run larger and larger jobs.
        | Splitting these concepts allowed us to prevent noisy-neighbor
        | issues, gave Airflow high reliability as a component for all of
        | our customers, and removed the need for M * N operators.
        
         | sails wrote:
          | How best to achieve this? I'm considering Astronomer or AWS-
          | hosted Airflow, the benefit of AWS being that the compute
          | components are easily accessible within the AWS ecosystem.
          | 
          | We're starting out at a smaller scale and with a lower
          | commitment to Airflow, so I'd like the highest reliability with
          | the least hassle.
         | 
         | https://aws.amazon.com/blogs/aws/introducing-amazon-managed-...
        
           | 0x500x79 wrote:
           | It depends on what other tools you are currently using for
           | ETL (or want to use). One example is that we used Spark so we
           | would use the Spark submit operator to submit jobs to
           | clusters. You can also use the K8s Pod operator if you want
           | to utilize containers for your compute.
           | 
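            | For example, a rough sketch of the Spark flavour (declared
            | inside a DAG as usual; paths, conn id and conf values are
            | placeholders):
            | 
            |     from airflow.providers.apache.spark.operators.spark_submit import (
            |         SparkSubmitOperator,
            |     )
            | 
            |     # Airflow only schedules; the heavy lifting runs on the
            |     # Spark cluster behind the "spark_default" connection.
            |     transform = SparkSubmitOperator(
            |         task_id="transform",
            |         application="/jobs/transform.py",
            |         conn_id="spark_default",
            |         conf={"spark.executor.memory": "4g"},
            |     )
            | 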
            | There are a lot of options. We were adopters before AWS-
            | hosted Airflow was a thing, so I don't have any experience
            | running AWS-hosted Airflow.
            | 
            | I haven't looked recently to see if some of the challenges we
            | faced early on are solved now, but most of them stemmed from
            | how DAG updates were handled: changing the start date on a
            | DAG would break your DAG forever until you went and updated
            | the database by hand. Things like this are(/were?) super
            | painful and could get worse with a managed solution.
        
           | chrisjc wrote:
           | We used Astronomer at my last job. Sure beats trying to run
           | your own instance and the Astronomer team were fantastic.
        
           | higeorge13 wrote:
           | If you are on aws, then step functions could be your best
           | option. They are responsible for the workflow definition, and
           | the compute load is completely dependent on the underlying
           | tasks and aws services you are going to use.
           | 
           | Edit: typos
        
           | pid-1 wrote:
            | Amazon's Airflow offering sucks really hard. As with many of
            | their services, they did an awful job of designing a good UX
            | and hiding the product's complexity from their users.
           | 
           | Astronomer is cool, but expensive and they won't accept
           | monthly billing. Also, it's even more expensive if you need
            | DAGs to access things within a VPC.
           | 
           | Step Functions is decent for simple use cases and very
           | reliable. For complex stuff, you will hate their JSON based
           | DSL with passion.
        
           | numlocked wrote:
           | At Grove we make extensive use of KubernetesPodOperator[0].
            | This allows us to both encapsulate the job in containers
            | (k8s pods) and specify the resources we need for each DAG.
           | 
           | [0] https://airflow.apache.org/docs/apache-airflow-providers-
           | cnc...
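            | 
            | A rough sketch of one such task (image, namespace and ids
            | are illustrative; it sits inside a normal DAG definition,
            | and resource requests/limits can be set per task, though
            | the exact argument name varies across provider versions):
            | 
            |     from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
            |         KubernetesPodOperator,
            |     )
            | 
            |     transform = KubernetesPodOperator(
            |         task_id="transform",
            |         name="transform",
            |         namespace="data-jobs",
            |         image="registry.internal/etl/transform:latest",
            |         cmds=["python", "transform.py"],
            |         get_logs=True,
            |         is_delete_operator_pod=True,  # clean up finished pods
            |     )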
        
         | pid-1 wrote:
         | +1
         | 
         | In my org we never use Airflow to compute anything. We only use
         | a single operator: PythonOperator.
         | 
          | All business and data access logic is encapsulated in REST
          | APIs. Our DAGs are responsible for calling those APIs in the
          | correct order, at the correct time, retrying when needed. I
          | really
         | dislike using Airflow for anything else, as it generally
         | becomes a huge mess.
        
       ___________________________________________________________________
       (page generated 2021-10-08 23:01 UTC)