[HN Gopher] ETL Pipelines with Airflow: The Good, the Bad and th...
___________________________________________________________________
ETL Pipelines with Airflow: The Good, the Bad and the Ugly
Author : Arimbr
Score : 112 points
Date : 2021-10-08 14:33 UTC (8 hours ago)
(HTM) web link (airbyte.io)
(TXT) w3m dump (airbyte.io)
| ltbarcly3 wrote:
| This is just advertising copy. It isn't giving unbiased advice;
| there is an obvious conflict of interest here.
|
| I was hoping to learn some things to help me avoid common airflow
| problems, but it only talks about database sync jobs for the most
| part and jumps into pitching Airbyte over and over, the last 2/3
| of the article being a sales pitch, right out of a marketing
| class.
|
| Hi to the author and other Airbyte employees pushing this to the
| frontpage! I hope you didn't have to get up too early to
| coordinate your voting. Make sure to give this comment a downvote
| so we know you are out there!
| mkw5053 wrote:
| Having always been on AWS and using Glue Spark Jobs for my jobs,
| I've never felt any benefit of using Airflow for orchestration
| over Glue Workflows. I can understand some people not wanting to
| deal with vendor lock-in. I'm curious what others' opinions are.
| higeorge13 wrote:
| For glue jobs you probably don't need anything more. However,
| step functions is the norm for creating workflows in aws
| nowadays.
| zepmck wrote:
| If you have GPUs, NVTabular outperforms most of the frameworks
| out there: https://github.com/NVIDIA/NVTabular
| fredliu wrote:
| I've never used Airflow, but used Step Functions in AWS to pretty
| much achieve the same things this article described. I wonder if
| anybody has used both and what the pros and cons are between
| them? Besides the obvious point that Step Functions is an AWS
| service, so it works better within the AWS ecosystem, while
| Airflow is open source and service/provider agnostic?
| higeorge13 wrote:
| A couple points from my end:
|
| - I miss the airflow ui to monitor the workflow execution,
| clear failed tasks, its pre-built notifications, emails and so
| on.
|
| - Passing state between step functions is a bit tricky tbh. I
| found airflow's way a little more straightforward (see the
| sketch below).
|
| - Step functions seem to scale better, especially when you aim
| to use them for dynamic workflows. The recent update of step
| functions, where you can literally use any aws service api
| (compared to a couple of weeks ago, when you were restricted to
| around 10, and before then you only had lambdas), minimises
| their weaknesses imo.
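|
| A minimal sketch of the airflow side of that state hand-off
| (XComs); dag/task ids and values here are placeholders:
|
|     from datetime import datetime
|
|     from airflow import DAG
|     from airflow.operators.python import PythonOperator
|
|     def extract(**_):
|         # the return value is pushed to XCom automatically
|         return {"rows": 42}
|
|     def load(ti, **_):
|         # downstream tasks pull it back out by task id
|         state = ti.xcom_pull(task_ids="extract")
|         print(state["rows"])
|
|     with DAG(
|         dag_id="xcom_example",
|         start_date=datetime(2021, 10, 1),
|         schedule_interval="@daily",
|         catchup=False,
|     ) as dag:
|         extract_task = PythonOperator(task_id="extract",
|                                       python_callable=extract)
|         load_task = PythonOperator(task_id="load",
|                                    python_callable=load)
|         extract_task >> load_task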
| fredliu wrote:
| Ya, the new Step Function service integration feature is
| awesome, really seals the deal for Step Function being _the_
| orchestration tool if you are in AWS.
| rockostrich wrote:
| Yea they look pretty similar. Not sure what the configuration
| language around step function is, but Airflow is nice because
| it's just python. There are a few gotchas you encounter,
| especially around templating, when you first get started but
| besides that it's pretty low overhead to start doing some
| pretty complex things since it's just writing python code.
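|
| For instance, a tiny DAG really is just python; the templating
| gotchas mostly come from Jinja-rendered fields like bash_command
| (a rough sketch, ids and commands are placeholders):
|
|     from datetime import datetime
|
|     from airflow import DAG
|     from airflow.operators.bash import BashOperator
|
|     with DAG(
|         dag_id="just_python",
|         start_date=datetime(2021, 10, 1),
|         schedule_interval="@daily",
|         catchup=False,
|     ) as dag:
|         # {{ ds }} renders per run because bash_command is a
|         # templated field; ordinary python strings are not
|         extract = BashOperator(task_id="extract",
|                                bash_command="echo extract {{ ds }}")
|         load = BashOperator(task_id="load",
|                             bash_command="echo load {{ ds }}")
|         extract >> load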
| fredliu wrote:
| Step Function uses JSON to describe the DAG, or if you use
| CDK you can define the DAG using any supported language
| (typescript, python, etc.).
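|
| A rough sketch of defining a Step Functions DAG in python via
| CDK (assuming CDK v1-style imports; the Pass states are
| placeholders for real Lambda/Glue tasks):
|
|     from aws_cdk import core
|     from aws_cdk import aws_stepfunctions as sfn
|
|     class EtlStack(core.Stack):
|         def __init__(self, scope, construct_id, **kwargs):
|             super().__init__(scope, construct_id, **kwargs)
|             extract = sfn.Pass(self, "Extract")
|             transform = sfn.Pass(self, "Transform")
|             load = sfn.Pass(self, "Load")
|             sfn.StateMachine(
|                 self,
|                 "EtlStateMachine",
|                 definition=extract.next(transform).next(load),
|             )
|
|     app = core.App()
|     EtlStack(app, "etl")
|     app.synth()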
| willvarfar wrote:
| Very solid article. Even with where it's published, it's jolly
| sensible.
|
| I would like to jump in and say use Beam instead of DBT, but tbh
| that's bad advice. What the world needs is something open source
| with the incremental model of beam, a fast incremental backend
| (thinking htap storage that mixes columns and rows automagically)
| and the ease and maintainability of DBT. There is just this
| massive hole. If some combination of tools could fill it, that
| would be the new LAMP stack for data.
| chrisjc wrote:
| Are you perhaps talking about something like
| https://materialize.com/ ? (btw, dbt now has some materialize
| compatibility)
|
| Maybe Pravega and Beam working together?
| https://pravega.io/docs/v0.6.0/key-features/
|
| Another option is something like Snowflake with tasks and
| streams. https://docs.snowflake.com/en/user-guide/tasks-
| intro.html
|
| Or Snowflake with change streams, dbt and scheduler in
| combination with lambda views.
| https://discourse.getdbt.com/t/how-to-create-near-real-time-...
| wara23arish wrote:
| Are you talking about Apache Beam? I happened to land a job
| where I learned how to create Dataflow pipelines using Apache
| Beam (Java) on GCP.
|
| I'm a little worried that I might be investing my time in a
| skill/tool that isn't that much in demand (instead of common
| backend/frontend development).
| joeswartz wrote:
| Apache Beam is quite a nice and flexible tool. If you prefer
| Scala, then you can try Scio:
| https://github.com/spotify/scio.
| Arimbr wrote:
| Oh, you should check Materialize. I feel Materialize is like
| dbt but with an ingestion layer and real-time materialized
| views.
|
| To deliver that you need to centralize data in the Materialize
| database, which is my main caveat. With dbt you can use any
| data warehouse.
| 41b696ef1113 wrote:
| Has anyone kicked the tires on Airflow, Prefect, and Dagster and
| care to give their thoughts? My initial foray into Airflow 1 met
| a lot of complexity that both Prefect and Dagster claim to
| minimize.
| yamrzou wrote:
| I have worked with Airflow during the past three years, but
| recently we adopted Dagster and I have been using it for the past
| 3 months. I have found it quite joyful to use and the experience
| has been very positive. Its main advantages compared to Airflow
| (IMO):
|
| - A great UI.
|
| - It forces you to clearly define inputs, outputs and types.
|
| - Separation of concerns: between configuration and data, between
| processing and IO, and between code and deployment options.
|
| - It allows you to define flexible dags which you can configure
| at runtime, which makes it easy to run locally or in k8s, or to
| switch the storage backend depending on the environment.
|
| This blog post by the founder outlines the differences between
| the two in much more detail: https://dagster.io/blog/dagster-
| airflow
| StreamBright wrote:
| I am looking for a replacement for Airflow right now. This is
| really helpful.
| swyx wrote:
| > The main issue with Airflow transfer operators is that if you
| want to support transfers from M sources to N destinations, the
| community would need to code N x M Airflow operators.
|
| I'm biased but this is a nonissue with workflow-as-code solutions
| like temporal.io (which Airbyte uses). N activities pulling data
| from sources, M activities sending data to destinations, write
| whatever translation layers you want in your workflows.
|
| links to examples https://temporal.io/usecases#Pipelines and our
| community meetup where Airbyte spoke about their needs
| https://www.youtube.com/watch?v=K25Bt5asd8I
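|
| A rough sketch of that shape with Temporal's Python SDK
| (activity and workflow names, configs and timeouts here are
| made up):
|
|     from datetime import timedelta
|
|     from temporalio import activity, workflow
|
|     @activity.defn
|     async def pull_from_source(source_config: dict) -> list:
|         ...  # one such activity per source (N total)
|
|     @activity.defn
|     async def push_to_destination(records: list) -> None:
|         ...  # one such activity per destination (M total)
|
|     @workflow.defn
|     class SyncWorkflow:
|         @workflow.run
|         async def run(self, source_config: dict) -> None:
|             records = await workflow.execute_activity(
|                 pull_from_source,
|                 source_config,
|                 start_to_close_timeout=timedelta(minutes=10),
|             )
|             # any translation layer is just plain python here
|             await workflow.execute_activity(
|                 push_to_destination,
|                 records,
|                 start_to_close_timeout=timedelta(minutes=10),
|             )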
| Arimbr wrote:
| [author of the article] My main concern about using Airflow for
| the EL parts is that sources and destinations are highly coupled
| with Airflow transfer operators (e.g.
| PostgresToBigQueryOperator). The community needs to provide M * N
| operators to cover all possible transfers. Other open-source
| projects like Airbyte decouple sources from destinations, so the
| community only needs to contribute 2 * (M + N) connectors.
|
| Another concern about using Airflow for the T part is that you
| need to code the dependencies between models both in your SQL
| files and your Airflow DAG. Other open-source projects like dbt
| create a DAG from the model dependencies in the SQL files.
|
| So I advocate for integrating Airflow scheduler with Airbyte and
| dbt.
|
| Curious to know how others use Airflow for ETL/ELT pipelines?
| contravariant wrote:
| I mostly don't bother writing separate Operators. The only parts
| I write are the so-called Hooks (which are basically just
| airflow's way of defining a standard way of grabbing
| credentials and instantiating a session object).
|
| After that you just write a short python function that grabs
| the data from one hook and pushes it to another. Which is
| basically the (M + N) solution you mention (I think the factor
| 2 is unnecessary if you've already split sources and sinks).
|
| This approach works with anything you can connect to python.
| Though for particularly large datasets you want to be careful
| that you don't accidentally store all data in memory at once.
| And sure, you can sometimes specialize an operation for a
| particular use case (e.g. if, in your example, you can instruct
| BigQuery to connect to Postgres natively), but usually it works
| just fine to use a python script in-between.
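|
| The short glue function mentioned above is roughly something
| like this (connection ids, SQL and the S3 target are
| placeholders):
|
|     import csv
|     import io
|
|     from airflow.providers.amazon.aws.hooks.s3 import S3Hook
|     from airflow.providers.postgres.hooks.postgres import PostgresHook
|
|     def postgres_to_s3(sql, bucket, key):
|         # hooks handle credentials and sessions; this function
|         # is the only "operator" logic needed
|         pg = PostgresHook(postgres_conn_id="source_pg")
|         rows = pg.get_records(sql)
|         buf = io.StringIO()
|         csv.writer(buf).writerows(rows)
|         S3Hook(aws_conn_id="dest_s3").load_string(
|             string_data=buf.getvalue(),
|             key=key,
|             bucket_name=bucket,
|             replace=True,
|         )
|
| Wire it into a DAG with a single PythonOperator (or @task); the
| hooks stay generic, so you only need one per source or sink.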
| verdverm wrote:
| Is there really a 1-1 mapping between SQL and T?
|
| What about use cases where the data lives in an object store?
| How does dbt deal with that?
| Arimbr wrote:
| My understanding of dbt is that it builds a DAG based on the
| interdependencies between models. The interdependencies are
| parsed from 'ref' functions in the SQL files. The thing with dbt
| is that you transform the data within a single data warehouse.
|
| So, you would normally first load all data to the data
| warehouse. Then dependencies between SQL models are easier to
| map.
| ibains wrote:
| We work with many businesses that are larger (Fortune 500)
| and the T per pipeline is say 60 steps with 1200 columns at
| 10TB scale and uses multiple things not in SQL. They lookup
| object stores, lookup web services, use rocksdb, partitioning
| is important. At scale, cost becomes critical- some are even
| moving to their own Spark on Kubernetes. ML on done on data
| after ETL into Data Lake.
|
| None of them can use DBT for core ETL, but DBT might be good
| later for views, some dimensional modeling. They have done a
| good job here.
|
| Think of it as the modern small-scale data stack.
| verdverm wrote:
| Have you explored Cuelang for T?
| rockostrich wrote:
| Is that really that bad of a problem? In the worst case, you
| use the underlying hooks to create a custom operator and in the
| worst worst case you do something like have a
| Spark/Dataflow/k8s pod/whatever other single threaded or
| distributed task runner act as that custom operator.
|
| I'm running into the "incremental load" problem now trying to
| batch intervals of Avro files into BigQuery, but Airflow
| doesn't seem to complain much about 10 minute intervals (less
| than that and I'd be a bit worried). One workaround we're
| considering, instead of having 144 dag runs/dag/day for 10 minute
| intervals, is to just define it as a daily dag and periodically
| clear it throughout the day.
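|
| For reference, a rough sketch of the 10-minute variant (bucket,
| object layout and table names are placeholders):
|
|     from datetime import datetime
|
|     from airflow import DAG
|     from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
|         GCSToBigQueryOperator,
|     )
|
|     with DAG(
|         dag_id="avro_to_bq_10min",
|         start_date=datetime(2021, 10, 1),
|         schedule_interval="*/10 * * * *",  # 144 runs/day
|         catchup=False,
|     ) as dag:
|         GCSToBigQueryOperator(
|             task_id="load_interval",
|             bucket="my-avro-bucket",
|             source_objects=["events/{{ ts_nodash }}/*.avro"],
|             destination_project_dataset_table="proj.ds.events",
|             source_format="AVRO",
|             write_disposition="WRITE_APPEND",
|         )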
|
| I have heard of newer companies who don't need to deal with
| legacy ETL going the ELT route which is interesting to me. I'm
| curious who owns the transformation in that case. Do you just
| hire more data scientists or maybe a bunch of junior
| devs/analysts/DSs to own data cleaning and transformation? The
| way our transformation works is that the engineering team that
| owns the application data owns the transformation so they can
| document exactly what the data going into the BI tool is. It
| adds overhead on the engineering side, but not nearly as much
| as throwing the data in a lake and hoping someone downstream
| knows what to do with it.
| Grimm1 wrote:
| In a previous job we/I trained the analysts to be more
| technical and write the T part of ELT in DBT. They
| effectively became what is known as "Analytics Engineers" so
| they owned the T and then wrote their analysis on top of the
| models they had created.
| rockostrich wrote:
| That works for ELT, especially if you have documentation
| around the raw data being loaded in, but it sounds like it adds
| a bit of overhead to the analysts' jobs which may or may
| not be more than just having the engineering team own it
| and document it well (something they already have to do for
| the analysts to write transformation code). I'm curious how
| you handle the upstream data schema changing. Loading in
| raw data means handling compatibility in another place
| outside the application.
| KptMarchewa wrote:
| In the end, analysts are cheaper than engineers.
| rockostrich wrote:
| Not if it's just a part of those engineers' jobs. They're
| already familiar with the underlying application data so
| owning the transformation is just understanding what the
| data needs to look like and documenting it. They're going
| to need to document the raw data anyway to avoid those
| analysts asking them a million questions. Might as well
| avoid hiring analysts who can also learn the
| transformation bit and just give them good data.
| Arimbr wrote:
| Really good points! I don't think that Airflow is necessarily
| a problem if your data engineering team knows how to best use
| Airflow Operators, Hooks and DAGs for incremental loads. But
| because Airflow is not an opinionated ETL/ELT tool, most
| often I see a lot of custom code that could be improved...
|
| You know there is this "data mesh" hype now. I think the idea
| behind it is to empower the data consumers within the company
| (data analysts), who know the data best, to create and maintain
| the models. That's easier said than done, and most often it
| turns into a worse situation than when only data engineers can
| model data... I've only heard of Zalando having successfully
| distributed data ownership within the company.
| rockostrich wrote:
| Yea, I wasn't familiar with Airbyte before writing that
| comment so now I'm seeing the value in it. We have tons of
| teams asking "how do I get this data into BigQuery" and the
| answer is usually "use this airflow operator to dump it
| into GCS and then use this airflow operator to load it into
| BigQuery" which isn't super useful for a non-technical
| person or even really any technical person not familiar
| with Airflow.
|
| A mesh is certainly something in-between a lake and a
| warehouse... Something super simple that I've gotten good
| feedback on so far from DSs is just documenting the
| transformed data in place. It was really difficult to do
| this in our old ETL stack (data pulled from HBase,
| transformed to parquet + Hive in HDFS) but we've moved a
| lot of it over to Avro files loaded into BigQuery where we
| can just put decorators on our Scala transformation code
| that's writing the Avro files and that updates the schema
| with descriptions in BigQuery. Gives a nice bit of
| ownership to the engineering team and lets the DS using the
| data be a lot more autonomous. That boundary has to exist
| somewhere (or I guess in many places for a "mesh") so
| having it distinctly at data getting loaded in feels right
| to me.
| Arimbr wrote:
| Nice work there! I also think that the next challenge for
| data teams is all this data documentation and discovery
| work.
|
| I still think that Airflow is great for power data
| engineers. Airbyte and dbt are positioned to empower data
| analysts (or lazy data engineers like me) to own the
| ELTs.
| rockostrich wrote:
| Agreed. I see a lot of folks coming up with one-off
| solutions for pulling data out of 3rd party sources like
| Kustomer or Lever. Giving a centralized UI for setting
| that up would be a great service.
|
| Seems like I have a fun weekend project.
| smarx007 wrote:
| I am curious if Airflow is the most appropriate tool for such
| tasks. I would imagine Apache Camel to be a good starting point
| for building a library of connectors and then plugging them
| together using EIP [1] patterns?
|
| [1]: https://martinfowler.com/books/eip.html
| DevKoala wrote:
| My team is currently evaluating Debezium for a PG to S3 CDC
| solution, and is very advanced with the PoC.
|
| What would be the argument for trying AirByte? It is the first
| time we hear about it.
| Arimbr wrote:
| Airbyte CDC is based on Debezium, but Airbyte abstracts it
| away and makes it easier to CDC from Postgres, MySQL and MSSQL
| to any supported destination (including S3). Here is the doc
| for CDC: https://docs.airbyte.io/understanding-airbyte/cdc
|
| I guess one benefit is that you can use Airbyte for all your
| data syncs, CDC and non-CDC. You can give it a try with your
| own data, and see if it's easier for your team. You can run
| Airbyte locally with Docker Compose:
| https://docs.airbyte.io/quickstart/deploy-airbyte
| gunnarmorling wrote:
| Can you elaborate on how Airbyte makes things easier for a
| user? Would love to pick up any potential improvements in
| Debezium itself, so that all its users get to benefit from
| them, rather than only users of a specific integrator like
| Airbyte.
|
| Disclaimer: I work on Debezium
| pid-1 wrote:
| 1 - Hide your ETL logic in REST APIs and use common
| microservice patterns for monitoring, logging, etc...
|
| 2 - Use PythonOperator to call your APIs.
|
| It's working really well for us.
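|
| A minimal sketch of that pattern (endpoint URLs and dag/task
| ids are placeholders):
|
|     from datetime import datetime
|
|     import requests
|     from airflow import DAG
|     from airflow.operators.python import PythonOperator
|
|     def trigger_job(endpoint):
|         # the actual ETL logic lives behind the API; the DAG
|         # only orchestrates and retries
|         resp = requests.post(endpoint, timeout=60)
|         resp.raise_for_status()
|
|     with DAG(
|         dag_id="call_etl_apis",
|         start_date=datetime(2021, 10, 1),
|         schedule_interval="@daily",
|         catchup=False,
|     ) as dag:
|         extract = PythonOperator(
|             task_id="extract",
|             python_callable=trigger_job,
|             op_kwargs={"endpoint": "https://etl.internal/extract"},
|         )
|         transform = PythonOperator(
|             task_id="transform",
|             python_callable=trigger_job,
|             op_kwargs={"endpoint": "https://etl.internal/transform"},
|         )
|         extract >> transform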
| agustif wrote:
| I once made a simple ETL concept library for a coding challenge.
| I actually never published it, and it's not performant or
| anything, but maybe I should open source it?
| SPascareli13 wrote:
| Something I've been thinking about is, like the article says, SQL
| is a very good way to transform data, and it recommends dbt for
| it, but how do you test this transformation?
|
| I know dbt has tests, but on a superficial look they seem like
| pretty trivial stuff, like "check if this field is null" and
| things like that. But what about tests where I set up scenarios
| and see if the end result of my transformation is what I expect?
| Are there any good tools for this?
| msinxi wrote:
| Check out Great Expectations (https://greatexpectations.io/).
| You basically run assertions on your data. It auto-generates
| documentation for you, and you can also store the results of your
| validations in a flat file or database. I think there are dbt
| modules that try to mimic Great Expectations' validations, but I
| like the docs that come with Great Expectations.
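|
| A rough sketch of what those assertions look like with the
| pandas-backed API (the data and expectations here are made up):
|
|     import great_expectations as ge
|     import pandas as pd
|
|     # stand-in for the output of a transformation/dbt model
|     df = ge.from_pandas(pd.DataFrame({
|         "order_id": [1, 2, 3],
|         "amount": [10.0, 5.5, 7.25],
|     }))
|
|     df.expect_column_values_to_not_be_null("order_id")
|     result = df.expect_column_values_to_be_between(
|         "amount", min_value=0, max_value=1000)
|     print(result.success)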
| TrealTwan wrote:
| The article says that "SQL is taking over Python to transform and
| analyze data in the modern data stack". Are other people starting
| to notice this as ELT becomes more popular than traditional ETL?
|
| Haven't used Airflow before, but we use Azure Data Factory in my
| org to load the raw data into the data warehouse and then
| transform it into data models using SQL.
| CRConrad wrote:
| > Are other people starting to notice this as ELT becomes more
| popular than traditional ETL?
|
| I invented ELT.
|
| No, duh, of course not... _But:_ I and my then-colleagues,
| other DW consultants, started noticing some time in the early-
| to-mid-00s that what we were doing didn't actually follow the
| order of the "ETL" acronym, and occasionally commented to each
| other along the lines of "shouldn't this be called 'ELT' to be
| more descriptive?". So I don't quite get the _"starting to
| notice this as ELT becomes more popular"_ bit -- this feels
| quite old to me. Before most of the meteoric rise of Python.
| psychometry wrote:
| If your ETL code is written in Python, just use Prefect and thank
| me later: https://docs.prefect.io/
| yamrzou wrote:
| Could you elaborate?
| 0x500x79 wrote:
| Some people have noted that this is a very Airbyte-specific
| article, but I think that the lessons learned are still
| important.
|
| I have managed Airflow as a managed service for a company that
| has thousands of DAGs and one of our keys to success was
| splitting the compute and scheduling concepts into different
| components. We standardized on where our compute ran (Databricks,
| Spark, Lambdas, or K8s jobs) and had Airflow purely as a
| scheduler/orchestration tool.
|
| Scaling your Airflow worker nodes to handle big-data scale
| transformations/extractions is a pain. Especially attempting to
| support customers who want to run larger and larger jobs.
| Splitting these concepts allowed us to prevent noisy neighbor
| issues, kept Airflow as a component highly reliable for all of
| our customers, and removed the need for M * N operators.
| sails wrote:
| How best to achieve this? I'm considering Astronomer or AWS
| hosted airflow, with the benefit of AWS being that the compute
| components are easily accessible within the AWS ecosystem.
|
| Starting out with a smaller scale and lower commitment to
| Airflow, so I'd like highest reliability with least hassle.
|
| https://aws.amazon.com/blogs/aws/introducing-amazon-managed-...
| 0x500x79 wrote:
| It depends on what other tools you are currently using for
| ETL (or want to use). One example is that we used Spark so we
| would use the Spark submit operator to submit jobs to
| clusters. You can also use the K8s Pod operator if you want
| to utilize containers for your compute.
|
| There are a lot of options. We were adopters before AWS
| hosted airflow was a thing, so I don't have any experience
| running AWS hosted Airflow.
|
| I haven't looked recently to see if some of the challenges we
| faced early on are solved now, but most of them stemmed from
| how DAG updates were handled: changing the start date on a
| DAG would break you DAG forever until you go update the
| database by hand. Things like this are(/were?) super painful
| and could get worse with a managed solution.
| chrisjc wrote:
| We used Astronomer at my last job. Sure beats trying to run
| your own instance and the Astronomer team were fantastic.
| higeorge13 wrote:
| If you are on aws, then step functions could be your best
| option. They are responsible for the workflow definition, and
| the compute load is completely dependent on the underlying
| tasks and aws services you are going to use.
|
| Edit: typos
| pid-1 wrote:
| Amazon's Airflow offering sucks really hard. As with many of
| their services, they did an awful job of designing a good UX and
| hiding the product's complexity from their users.
|
| Astronomer is cool, but expensive and they won't accept
| monthly billing. Also, it's even more expensive if you need
| DAGs to access things within a VPC.
|
| Step Functions is decent for simple use cases and very
| reliable. For complex stuff, you will hate their JSON-based
| DSL with a passion.
| numlocked wrote:
| At Grove we make extensive use of KubernetesPodOperator[0].
| This allows us to both encapsulate the job in containers (k8s
| pods) and specify the resources we need for each DAG.
|
| [0] https://airflow.apache.org/docs/apache-airflow-providers-
| cnc...
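|
| Roughly like this (image, namespace and ids are placeholders;
| resource requests/limits can also be set per task, though the
| parameter shape varies by provider version):
|
|     from datetime import datetime
|
|     from airflow import DAG
|     from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
|         KubernetesPodOperator,
|     )
|
|     with DAG(
|         dag_id="pod_per_task",
|         start_date=datetime(2021, 10, 1),
|         schedule_interval="@daily",
|         catchup=False,
|     ) as dag:
|         KubernetesPodOperator(
|             task_id="transform_orders",
|             name="transform-orders",
|             namespace="data-jobs",
|             image="gcr.io/my-project/transform:latest",
|             cmds=["python", "transform.py"],
|             arguments=["--date", "{{ ds }}"],
|             get_logs=True,
|             is_delete_operator_pod=True,
|         )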
| pid-1 wrote:
| +1
|
| In my org we never use Airflow to compute anything. We only use
| a single operator: PythonOperator.
|
| All business and data access logic is encapsulated in REST APIs.
| Our DAGs are responsible for calling those APIs in the correct
| order, at the correct time, retrying when needed. I really
| dislike using Airflow for anything else, as it generally
| becomes a huge mess.
___________________________________________________________________
(page generated 2021-10-08 23:01 UTC)