https://airbyte.io/blog/airflow-etl-pipelines

ETL Pipelines with Airflow: the Good, the Bad and the Ugly

Data Engineering | October 8, 2021 | 8 minutes
Ari Bajo Rouvinen

Airflow is a popular open-source workflow management platform, and many data teams also use it for their ETL pipelines. For example, I've previously used Airflow transfer operators to replicate data between databases, data lakes and data warehouses, and Airflow operators to preprocess data for machine learning algorithms. But is using Airflow for your ETL pipelines a good practice today?

In this article, we review how to use Airflow ETL operators to transfer data from Postgres to BigQuery with both the ETL and ELT paradigms. Then, we share some challenges you may encounter when loading data incrementally with Airflow DAGs. Finally, we argue why Airflow ETL operators won't be able to cover the long tail of integrations for your business data. The suggested alternative is to keep using Airflow to schedule and monitor ETL/ELT pipelines, but to use other open-source projects that are better suited for the extract, transform and load steps: notably, Airbyte for the EL steps and dbt for the T step. The good news is that you can easily integrate Airflow with both Airbyte and dbt.

The Good

Airflow ETL Operators

Imagine you have your application data stored in a Postgres database, while the marketing, sales and product teams keep their data in third-party applications such as Google Ads, Salesforce and Segment. You want to centralize all your business data in a single data warehouse, such as Google BigQuery.

Your company is already using Airflow, so you start searching the built-in operators and provider packages for Airflow ETL operators to extract, transform and load your data. When you find no operator for a system, you may search for an Airflow hook to connect with it instead. For example, there is no operator to retrieve data from Segment, but you can use the SegmentHook to interact with Segment's API.

At first, you may be tempted to build an ETL pipeline where you extract your data from Postgres to file storage, transform the data locally with the PythonOperator and then load the transformed data into BigQuery. You end up creating the following Airflow ETL DAG.

PostgresToGCSOperator (E) -> GCSToLocalFilesystemOperator (E) -> PythonOperator (T) -> LocalFilesystemToGCSOperator (L) -> GCSToBigQueryOperator (L)
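To make this concrete, here is a minimal sketch of what such a DAG could look like, assuming the apache-airflow-providers-postgres and apache-airflow-providers-google packages are installed. All bucket, file, table and connection names are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_local import GCSToLocalFilesystemOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator


def transform():
    # Placeholder for the local transformation step (T): read the raw export,
    # clean it up and write the result for the next task to upload.
    with open("/tmp/users_raw.json") as src, open("/tmp/users_transformed.json", "w") as dst:
        dst.write(src.read())


with DAG(
    "postgres_to_bigquery_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PostgresToGCSOperator(  # E: export Postgres rows to a GCS staging bucket
        task_id="extract",
        postgres_conn_id="postgres_default",
        sql="SELECT * FROM users;",
        bucket="my-staging-bucket",
        filename="users/raw.json",
    )
    download = GCSToLocalFilesystemOperator(  # E: bring the export to the worker's disk
        task_id="download",
        bucket="my-staging-bucket",
        object_name="users/raw.json",
        filename="/tmp/users_raw.json",
    )
    transform_task = PythonOperator(task_id="transform", python_callable=transform)  # T
    upload = LocalFilesystemToGCSOperator(  # L: push the transformed file back to GCS
        task_id="upload",
        src="/tmp/users_transformed.json",
        dst="users/transformed.json",
        bucket="my-staging-bucket",
    )
    load = GCSToBigQueryOperator(  # L: load the staged file into BigQuery
        task_id="load",
        bucket="my-staging-bucket",
        source_objects=["users/transformed.json"],
        destination_project_dataset_table="analytics.users",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    extract >> download >> transform_task >> upload >> load
```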
But is this a good practice today? This follows the traditional ETL paradigm, where the transformation logic happens between the extract and load steps. With Airflow you can use operators to transform data locally (PythonOperator, BashOperator...), remotely (SparkSubmitOperator...) or in a data store (PostgresOperator, BigQueryInsertJobOperator...).

Over the last few years, many data teams have migrated their ETL pipelines to the ELT paradigm, where raw data is first loaded into the data warehouse and then transformed with SQL inside the warehouse. One of the main issues with ETL pipelines is that they transform data in transit, so they break more easily. Today's data warehouses, on the other hand, autoscale to handle almost any computation and volume needs. Modern data transformation tools like dbt have given SQL superpowers: with dbt you can use macros, for loops and more in your SQL files. On top of that, dbt handles dependencies between tables based on the references in your SQL code, whereas when you transform data with Airflow you need to duplicate the dependencies between tables both in your SQL files and in your DAG.

SQL is taking over Python to transform and analyze data in the modern data stack.

Airflow ELT Operators

You can use Airflow transfer operators together with database operators to build ELT pipelines. There is no Airflow operator to transfer data directly from Postgres to BigQuery, so you need to use staging storage in Google Cloud. For example, you can use the PostgresToGCSOperator followed by the GCSToBigQueryOperator, and then apply transformations in the destination with BigQuery operators.

PostgresToGCSOperator (E) -> GCSToBigQueryOperator (L) -> BigQueryInsertJobOperator (T)

Airflow provides a vast number of choices to move data from one system to another. The thing is, the more choices you have, the more chances you have to make mistakes. Airflow is not an opinionated ETL tool.

The Bad

As we have seen, doing a full data replication with Airflow is easy. But what happens when the schema of your data source changes and you try to rerun the full replication DAG? It will fail. To be able to rerun the DAG, you need to add an extra task that deletes the data in the destination before replicating it again. This also ensures that your DAGs are idempotent and can be rerun without side effects when they fail. Next, we will see how things get more complicated when you want to create a DAG that replicates data incrementally.

Loading Data Incrementally with Airflow

If you want to load data incrementally with Airflow, you first need to decide on the DAG frequency. Then you need to make sure that each DAG run only loads data that was created between the DAG execution date and the next execution date. For that, you have to modify the SQL used by the PostgresToGCSOperator to include the Airflow macros `{{ds}}` (the DAG execution date) and `{{next_ds}}` (the next execution date), whose values are injected by Airflow at runtime.
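A sketch of what such a templated extract task could look like follows; the table, column, bucket and connection names are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator

with DAG(
    "postgres_incremental_extract",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # backfilling from start_date creates one run per past interval
) as dag:
    # Airflow renders {{ ds }} and {{ next_ds }} at runtime, so each DAG run
    # only extracts the rows created inside its own schedule interval.
    extract_increment = PostgresToGCSOperator(
        task_id="extract_increment",
        postgres_conn_id="postgres_default",
        sql="""
            SELECT *
            FROM users
            WHERE created_at >= '{{ ds }}'
              AND created_at < '{{ next_ds }}';
        """,
        bucket="my-staging-bucket",
        filename="users/{{ ds }}.json",
    )
```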
Another issue you may run into is that you may need different DAGs for full and incremental replication if your DAG frequency is high. Imagine that the ideal frequency of your incremental replication is hourly and you have business data for the last five years. If you want to reload the data from the beginning, Airflow will trigger 43,800 (5 * 365 * 24) DAG runs, which can be slow because Airflow introduces a significant latency between consecutive DAG runs. To solve this problem, many data engineering teams end up creating monthly, daily and hourly DAGs with nearly the same code.

As we have seen, there is no convention for how to set up full and incremental replication for each transfer operator. On top of that, each transfer operator has a different interface, sync modes have different names, and the way types are mapped from source to destination varies from one operator to another. Ideally, an ETL tool would only ask you to select the source, the destination, the data replication mode and the database column(s) to watch for new data. Other open-source data integration tools like Airbyte expose a common set of options for all data transfers. For example, Airbyte supports four sync modes: full refresh overwrite, full refresh append, incremental append and incremental deduped history.

The Ugly

Sources and Destinations are Tightly Coupled in Airflow

Each cloud provider includes operators to transfer data for the most popular systems in its cloud. In the provider packages documentation you can find a list of operators and hooks that are released independently of the Airflow core. You can navigate all 60 transfer operators in a Google Sheet organized by source and destination, and below is a chord diagram of all possible data flows using Airflow transfer operators.

[Image: Visualizing all data flows using Airflow transfer operators. Credits to the Datasmith Chord Diagram Generator.]

Notice that the diagram doesn't show the direction of a data transfer. Only in some cases can you move data back and forth, as with the GCSToS3Operator and S3ToGCSOperator, and sometimes you can move data within the same system, as with the GCSToGCSOperator.

Now imagine that you want to replicate Google Ads data to Snowflake. You have to use the GoogleAdsToGcsOperator, followed by the GCSToS3Operator and the S3ToSnowflakeOperator.
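Chained together, that three-hop replication could look roughly like the sketch below. The GAQL query, bucket names, Snowflake table and stage, and connection IDs are all hypothetical; it assumes the Google, Amazon and Snowflake provider packages are installed and that a Snowflake external stage already points at the S3 bucket.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.ads.transfers.ads_to_gcs import GoogleAdsToGcsOperator
from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator

with DAG(
    "google_ads_to_snowflake",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ads_to_gcs = GoogleAdsToGcsOperator(  # hop 1: Google Ads -> GCS
        task_id="ads_to_gcs",
        client_ids=["1234567890"],
        query="SELECT campaign.id, metrics.clicks FROM campaign",
        attributes=["campaign.id", "metrics.clicks"],
        bucket="my-gcs-bucket",
        obj="google_ads/campaigns.csv",
    )
    gcs_to_s3 = GCSToS3Operator(  # hop 2: GCS -> S3
        task_id="gcs_to_s3",
        bucket="my-gcs-bucket",
        prefix="google_ads/",
        dest_s3_key="s3://my-s3-bucket/",
        replace=True,
    )
    s3_to_snowflake = S3ToSnowflakeOperator(  # hop 3: S3 -> Snowflake
        task_id="s3_to_snowflake",
        s3_keys=["google_ads/campaigns.csv"],
        table="GOOGLE_ADS_CAMPAIGNS",
        stage="MY_S3_STAGE",  # external stage pointing at my-s3-bucket
        file_format="(type = 'CSV', field_delimiter = ',', skip_header = 1)",
    )

    ads_to_gcs >> gcs_to_s3 >> s3_to_snowflake
```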
The main issue with Airflow transfer operators is that if you want to support transfers from N sources to M destinations, the community needs to code N x M Airflow operators. Airflow data transfer operators also focus on a limited number of databases, data warehouses and data lakes. But what happens when your company needs to synchronize data from business applications? Apart from the Salesforce, Google Ads and Facebook Ads operators, there are no Airflow operators to transfer data from business applications. Other data integration tools like Airbyte have decoupled sources from destinations with connectors, so the community only needs to code N + M connectors to support all possible data transfers. After only one year, Airbyte already supports hundreds of connectors and more than 1,000 possible data transfers, versus the 60 data transfers supported by Airflow.

Airflow transfer operators won't be able to cover the long tail of integrations.

The Alternative

ELT Pipelines with Airflow, Airbyte and dbt

Airflow, Airbyte and dbt are three open-source projects with different focuses but plenty of overlapping features. At their core, Airflow is a workflow management tool, Airbyte a data integration (EL) tool and dbt a transformation (T) tool. As we have seen, you can also use Airflow for ETL/ELT pipelines, and the overlap works the other way around too: Airbyte Open-Source and Airbyte Cloud provide a scheduler and integrate with dbt for the T step, and dbt Cloud also comes with a scheduler.

[Image: Airbyte Connection UI to replicate data from Postgres to BigQuery and apply custom dbt transformations.]

Where to Schedule ETL/ELT Pipelines?

If you are a data company with many different kinds of pipelines, you can use Airflow as the scheduler for all of your tasks, including ETL/ELT pipelines. You can integrate Airflow with Airbyte for the EL steps through the AirbyteTriggerSyncOperator. This way, you can trigger incremental syncs from the Airflow scheduler and still run full refresh syncs from the Airbyte UI, without the latency you would have with an Airflow backfill job.

You can also integrate Airbyte with dbt to run transformations for each connection that loads data to a destination. Keep in mind that you should limit Airbyte's usage of dbt to simple data normalization of each loaded table. If you need more complex transformation logic that combines data from different dbt models, you can integrate Airflow with dbt instead. Unfortunately, the community has not yet agreed on the best way to integrate Airflow and dbt: some use the BashOperator, others the airflow-dbt integration package contributed by the GoCardless team, and a few others the dbt-cloud-plugin. The engineering team at Updater has shared their work recreating the dbt graph in Airflow by parsing the dbt manifest.json file.

[Image: Airflow DAG parsed from the dbt manifest.json file. Credits to the Updater and Astronomer.io teams.]
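Putting the pieces together, a minimal sketch of this division of labor could look like the DAG below: Airbyte handles the EL steps through the AirbyteTriggerSyncOperator, and dbt is triggered with a plain BashOperator. The Airbyte connection ID and the dbt project path are hypothetical, and it assumes the apache-airflow-providers-airbyte package is installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    "elt_postgres_to_bigquery",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract_load = AirbyteTriggerSyncOperator(  # EL: trigger an existing Airbyte connection
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",
        connection_id="1e3b5a72-7bfd-4808-a13c-204505490110",  # hypothetical Airbyte connection UUID
        asynchronous=False,
        timeout=3600,
        wait_seconds=3,
    )
    transform = BashOperator(  # T: run the dbt project against the warehouse
        task_id="dbt_run",
        bash_command="cd /opt/dbt/my_project && dbt run",
    )

    extract_load >> transform
```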
Conclusion

Airflow shines as a workflow orchestrator. Because it is so widely adopted, many data teams also use Airflow transfer and transformation operators to schedule and author their ETL and ELT pipelines. We have seen some of the challenges of building full refresh and incremental data replication DAGs with Airflow. More troublesome is that sources and destinations are tightly coupled in Airflow transfer operators, which makes it hard for Airflow to cover the long tail of integrations for your business applications.

One alternative is to keep using Airflow as a scheduler and integrate it with two other open-source projects that are better suited for ELT pipelines: Airbyte for the EL parts and dbt for the T part. Airbyte sources are decoupled from destinations, so you can already sync data from 100+ sources (databases, APIs, ...) to 10+ destinations (databases, data warehouses, data lakes, ...) and remove the boilerplate code needed with Airflow. With dbt you can transform data with SQL in your data warehouse and avoid handling dependencies between tables in your Airflow DAGs.

If you would like to get involved, we hope you'll join Airbyte's Slack community, the most active community around data integration!