Building Data Applications

Let’s explore popular open-source libraries to build data applications and pipelines.

alpha2phi · 5 min read · Dec 26, 2021


Photo by Luke Chesser on Unsplash

Let’s explore open-source libraries that we can use to build our ELT (Extract, Load, Transform) or ETL (Extract, Transform, Load) pipelines.

Airflow

Let’s start with Apache Airflow. With more than 24k GitHub stars, Airflow is the most mature and most widely used open-source library for building data pipelines and workflows.


Airflow provides an end-to-end solution to programmatically author, schedule, and monitor your workflows.

Using Airflow, you author a workflow as a DAG (Directed Acyclic Graph), a collection of tasks with directional dependencies (a minimal sketch follows the list below).

  • A DAG also has a schedule, a start date, and optionally an end date.
  • For each schedule (say, daily or hourly), the DAG runs each individual task once its dependencies are met. Certain tasks depend on their own past, meaning they can’t run until their previous scheduled run (and its upstream tasks) has completed.
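
Here is a minimal sketch of such a DAG in Python; the dag_id, task names, and bash commands are illustrative placeholders, not from the article.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Three tasks with directional dependencies, scheduled to run daily.
with DAG(
    dag_id="example_etl",              # hypothetical DAG name
    start_date=datetime(2021, 12, 1),
    schedule_interval="@daily",
    catchup=False,                     # don't backfill past schedules
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # load runs only after transform, which runs only after extract
    extract >> transform >> load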

Read the project focus in the documentation to understand the right use cases for Airflow: it works best with workflows that are mostly static and slowly changing.

dbt (Data Build Tool)

Rather than using a DAG and Python to programmatically author a workflow, dbt takes a different approach.

With its SQL-based approach, you transform your data simply by writing select statements; dbt handles turning those statements into tables and views in your data warehouse.
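
Concretely, a dbt model is just a SQL file containing a select statement, and dbt creates a corresponding table or view from it. A minimal sketch (the file, schema, and column names here are hypothetical):

-- models/customer_orders.sql (hypothetical model)
-- dbt turns this select statement into a table or view in the warehouse
select
    customer_id,
    count(order_id) as order_count
from analytics.orders
group by customer_id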

Using it together with Jinja, you can turn your dbt project into a programming environment for SQL, giving you the ability to do things that aren’t normally possible in SQL.

Here is the start of an example taken from the dbt documentation:

{% set payment_methods =…
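
The snippet above is cut off; here is a sketch of how such a model might continue, modeled on the payment_methods example in the dbt documentation (the payments table and its columns are assumptions):

{% set payment_methods = ["bank_transfer", "credit_card", "gift_card"] %}

select
    order_id,
    {% for payment_method in payment_methods %}
    -- the loop expands into one aggregated column per payment method
    sum(case when payment_method = '{{ payment_method }}' then amount end)
        as {{ payment_method }}_amount,
    {% endfor %}
    sum(amount) as total_amount
from {{ ref('payments') }}
group by 1

At compile time, dbt expands the for-loop into plain SQL with one column per payment method, the kind of repetition SQL alone can’t generate.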

