Building Data Applications

Let’s explore popular open-source libraries to build data applications and pipelines.

5 min read · Dec 26, 2021


Photo by Luke Chesser on Unsplash

In this post, we’ll look at open-source libraries we can use to build our ELT (Extract, Load, Transform) or ETL (Extract, Transform, Load) pipelines.


Let’s start with Apache Airflow. With more than 24k GitHub stars, Airflow is arguably the most mature and widely used open-source library for building data pipelines and workflows.

Apache Airflow

Airflow provides an end-to-end solution to programmatically author, schedule, and monitor your workflows.

Using Airflow, a workflow is authored as a DAG (Directed Acyclic Graph), which is a collection of tasks with directional dependencies.

  • A DAG also has a schedule, a start date, and an end date.
  • For each schedule interval (say, daily or hourly), the DAG runs
    each individual task as soon as its dependencies are met. Certain tasks
    also depend on their own past, meaning they can’t run
    until their previous schedule (and its upstream tasks) has completed.
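The core scheduling idea above — run a task only once all of its upstream dependencies have finished — can be sketched in a few lines of plain Python. This is an illustrative sketch with made-up task names, not Airflow’s actual scheduler:

```python
# Minimal sketch of DAG-style scheduling: a task becomes runnable only
# after all of its upstream dependencies have completed.
# (Illustrative only -- Airflow's real scheduler is far more involved.)

def run_dag(dependencies):
    """dependencies maps each task to the set of upstream tasks it waits on."""
    completed, order = set(), []
    pending = set(dependencies)
    while pending:
        # A task is runnable when every upstream dependency has completed.
        runnable = {t for t in pending if dependencies[t] <= completed}
        if not runnable:
            raise ValueError("cycle detected: not a valid DAG")
        for task in sorted(runnable):  # deterministic order for the demo
            order.append(task)
            completed.add(task)
        pending -= runnable
    return order

# Hypothetical ETL tasks: load waits on transform, which waits on extract.
dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
print(run_dag(dag))  # ['extract', 'transform', 'load']
```

In Airflow itself you would express the same dependencies declaratively (e.g. `extract >> transform >> load`) and let the scheduler handle execution.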

Read the project’s stated focus to better understand the right use cases for Airflow. It works best with workflows that are mostly static and slowly changing.

dbt (Data Build Tool)

Rather than using a DAG and Python to programmatically author a workflow, dbt takes a different approach.

Taking a SQL-based approach, dbt lets you transform your data simply by writing select statements, while dbt handles turning those statements into tables and views in your data warehouse.

Combined with Jinja, dbt turns your project into a programming environment for SQL, giving you the ability to do things that aren’t normally possible in SQL alone.

Here is an example adapted from the dbt documentation, where a Jinja set and for loop generate one aggregated column per payment method:

{% set payment_methods = ["bank_transfer", "credit_card", "gift_card"] %}

select
    order_id,
    {% for payment_method in payment_methods %}
    sum(case when payment_method = '{{ payment_method }}' then amount end) as {{ payment_method }}_amount,
    {% endfor %}
    sum(amount) as total_amount
from app_data.payments
group by 1




Software engineer, Data Science and ML practitioner.