Building Data Applications

Let’s explore popular open-source libraries to build data applications and pipelines.

alpha2phi
5 min read · Dec 26, 2021
Photo by Luke Chesser on Unsplash

Let’s explore open-source libraries that we can use to build our ELT (Extract, Load, Transform) or ETL (Extract, Transform, Load) pipelines.

Airflow

Let’s start with Apache Airflow. With more than 24k GitHub stars, Airflow is one of the most mature and widely used open-source libraries for building data pipelines and workflows.

Airflow provides an end-to-end solution to programmatically author, schedule, and monitor your workflows.

Using Airflow, a workflow is authored as a DAG (Directed Acyclic Graph), which is a collection of tasks with directional dependencies.

  • A DAG also has a schedule, a start date, and an optional end
    date.
  • For each scheduled run (say, daily or hourly), the DAG runs each
    task once its dependencies are met. Certain tasks can be set to depend
    on their own past, meaning that they can’t run until their previous
    scheduled run (and its upstream tasks) has completed.
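
To make the two rules above concrete, here is a minimal sketch in plain Python. It does not use Airflow's API: the ToyDag class, its field names, and the extract/load/transform task names are all hypothetical, chosen only to illustrate how upstream dependencies and "depends on its own past" interact across daily runs.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class ToyDag:
    """Hypothetical stand-in for a DAG (NOT Airflow's API)."""

    start_date: date
    end_date: date
    # Maps each task name to the set of upstream tasks it waits for.
    upstream: dict[str, set[str]]
    # Tasks that also wait for their own previous daily run.
    depends_on_past: set[str] = field(default_factory=set)
    # (task, run_date) pairs that have completed.
    done: set[tuple[str, date]] = field(default_factory=set)

    def run_dates(self):
        """Yield one run date per day, from start_date to end_date inclusive."""
        day = self.start_date
        while day <= self.end_date:
            yield day
            day += timedelta(days=1)

    def can_run(self, task, run_date):
        """A task may run once its upstream tasks (and, if required, its own past run) are done."""
        upstream_ok = all((u, run_date) in self.done for u in self.upstream[task])
        if task in self.depends_on_past and run_date > self.start_date:
            previous = run_date - timedelta(days=1)
            return upstream_ok and (task, previous) in self.done
        return upstream_ok

    def run_all(self):
        """Visit tasks in dependency order for every run date and record what ran."""
        order = list(TopologicalSorter(self.upstream).static_order())
        log = []
        for run_date in self.run_dates():
            for task in order:
                if self.can_run(task, run_date):
                    self.done.add((task, run_date))
                    log.append((run_date, task))
        return log


# A three-step ELT chain scheduled daily for two days.
dag = ToyDag(
    start_date=date(2021, 12, 1),
    end_date=date(2021, 12, 2),
    upstream={"extract": set(), "load": {"extract"}, "transform": {"load"}},
    depends_on_past={"load"},
)
log = dag.run_all()
for run_date, task in log:
    print(run_date, task)
```

In this toy model every task succeeds immediately, so each day's tasks simply run in dependency order. A real scheduler also has to handle failures, retries, and parallel execution, which is where the "depends on its own past" rule starts to matter.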

alpha2phi

Software engineer, Data Science and ML practitioner.