Building Data Applications
Let’s explore popular open-source libraries that we can use to build data applications and ELT (Extract, Load, Transform) or ETL (Extract, Transform, Load) pipelines.
Airflow
Let’s start with Apache Airflow. With more than 24k GitHub stars, Airflow is one of the most mature and widely used open-source libraries for building data pipelines and workflows.
Airflow provides an end-to-end solution to programmatically author, schedule, and monitor your workflows.
Using Airflow, a workflow is authored as a DAG (Directed Acyclic Graph), which is a collection of tasks with directional dependencies.
- A DAG also has a schedule, a start date, and an optional end date.
- For each schedule (say daily or hourly), the DAG needs to run each individual task as its dependencies are met. Certain tasks have the property of depending on their own past, meaning that they can’t run until their previous scheduled run (and upstream tasks) has completed.
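To make the idea of "running each task as its dependencies are met" concrete, here is a minimal sketch in plain Python. It does not use Airflow's API; it relies only on the standard library's `graphlib` module (Python 3.9+). The task names and the `run_order` helper are hypothetical, chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of upstream tasks
# it depends on. This is an illustration, not Airflow's own API.
dependencies = {
    "extract": set(),
    "transform_orders": {"extract"},
    "transform_users": {"extract"},
    "load": {"transform_orders", "transform_users"},
}


def run_order(deps):
    """Group tasks into batches whose upstream tasks are all complete."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        # Tasks whose dependencies have all been marked done.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        # Mark this batch as finished so downstream tasks become ready.
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(run_order(dependencies), start=1):
        print(f"step {step}: {', '.join(batch)}")
```

In this sketch, the two transform tasks become ready together once `extract` finishes, and `load` waits for both. A scheduler such as Airflow applies the same principle, and adds scheduling, retries, and monitoring on top.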