Member-only story
Python — Generic Data Ingestion Framework
For any projects, be it software or data science related projects, there is always a need to load data from files into databases, which can be relational databases or big data storages like Hive or Impala.
Flat files can come in different formats, e.g. command separated (CSV), tab delimited or in binary format like Excel. Also there will be different types of data you need to load, e.g. account, customer, sales, etc and each one of them will have different fields/columns and data types that you need to handle differently during data loading.
In this article I will show you a simple way to have a generic data loading framework by using sqlalchemy and pandas. I am using Excel and Postgresql as examples but it can be easily extended to other databases like MySQL, Hive/Impala, etc as long as the file types are supported by pandas and the databases are supported by sqlalchemy.
Docker Setup
Firstly, let’s set up postgresql database with a table for ingestion purpose.
version: '3'
services:
pgsql:
image: "postgres" # use latest official postgres version
env_file:
- db.env # configure postgres
volumes:
- database-data:/var/lib/postgresql/data/ # persist data even if container shuts down
- ./sql/init-pgsql.sql:/docker-entrypoint-initdb.d/init.sql
ports:
- "5432:5432"
volumes:
database-data: # named volumes can be managed easier using docker-compose
Credentials for this postgresql database instance is in the .env environment file.
POSTGRES_USER=user1
POSTGRES_PASSWORD=userpwd
POSTGRES_DB=testdb
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
When the docker container runs for the first time, it also runs the init.sql to create the accounts table.
CREATE TABLE accounts (
user_id int PRIMARY KEY,
username VARCHAR ( 50 ) UNIQUE NOT NULL,
password VARCHAR ( 50 ) NOT NULL,
email VARCHAR ( 255 ) UNIQUE NOT NULL,
created_on TIMESTAMP NOT NULL,
last_login TIMESTAMP
);
This table contains different data types like integer, varchar and timestamp. Later we will see how to handle different data types as part of data loading using sqlalchemy.