Data Version Control for Machine Learning

Explore Git Large File Storage (LFS) and DVC to manage large datasets.

alpha2phi
5 min read · Jul 30, 2024
Photo by Pietro Jeng on Unsplash

Getting Started

Whether we work as software engineers, data engineers, data analysts, or data scientists, we now routinely deal with large files such as images, audio, video, and text. Machine learning complicates this further, because we need reproducible workflows and results for a given version of the code and dataset.

This article explores Git Large File Storage (LFS) and delves into Data Version Control (DVC).

GitHub Limitations

By default, GitHub limits the size of files we can track in regular Git repositories.

If we attempt to add or update a file larger than 50 MiB, we will receive a warning from Git. The push will still succeed, but we should consider removing the commit to minimize the performance impact.

GitHub blocks files larger than 100 MiB.
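One way to catch such files before committing is to scan the working tree for anything above the 50 MiB warning threshold. The sketch below is an illustration rather than a GitHub tool: the function name find_large_files and its two arguments are hypothetical, and it relies only on the standard find utility.

```shell
#!/bin/sh
# List files under a directory that exceed a size limit in bytes.
# Defaults: current directory, 50 MiB (50 * 1024 * 1024 = 52428800 bytes).
# The function name and arguments are hypothetical, chosen for illustration.
find_large_files() {
  dir=${1:-.}
  limit_bytes=${2:-52428800}
  # Skip Git's own object store, then print regular files above the limit.
  find "$dir" -name .git -prune -o -type f -size +"${limit_bytes}c" -print
}

# Scan the current directory with the default 50 MiB threshold.
find_large_files .
```

Passing 104857600 as the second argument checks against the 100 MiB hard limit instead.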

GitHub recommends that repositories remain small: ideally less than 1 GB, and staying under 5 GB is strongly recommended.

To remove a large file added in the most recent unpushed commit, we can use the following commands.
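A minimal sketch of that workflow, run in a throwaway repository so it is safe to try. The file name big_dataset.bin, the other file names, and the demo identity are hypothetical placeholders. In a real repository only the two commands under "The fix" are needed: git rm --cached stops tracking the file while leaving it on disk, and git commit --amend -C HEAD rewrites the last commit with the same message.

```shell
#!/bin/sh
set -eu

# Scratch repository so the demo never touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# First commit: ordinary content we want to keep.
echo "hello" > README.md
git add README.md
git commit -q -m "Add README"

# Second commit: accidentally includes a file standing in for a large one.
dd if=/dev/zero of=big_dataset.bin bs=1024 count=1024 2>/dev/null
echo "notes" > notes.txt
git add big_dataset.bin notes.txt
git commit -q -m "Add notes"

# The fix: untrack the file (it stays on disk), then amend the last commit.
git rm --cached -q big_dataset.bin
git commit -q --amend -C HEAD

# The amended commit no longer tracks big_dataset.bin.
git ls-files
```

Because the commit has not been pushed, amending it rewrites only local history.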
