Data Version Control for Machine Learning

Explore Git Large File Storage (LFS) and DVC to manage large datasets.

Photo by Pietro Jeng on Unsplash

Getting Started

Whether we work as software engineers, data engineers, data analysts, or data scientists, we increasingly have to deal with large files such as images, audio, video, and text. Machine learning makes this more complicated, because we also need reproducible workflows and results for a given version of the code and dataset.

This article explores Git Large File Storage (LFS) and delves into Data Version Control (DVC).

GitHub Limitations

By default, GitHub limits the size of files we can track in regular Git repositories.

If we attempt to add or update a file larger than 50 MiB, we will receive a warning from Git. The changes will still successfully get pushed to the repository, but we should consider removing the commit to minimize performance impact.

GitHub blocks files larger than 100 MiB.

GitHub recommends that repositories remain small, ideally less than 1 GB, and strongly recommends keeping them under 5 GB.
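
Before committing, it can help to check which files in the working tree would cross these thresholds. The following is a minimal sketch, assuming a find that accepts the M (MiB) size suffix, as GNU and BSD find both do. The function name list_large_files is ours for illustration and is not part of Git or GitHub.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list files larger than a given size in MiB.
# Defaults to the current directory and the 50 MiB warning threshold.
list_large_files() {
  local dir="${1:-.}"
  local min_mib="${2:-50}"
  # Skip Git's own object store; report only files in the working tree.
  find "$dir" -type f -not -path '*/.git/*' -size +"${min_mib}"M
}

# Usage (not run here):
#   list_large_files . 50    # files that would trigger the warning
#   list_large_files . 100   # files that GitHub would block
```

Running it with 100 as the second argument shows the files that must be removed or moved to Git LFS before pushing.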

To remove a large file added in the most recent unpushed commit, we can use the following commands.

$ git rm --cached GIANT_FILE
$ git commit --amend -CHEAD
$ git push

To remove a file from an earlier commit, we can use the git-filter-repo tool, which rewrites the repository history.
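
As a rough sketch of that step, the wrapper below calls git filter-repo with --invert-paths so that every commit is rewritten without the given path. It assumes git-filter-repo is installed separately and on the PATH. The function name remove_from_history is hypothetical, and GIANT_FILE is the same placeholder used above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: drop one path from every commit in the history.
remove_from_history() {
  local path="$1"
  if [ -z "$path" ]; then
    echo "usage: remove_from_history PATH" >&2
    return 2
  fi
  if ! command -v git-filter-repo >/dev/null 2>&1; then
    echo "git-filter-repo is not installed" >&2
    return 1
  fi
  # --invert-paths keeps everything EXCEPT the paths listed with --path.
  git filter-repo --invert-paths --path "$path"
}

# Usage (not run here), from inside a fresh clone:
#   remove_from_history GIANT_FILE
```

Because this rewrites history, git-filter-repo expects a fresh clone by default and removes the origin remote. Afterwards we need to add the remote again and force-push, so collaborators should be told first.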

Git Large File Storage (LFS)

To track files beyond the 100 MiB limit, we use Git Large File Storage (LFS).

Git LFS handles large files by storing references to the file in the repository, but not the actual file itself.
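
To make the idea of a reference concrete, the sketch below prints text in the shape of a Git LFS pointer file: a spec version line, the SHA-256 of the content, and its size in bytes. It is for illustration only, since Git LFS generates pointers itself. It assumes sha256sum from GNU coreutils is available, and the function name make_lfs_pointer is ours.

```shell
#!/usr/bin/env bash
# Illustration only: print pointer-style text for a file.
# Git commits this small text in place of the large file's content.
make_lfs_pointer() {
  local file="$1"
  local oid size
  # SHA-256 of the content identifies the object in LFS storage.
  oid=$(sha256sum < "$file" | cut -d' ' -f1)
  # Size in bytes; tr strips the padding some wc implementations add.
  size=$(wc -c < "$file" | tr -d ' ')
  printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}

# Usage (not run here):
#   make_lfs_pointer dataset/some_large_file
```

The pointer is only a few lines long regardless of the file size, which is why the repository itself stays small.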

The maximum file size for Git LFS depends on our GitHub plan. For the Free and Pro plans, the limit is 2 GB.

Let’s try out an example application. For this application, we have a dataset larger than 100 MiB under the dataset folder.

git-lfs

We install git-lfs, an open-source Git extension for versioning large files.
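
A minimal sketch of the setup could look like the following. It assumes git-lfs is already installed through a package manager (for example brew install git-lfs or sudo apt-get install git-lfs). The function name track_with_lfs and the dataset/** pattern are our assumptions for this example, not something Git LFS prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: enable Git LFS in the current repository and
# track every file matching a pattern (default: the dataset folder).
track_with_lfs() {
  local pattern="${1:-dataset/**}"
  # --local writes the LFS filter settings to this repository only.
  git lfs install --local &&
    # Adds "<pattern> filter=lfs diff=lfs merge=lfs -text" to .gitattributes.
    git lfs track "$pattern" &&
    # Commit .gitattributes so collaborators use the same tracking rules.
    git add .gitattributes
}

# Usage (not run here), from the repository root:
#   track_with_lfs "dataset/**"
#   git add dataset
#   git commit -m "Track dataset with Git LFS"
#   git push
```

Tracking must be set up before the large files are added. Files that are already committed as regular Git objects stay in the history until it is rewritten.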

Written by alpha2phi

Software engineer, Data Science and ML practitioner.