Data Version Control for Machine Learning
Explore Git Large File Storage (LFS) and DVC to manage large datasets.
Getting Started
Whether we work as software engineers, data engineers, data analysts, or data scientists, we nowadays have to deal with large files such as images, audio, video, and text. Machine learning makes this more complicated: to reproduce a workflow and its results, we need the exact version of both the code and the dataset that produced them.
This article explores Git Large File Storage (LFS) and delves into Data Version Control (DVC).
GitHub Limitations
By default, GitHub limits the size of files allowed in regular Git repositories.
If we attempt to add or update a file larger than 50 MiB, we will receive a warning from Git. The push will still succeed, but we should consider removing the commit to minimize the performance impact.
GitHub blocks files larger than 100 MiB.
GitHub recommends keeping repositories small: ideally under 1 GB, and staying under 5 GB is strongly recommended.
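Before pushing, we can check whether a repository's history already contains blobs near these limits. A minimal sketch, assuming a POSIX shell with git, awk, sort, and head available:

```bash
# List the ten largest blobs anywhere in the repository's history,
# with sizes converted to MiB. A quick sanity check before pushing.
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '/^blob/ {printf "%.1f MiB  %s\n", $3 / 1048576, $4}' \
  | sort -rn \
  | head -n 10
```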
To remove a large file added in the most recent unpushed commit, we can use the following commands.
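A minimal sketch, where GIANT_FILE is a placeholder for the path of the oversized file:

```bash
# Remove the file from the index, but keep it on disk
git rm --cached GIANT_FILE

# Rewrite the most recent commit without the file, reusing its message
git commit --amend -C HEAD

# Push the rewritten, smaller commit
git push
```

Note that --amend rewrites history, so this approach is only safe for a commit that has not been pushed yet; a large file buried deeper in the history requires rewriting every commit since it was added.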