Hiding your large datasets from Github

Brad Johnson
3 min read · Oct 15, 2019


A quick & dirty hack for working with large datasets on a local repo

Say you have a git repo on Github: my_datascience

You keep it in sync with your local copy:

> git clone git@github.com:user/my_datascience.git

… do some work …

> git add my_new_file
> git add my_script.py
> git commit -m 'made some changes'
> git push
> git pull [origin master]

All of this works!

But say you want to work with a giant datafile that exceeds Github’s file size limits (“GitHub will warn you when pushing files larger than 50 MB. You will not be allowed to push files larger than 100 MB.”), e.g. giant_data.csv.

You can add the file to your directory and never run git add giant_data.csv, but git status will constantly bleat at you and god forbid you happen to git add .!

You can also store your file in a directory outside of your repo, and you’ll never have to worry about it getting pushed to Github. But what if you want to locally version-control your data file alongside your work?

One easy workaround is to store your giant data in a branch you don’t set a remote for. Let’s say you have downloaded giant_data.csv into your ~/Downloads folder (which is outside of your repo).

> git checkout -b bigdata
> mv ~/Downloads/giant_data.csv ./
> git add giant_data.csv
> git commit -m 'added giant data to bigdata branch'

Now you can work with the giant_data.csv file while in the bigdata branch without worrying about it reaching Github or interfering with a pull, as long as you don’t push that branch.
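
If you want to reassure yourself that bigdata has nowhere to go, git branch -vv lists each local branch with its upstream (if any). A quick check, assuming your remote has the usual name origin:

> git branch -vv

master should show something like [origin/master] next to it, while bigdata shows no upstream at all. Pushing explicitly with git push origin master is another way to guarantee only master ever leaves your machine.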

Say you’ve made some changes to my_script.py (while on bigdata) that you want to push up to Github. No problem! Just commit them on bigdata, go back to the master branch, and check out the file from the bigdata branch.

> git branch 
* bigdata
master
> git add my_script.py
> git commit -m 'my script works with giant data'
> git checkout master
> git checkout bigdata my_script.py
> git add my_script.py
> git commit -m 'adding changes from bigdata branch'
> git push
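
As a sanity check (this command is just a suggestion, not part of the workflow above), you can list the commits that exist on bigdata but not on master — your giant data commits should only ever show up here:

> git log --oneline master..bigdata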

There are additional safeguards you can put in place to prevent your bigdata branch from ever being pushed by accident.
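
One is a pre-push hook: a small script git runs before every push, aborting the push if the script exits non-zero. Here’s a minimal sketch, assuming the branch is literally named bigdata — save it as .git/hooks/pre-push and make it executable:

#!/bin/sh
# git feeds pre-push one line per ref being pushed:
# <local ref> <local sha> <remote ref> <remote sha>
while read local_ref local_sha remote_ref remote_sha
do
  if [ "$remote_ref" = "refs/heads/bigdata" ]; then
    echo "pre-push: refusing to push the bigdata branch" >&2
    exit 1
  fi
done
exit 0

> chmod +x .git/hooks/pre-push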

Note: this is not actually how you should do this; you’d want a lot more features and functionality if you were to do this with any regularity. And you still have no way to share your big data file via Github.

Just as importantly, git diffs files line by line, which works well for text files but terribly for binary files. If your giant file were giant_data.zip, git would be quite unhappy with you if you expected it to keep track of changes within the file.
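
For what it’s worth, that unhappiness looks something like this — output abridged, but the last line is all the insight git can offer into a changed zip, while each new version still piles up in the repository at close to full size:

> git diff giant_data.zip
Binary files a/giant_data.zip and b/giant_data.zip differ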

So you should really be using Git LFS or git-annex, which handle large files robustly (git-annex even does this sort of branch hiding for you).

This exercise should simply give you a little insight into the process git-annex uses under the hood. (Git LFS does not generate its own tracking branch.) In addition to taking advantage of the power of git branches, git-annex and Git LFS generate and track file hashes as pointers to the large files, use git’s smudge/clean filters, and much more.
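
If you go the Git LFS route, the happy path is short. A sketch, assuming git-lfs is installed and your remote supports LFS (Github does, with its own storage quotas):

> git lfs install
> git lfs track "giant_data.csv"
> git add .gitattributes giant_data.csv
> git commit -m 'track giant data with Git LFS'
> git push

After git lfs track, git keeps only a small pointer file for giant_data.csv in the repo itself; the real bytes are shipped to LFS storage when you push.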
