Hiding your large datasets from GitHub
A quick & dirty hack for working with large datasets on a local repo
Say you have a git repo on GitHub, my_datascience, and you keep it in sync with your local copy:
> git clone git@github.com:user/my_datascience.git
… do some work …
> git add my_new_file
> git add my_script.py
> git commit -m 'made some changes'
> git push
> git pull [origin master]
All of this works!
But say you want to work with a giant datafile that exceeds GitHub’s storage limits (“GitHub will warn you when pushing files larger than 50 MB. You will not be allowed to push files larger than 100 MB.”), e.g. giant_data.csv.
You can add the file to your directory and never run git add giant_data.csv, but git status will constantly bleat at you, and god forbid you happen to run git add . one day!
You can also store your file in a directory outside of your repo, and you’ll never have to worry about it getting pushed to GitHub. But what if you want to locally version-control your data file alongside your work?
One easy workaround is to store your giant data in a branch that you never push and never set an upstream for. Let’s say you have downloaded giant_data.csv into your ~/Downloads folder (which is outside of your repo).
> git checkout -b bigdata
> mv ~/Downloads/giant_data.csv ./
> git add giant_data.csv
> git commit -m 'added giant data to bigdata branch'
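To double-check that the big file lives only on the new branch, you can ask git which commits bigdata has that master doesn’t (the hash below is just illustrative):
> git log --oneline master..bigdata
a1b2c3d added giant data to bigdata branch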
Now you can work with the giant_data.csv file while in the bigdata branch without worrying about it reaching GitHub or interfering with a pull, as long as you never push that branch.
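You can also confirm that bigdata has no upstream: in git branch -vv output, only branches that track a remote show one in brackets (hashes here are illustrative):
> git branch -vv
* bigdata a1b2c3d added giant data to bigdata branch
  master  e4f5a6b [origin/master] made some changes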
Say you’ve made some changes to my_script.py that you want to push up to GitHub. No problem! Just go back to the master branch and checkout the file from the bigdata branch.
> git branch
* bigdata
  master
> git add my_script.py
> git commit -m 'my script works with giant data'
> git checkout master
> git checkout bigdata my_script.py
> git add my_script.py
> git commit -m 'adding changes from bigdata branch'
> git push
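This works in the other direction, too: when master moves ahead, you can fold its new commits into bigdata so your data branch never falls behind. A minimal sketch:
> git checkout bigdata
> git merge master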
There are additional things you can do to prevent your bigdata branch from being pushed.
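For example, git runs .git/hooks/pre-push before every push, feeding it one line per ref on stdin, so a small hook can reject the branch outright. A sketch (make it executable with chmod +x .git/hooks/pre-push):
#!/bin/sh
# .git/hooks/pre-push: git passes one line per ref being pushed,
# formatted as: <local ref> <local sha> <remote ref> <remote sha>
while read local_ref local_sha remote_ref remote_sha
do
    if [ "$local_ref" = "refs/heads/bigdata" ]; then
        echo "pre-push hook: refusing to push the bigdata branch" >&2
        exit 1
    fi
done
exit 0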
Note: this is not actually how you should do this; you’d want a lot more features and functionality if you were to do this with any regularity. And you still have no way to share your big data file via GitHub.
Just as importantly, git works on line-by-line diffs, which works well for text files but terribly for binary files. If your giant file were giant_data.zip, git would be quite unhappy with you if you expected it to keep track of changes in the file.
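Try it and git will only shrug: for a binary file, git diff can tell you that the contents changed, but not what changed (illustrative output):
> git diff giant_data.zip
diff --git a/giant_data.zip b/giant_data.zip
index 1a2b3c4..5d6e7f8 100644
Binary files a/giant_data.zip and b/giant_data.zip differ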
So you should really be using Git LFS or git-annex, which solve this problem robustly. This exercise should simply give you some understanding of a little bit of the process git-annex uses under the hood. (Git LFS does not generate its own tracking branch.) In addition to taking advantage of the power of git branches, git-annex and Git LFS generate and track file hashes as pointers to the large files, use git’s smudge/clean filters, and much more.
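For comparison, the Git LFS version of this whole dance is only a few commands. Assuming git-lfs is installed and your remote supports it, something like:
> git lfs install
> git lfs track "giant_data.csv"
> git add .gitattributes giant_data.csv
> git commit -m 'track giant data with LFS'
> git push
After that, the repo itself only carries a small pointer file, and the real bytes are shipped to the LFS server.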