looking into git-annex
Tags: [getting things right]
Published: 28 Oct 2015 22:05

I care a lot about not losing files, enough to think about a checklist to follow to make sure that files are preserved.

Recently I lost some files due to an errant dd. The checklist was missing an item: make sure that your backup system is incremental and automatic.

My current method of manually rsyncing files and directories from content creators to content stores is not good enough. I’ve now lost files because the process is manual and faulty.

On top of this, file disorganization is still a major issue. I’m only vaguely aware of where things are, and having several versions of the same files and directories from different points in time is confusing.

enter git-annex

I’ve been looking into git-annex for a while now as one method of organizing my files:

git-annex allows managing files with git, without checking the file contents into git.

For me, using git-annex is better than manually rsyncing files and directories around because it tracks where the files are (see the page on location tracking). If I had used git-annex for my directories from the start, I would not have the problem I do now of different versions of directories through time. I could simply have each repository target the same remote, and merge the changes from each into one. I could then feel confident in deleting what I know are duplicate directories because the changes have been preserved.
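As a rough illustration of what "confident in deleting" means when done by hand, here is a minimal Python sketch that checks whether every file in an old copy of a directory still exists, by content, somewhere in a newer copy. It does not use git-annex and is not how git-annex works internally; the names file_digest and missing_from are hypothetical, and it assumes Python 3.9 or later.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def missing_from(old_dir: str, new_dir: str) -> list[Path]:
    """List files under old_dir whose content appears nowhere under new_dir."""
    # Hash everything in the newer copy once, ignoring file names.
    known = {file_digest(p) for p in Path(new_dir).rglob("*") if p.is_file()}
    # Anything in the old copy with an unknown hash would be lost on deletion.
    return sorted(
        p for p in Path(old_dir).rglob("*")
        if p.is_file() and file_digest(p) not in known
    )
```

An empty result suggests the old directory holds nothing that the new one lacks. Because files are matched by content rather than by name, renamed or moved files still count as preserved.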

However, git-annex is not a backup system. Backups have to be able to survive erroneous deletion or something like the CryptoLocker ransomware. ZFS snapshots on my fileservers provide this functionality, so git-annex will have to store data there.

git-annex’s defaults seem sane. It hangs on to every old version of a file, just like git. The content of files in the annex is protected from modification by making it read-only. This prevents normal deletion (not dd deletion), and causes one to think about annexed files differently. Those files can be changed by unlocking them, but some sets of files should stay read-only. The only thing I am going to change is the default numcopies: from 1 to 3, following the 3-2-1 rule.

from scratch

I tried looking for others who have gone from not using git-annex at all to finally organizing everything using it. One of the posts I could find was at endot.org: the author seems to have had the exact same problem as I do now:

Unfortunately, I’m not very organized. When I encounter data that I want to keep, I usually rsync it onto one or another external drive or server. However, since the data is not organized, I can’t tell how much of it can simply be deleted instead of backed up again. The actual amount of data that should be backed up is probably less than half of the amount of data that exists on the various internal and external drives both at home and at work. This also means that most of my hard drives are at 90% capacity and I don’t know what I can safely delete. (from “Managing backups with git-annex”)

Using git-annex seems to have worked - endot.org also has a page on git-annex tips which mentions that they have organized their files into 9 large repositories over the course of a year.

There is also a git-annex page called “centralized repository: starting from nothing” that sounds promising but doesn’t include tips for organizing files before they get annexed. This is important because git-annex hangs on to old files and I would rather that it didn’t hang on to duplicates. There are ways of removing files from the annex, but the process is arduous.
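One way to tidy up before annexing anything is to look for byte-identical files first. The following is a minimal Python sketch under that assumption; find_duplicates is a hypothetical helper, not part of git-annex, and it reads each file fully into memory, so it is only suitable as-is for modest file sizes.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: str) -> list[list[Path]]:
    """Group files under root that have byte-identical content."""
    by_digest = defaultdict(list)
    # Sort paths so that the groups come out in a stable order.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; chunked hashing would
            # be the better choice for very large files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the hashes that were seen more than once.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists paths with the same content, so all but one copy in a group could be reviewed for deletion before running the directory through git-annex.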

I’ll have to do this one step at a time.