Content-addressable storage

In Git, the content of every file is stored as a blob, filed under a hash of that content. One consequence is that a given piece of content is only ever stored once in the repository, regardless of how many versions it appears in and under how many names. Each of those instances is a pointer to the same blob of data.
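To make the idea concrete, here is a minimal sketch of how a blob's name is derived. It assumes the default SHA-1 object format (repositories can also be created with SHA-256), and the file names are made up for illustration. The function name `git_blob_id` is my own, not part of Git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the id Git gives a blob, assuming the SHA-1 object format."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical example: the same bytes under two different names.
scan = b"hello\n"
ids = {
    "page-001.png": git_blob_id(scan),
    "renamed/cover.png": git_blob_id(scan),
}

# Both names map to the same id, so the content is stored only once.
print(ids["page-001.png"] == ids["renamed/cover.png"])  # True
```

The file name never enters the hash, which is why renaming a file cannot create a new blob. Running `git hash-object` on a file should give the same id as this function.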

I'm using Git to manage a large collection of binary files (scanned images). I'm using version control so I can add new data, replace bad scans, and rename or reorder files reversibly. I've noticed two great things about this:

First, since I frequently add new files and rename files, but only occasionally delete or modify them, the repository (containing the entire revision history) is only slightly larger than the current set of files checked out from it. This is true for many kinds of binary data (e.g. photos, music), so it makes using Git very attractive: why would you keep one backup copy when, for about the same amount of space, you could have a complete version history?

Second, pushing changes is very fast. Even after I rename hundreds of files, Git only needs to push a few kilobytes rather than huge amounts of binary data, because the content of every file is already in the remote repository, just possibly under a different name. This is far more economical than a plain rsync, which by default treats a renamed file as a new one and transfers it again.
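The effect of a mass rename can be illustrated with a toy model. This is only a sketch, not how Git's transfer protocol actually works: it hashes the raw content with plain SHA-1 (without Git's blob header), and the names `object_id` and `objects_to_send` and all the file names are hypothetical.

```python
import hashlib

def object_id(content: bytes) -> str:
    # Toy content address: plain SHA-1 of the bytes (not Git's exact format).
    return hashlib.sha1(content).hexdigest()

def objects_to_send(local_tree: dict, remote_ids: set) -> set:
    """Return the ids of file contents the remote does not have yet."""
    return {object_id(data) for data in local_tree.values()} - remote_ids

# 300 made-up "scans", all already pushed to the remote.
before = {f"scan-{i:03d}.png": f"image data {i}".encode() for i in range(300)}
remote_ids = {object_id(data) for data in before.values()}

# Rename every file; the contents are untouched.
after = {
    f"chapter-1/page-{i:03d}.png": data
    for i, (_, data) in enumerate(sorted(before.items()))
}

missing = objects_to_send(after, remote_ids)
print(len(missing))  # 0
```

In real Git, the push would still have to send the new commit and the tree objects that record the new names, which accounts for the few kilobytes mentioned above. None of the file contents need to travel again.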
