git-annex allows managing files with git, without checking the file
contents into git. While that may seem paradoxical, it is useful when
dealing with files larger than git can currently easily handle, whether due
to limitations in memory, checksumming time, or disk space.

Even without file content tracking, being able to manage files with git,
move files around and delete files with versioned directory trees, and use
branches and distributed clones, are all very handy reasons to use git. And
annexed files can co-exist in the same git repository with regularly
versioned files, which is convenient for maintaining documents, Makefiles,
etc. that are associated with annexed files but that benefit from full
revision control.

My motivation for git-annex was the growing number of external drives I
use. Some are used to archive data, others hold backups, and yet others
come with me when I'm away from home to carry data that doesn't fit on my
netbook. Maintaining all that was a nightmare: lots of ad-hoc moving files
around, rsyncing files (unison is too slow), and deleting multiple copies
of files from multiple places. I realized that what I needed was a form of
revision control where each drive was a repository, and where copying the
files around and deciding which copies were safe to delete was automated.
I posted about this to the VCS-home mailing list and got a great suggestion
to make it support arbitrary key-value stores, for more generality and
flexibility. A week of coding later, git-annex was born.

Enough of the broad picture; here's how it actually looks:

* `git annex add $file` moves the file into `.git/annex/`, and replaces
  it with a symlink pointing at the annexed file, and then calls `git add`
  to version the *symlink*. (If the file has already been annexed, it does
  nothing.) 
  
  If you then use normal git push/pull commands, the annexed file content
  won't be transferred between repositories, but the symlinks will be.
  So different clones of a repository can have different sets of annexed
  files available.
  
  You can move the symlink around, copy it, delete it, etc, and commit changes
  as desired using git. Reading the symlink will always get you the annexed
  file content, or the link may be broken if the content is not currently
  available.
* `git annex get $file` is used to transfer a specified file from the
  backend storage to the current repository.
* `git annex drop $file` indicates that you no longer want the file's
  content to be available in this repository.
* `git annex file $file` adjusts the symlink for the file to point to its
  content again. Use this if you've moved the file around.
* `git annex unannex $file` undoes a `git annex add`. But use `git annex drop`
  if you're just done with a file; only use `unannex` if you
  accidentally added a file. (You can also run this on all your annexed
  files come the Singularity. ;-)
* `git annex init "some description"` allows associating some description
  (such as "USB archive drive 1") with a repository. This can help with
  finding it later, see "Location Tracking" below.

Oh yeah, "$file" in the above can be any number of files, or directories,
same as you'd pass to "git add" or "git rm".
So "git annex add ." or "git annex get dir/" work fine.
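
To make the `git annex add` step concrete, here is a minimal Python sketch of
the move-and-symlink dance described above. It is an illustration, not the
git-annex code: the `annex_add` function and the `demo-key` name are made up
for the example, the key is handed in by the caller rather than generated by a
backend, and the final `git add` of the symlink is left as a comment. It
assumes a POSIX filesystem with symlink support.

```python
import os
import shutil
import tempfile


def annex_add(repo, relpath, key):
    """Move repo/relpath into repo/.git/annex/<key>, leaving a symlink behind.

    `key` is supplied by the caller here; a real backend would derive it.
    """
    src = os.path.join(repo, relpath)
    if os.path.islink(src):
        return  # already annexed: nothing to do
    annex_dir = os.path.join(repo, ".git", "annex")
    os.makedirs(annex_dir, exist_ok=True)
    dest = os.path.join(annex_dir, key)
    shutil.move(src, dest)
    # Relative link target, computed from the directory holding the symlink.
    target = os.path.relpath(dest, os.path.dirname(src))
    os.symlink(target, src)
    # A real implementation would now run `git add` on the symlink.


def demo():
    """Annex one file in a throwaway directory and report what is left behind."""
    repo = tempfile.mkdtemp()
    os.makedirs(os.path.join(repo, "docs"))
    link = os.path.join(repo, "docs", "big.iso")
    with open(link, "w") as f:
        f.write("pretend this is large")
    annex_add(repo, os.path.join("docs", "big.iso"), "demo-key")
    with open(link) as f:
        content = f.read()  # reading through the symlink gets the content
    result = (os.path.islink(link), os.readlink(link), content)
    shutil.rmtree(repo)
    return result
```

Calling `demo()` returns whether the original path is now a symlink, where it
points, and the content read through it.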

## key-value storage

git-annex uses a key-value abstraction layer to allow file contents to be
stored in different ways. In theory, any key-value storage system could be
used to store the file contents, and git-annex would then retrieve them
as needed and put them in `.git/annex/`.

When a file is annexed, a key is generated from its content and/or metadata.
The file checked into git symlinks to the key. This key can later be used
to retrieve the file's content (its value). This key generation must be
stable for a given file content, name, and size.

Multiple pluggable backends are supported, and more than one can be used
to store different files' contents in a given repository.

* `WORM` ("Write Once, Read Many") -- This backend stores the file's content
  only in `.git/annex/`, and assumes that any file with the same basename,
  size, and modification time has the same content. So with this backend,
  files can be moved around, but should never be added to or changed.
  This is the default, and the least expensive backend.
* `SHA1` -- This backend stores the file's content in
  `.git/annex/`, with a name based on its sha1 checksum. This backend allows
  modifications of files to be tracked. Its need to generate checksums
  can make it slow for large files.
* `URL` -- This backend downloads the file's content from an external URL.
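
The difference between the `WORM` and `SHA1` approaches to key generation can
be sketched in a few lines of Python. The key formats below (`WORM:...` and
`SHA1:...`) are assumptions made up for this example, not the format git-annex
actually writes; the point is only which inputs each key depends on.

```python
import hashlib
import os


def worm_key(path):
    """Hypothetical WORM-style key: mtime, size and basename; content is not read."""
    st = os.stat(path)
    return "WORM:%d:%d:%s" % (int(st.st_mtime), st.st_size, os.path.basename(path))


def sha1_key(path, chunk_size=1 << 20):
    """Hypothetical SHA1-style key: checksum of the content, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "SHA1:" + digest.hexdigest()
```

The WORM-style key costs one `stat` call, which is why it is cheap, but it
cannot notice a modification that keeps the size and modification time the
same. The SHA1-style key has to read the whole file, which is what makes it
slow for large files.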

## copies

The WORM and SHA1 key-value backends store data inside your git repository.
It's important that data not get lost by an ill-thought-out `git annex drop`
command. So, when using those backends, git-annex can be configured to try
to keep N copies of a file's content available across all repositories. By
default, N is 1; it is configured by `annex.numcopies`.

`git annex drop` checks with other git remotes to verify that N copies of
the file exist in other repositories. If it cannot verify that enough
repositories have the content, it retains the file content to avoid data loss.

For example, consider three repositories: Server, Laptop, and USB. Both Server
and USB have a copy of a file, and N=1. If on Laptop, you `git annex get
$file`, this will transfer it from either Server or USB (depending on which
is available), and there are now 3 copies of the file.

Suppose you want to free up space on Laptop again, and you `git annex drop` the file
there. If USB is connected, or Server can be contacted, git-annex can check
that it still has a copy of the file, and the content is removed from
Laptop. But if USB is currently disconnected, and Server also cannot be
contacted, it can't verify that it is safe to drop the file, and will
refuse to do so.

With N=2, in order to drop the file content from Laptop, it would need access
to both USB and Server.

Note that different repositories can be configured with different values of
N. So just because Laptop has N=2, this does not prevent the number of
copies falling to 1, when USB and Server have N=1.
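
The Server/Laptop/USB walk-through boils down to a small rule, sketched here
in Python. This is my reading of the examples above, not the actual
implementation: the `can_drop` function and its arguments are hypothetical,
and "reachable" stands in for whatever check git-annex performs against a
remote.

```python
def can_drop(holders, reachable, here, numcopies=1):
    """Return True if dropping the content from `here` is verifiably safe.

    holders:   set of repository names believed to hold the content
    reachable: set of repository names that can be contacted right now
    numcopies: the N configured in the repository doing the drop
    """
    # Only copies that can be checked right now, other than our own, count.
    verified = (holders & reachable) - {here}
    return len(verified) >= numcopies


holders = {"Server", "Laptop", "USB"}
print(can_drop(holders, {"USB"}, "Laptop"))                 # True
print(can_drop(holders, set(), "Laptop"))                   # False
print(can_drop(holders, {"USB"}, "Laptop", numcopies=2))    # False
print(can_drop(holders, {"USB", "Server"}, "Laptop", 2))    # True
```

Because `numcopies` is read from the repository doing the drop, a drop run on
USB or Server with N=1 is judged by that lower value.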

## location tracking

git-annex keeps track of the repositories in which it last saw a file's content.
This location tracking information is stored in `.git-annex/$key.log`.
Repositories record their UUID and the date when they get or drop 
a file's content. (Git is configured to use a union merge for this file,
so the lines may be in arbitrary order, but it will never conflict.)
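
Here is a rough Python sketch of why a union merge is safe for such a log. The
line format used (`<unix time> <1 or 0> <uuid>`, with 1 meaning "got" and 0
meaning "dropped") is an assumption invented for the example, since the real
log format is not spelled out here; both function names are hypothetical too.

```python
def union_merge(ours, theirs):
    """Union merge: keep every distinct line from both sides; order is irrelevant."""
    return sorted(set(ours) | set(theirs))


def current_locations(lines):
    """Return the UUIDs whose most recent entry says the content is present."""
    latest = {}
    for line in lines:
        stamp, present, uuid = line.split()
        stamp = int(stamp)
        if uuid not in latest or stamp >= latest[uuid][0]:
            latest[uuid] = (stamp, present == "1")
    return {uuid for uuid, (_, present) in latest.items() if present}


ours = ["100 1 c0a28e06-d7ef-11df-885c-775af44f8882"]
theirs = [
    "100 1 c0a28e06-d7ef-11df-885c-775af44f8882",
    "200 0 c0a28e06-d7ef-11df-885c-775af44f8882",
    "150 1 e1938fee-d95b-11df-96cc-002170d25c55",
]
merged = union_merge(ours, theirs)
print(current_locations(merged))
```

Since each line carries its own date, the newest entry per UUID decides,
regardless of the order in which the merge left the lines.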

This location tracking information is useful if you have multiple
repositories, and not all are always accessible. For example, perhaps one
is on a home file server, and you are away from home. Then git-annex can
tell you what git remote it needs access to in order to get a file:

	# git annex get myfile
	get myfile (need access to one of these remotes: home)
	git-annex: get myfile failed

Another way the location tracking comes in handy is if you put repositories
on removable USB drives, that might be archived away offline in a safe
place. In this sort of case, you probably don't have git remotes
configured for every USB drive. So git-annex may have to resort to talking
about repository UUIDs. If you have previously used "git annex init"
to attach descriptions to those repositories, it will include their
descriptions to help you with finding them:

	# git annex get myfile
	get myfile (No available git remotes have the file.)
	  It has been seen before in these repositories:
	  	c0a28e06-d7ef-11df-885c-775af44f8882  -- USB archive drive 1
	  	e1938fee-d95b-11df-96cc-002170d25c55
	git-annex: get myfile failed

## configuration

* `annex.uuid` -- a unique UUID for this repository
* `annex.numcopies` -- number of copies of files to keep across all
  repositories (default: 1)
* `annex.backends` -- space-separated list of names of 
  the key-value backends to use. The first listed is used to store
  new files. (default: "WORM SHA1 URL")
* `remote.<name>.annex-cost` -- When determining which repository to
  transfer annexed files from or to, ones with lower costs are preferred.
  The default cost is 100 for local repositories, and 200 for remote
  repositories. Note that other factors may be considered when pushing
  files to repositories, in particular, whether the repository is on
  a filesystem with sufficient free space.
* `remote.<name>.annex-uuid` -- git-annex caches UUIDs of repositories
  here.
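
As an illustration of how `remote.<name>.annex-cost` could drive the choice of
remote, here is a small Python sketch. The function names and the plain dict
standing in for git's configuration are assumptions for the example; only the
default costs of 100 and 200 come from the description above.

```python
DEFAULT_LOCAL_COST = 100
DEFAULT_REMOTE_COST = 200


def remote_cost(name, is_local, config):
    """Look up remote.<name>.annex-cost, falling back to the documented defaults."""
    value = config.get("remote.%s.annex-cost" % name)
    if value is not None:
        return int(value)
    return DEFAULT_LOCAL_COST if is_local else DEFAULT_REMOTE_COST


def order_remotes(remotes, config):
    """Sort (name, is_local) pairs so the cheapest remote is tried first."""
    return sorted(remotes, key=lambda r: remote_cost(r[0], r[1], config))


remotes = [("server", False), ("usb", True), ("slowdisk", True)]
config = {"remote.slowdisk.annex-cost": "500"}
print([name for name, _ in order_remotes(remotes, config)])
# ['usb', 'server', 'slowdisk']
```

Giving the hypothetical `slowdisk` remote a cost of 500 pushes it behind the
ssh server, even though it is local.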

## issues

### symlinks

If the symlink to annexed content is relative, moving it to a subdir will
break it. But if it's absolute, moving the git repo (or mounting its drive
elsewhere) will break it. Either:

* Use relative links and need `git annex mv` to move (or post-commit 
  hook that caches moves and updates links).
* Use absolute links and need `git annex fixlinks` when location changes;
  note that would also mean that git would see the symlink targets changed
  and want to commit the change. And, other clones of the repo would
  diverge and there would be conflicts on the symlink text. Ugh.

Hard links are not an option, because git would then happily commit the
file content. Among other reasons.
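
The relative-link problem is easy to reproduce. The Python sketch below builds
a throwaway directory, moves a relative symlink into a subdirectory to break
it, then recomputes the target to repair it. The `fix_link` helper and the
`demo-key` name are made up for the example, and it assumes a POSIX filesystem
with symlink support.

```python
import os
import shutil
import tempfile


def fix_link(link, content):
    """Re-point a relative symlink at `content` from the link's current directory."""
    target = os.path.relpath(content, os.path.dirname(link))
    os.remove(link)
    os.symlink(target, link)


def demo():
    repo = tempfile.mkdtemp()
    content = os.path.join(repo, ".git", "annex", "demo-key")
    os.makedirs(os.path.dirname(content))
    with open(content, "w") as f:
        f.write("data")
    link = os.path.join(repo, "file")
    os.symlink(os.path.join(".git", "annex", "demo-key"), link)
    ok_before = os.path.exists(link)      # the link resolves
    os.mkdir(os.path.join(repo, "sub"))
    moved = os.path.join(repo, "sub", "file")
    os.rename(link, moved)                # moves the link itself, not the content
    broken = not os.path.exists(moved)    # target is now looked up under sub/
    fix_link(moved, content)
    ok_after = os.path.exists(moved)
    shutil.rmtree(repo)
    return ok_before, broken, ok_after
```

`demo()` reports whether the link worked before the move, whether the move
broke it, and whether recomputing the relative target repaired it.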

### free space determination

Need a way to tell how much free space is available on the disk containing
a given repository. The repository may be remote, so ssh may need to be
used.

Similarly, need a way to tell the size of a file before copying it from
a remote, to check local disk space.
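
For a repository on a locally mounted disk, the check could be as simple as
the Python sketch below. It uses the standard library's `shutil.disk_usage`;
the `has_room_for` helper and its `reserve` parameter are hypothetical. It
does nothing for the remote case, which would still need something run over
ssh.

```python
import shutil


def has_room_for(repo_path, file_size, reserve=0):
    """True if the filesystem holding repo_path has file_size + reserve bytes free."""
    free = shutil.disk_usage(repo_path).free
    return free >= file_size + reserve
```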

### auto-drop on rm

When git-rm removes a file, its key should get dropped too. Of course, it
may not be dropped right away, depending on number of copies available.

### branching

The use of `.git-annex` to store logs means that if a repo has branches 
and the user switches between them, git-annex will see different logs in
the different branches, and so may miss info about what remotes have which
files (though it can re-learn). An alternative would be to
store the log data directly in the git repo as `pristine-tar` does.

## contact

Joey Hess <joey@kitenet.net>