A walkthrough of the basic features of git-annex.

[[!toc]]

## creating a repository

This is very straightforward. Just give git-annex a description of the repository.

	# mkdir ~/annex
	# cd ~/annex
	# git init
	# git annex init "my laptop"

## adding a remote

Like any other git repository, git-annex repositories have remotes.
Let's start by adding a USB drive as a remote.

	# sudo mount /media/usb
	# cd /media/usb
	# git clone ~/annex
	# cd annex
	# git annex init "portable USB drive"
	# git remote add laptop ~/annex
	# cd ~/annex
	# git remote add usbdrive /media/usb/annex

This is all standard ad-hoc distributed git repository setup.
The only git-annex specific part is telling it the name
of the new repository created on the USB drive.

Notice that both repos are set up as remotes of one another. This lets
either get annexed files from the other. You'll want to do that even
if you are using git in a more centralized fashion.

## adding files

	# cd ~/annex
	# cp /tmp/big_file .
	# cp /tmp/debian.iso .
	# git annex add .
	add big_file ok
	add debian.iso ok
	# git commit -a -m added

When you add a file to the annex and commit it, only a symlink to
the annexed content is committed. The content itself is stored in
git-annex's backend.
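
To see why committing a symlink is cheap, here is a minimal sketch using
plain git only (git-annex is not involved). The `.git/demo-store/key1` path
is a made-up stand-in for wherever the annexed content lives. Git records
the link as a mode 120000 entry whose blob is just the target path.

```shell
# Sketch with plain git: a symlink is stored as a small mode-120000 entry,
# no matter how large the file it points at is.
# ".git/demo-store/key1" is a hypothetical stand-in for annexed content.
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
git init -q "$repo"

mkdir "$repo/.git/demo-store"
echo "pretend this is gigabytes of data" > "$repo/.git/demo-store/key1"
ln -s .git/demo-store/key1 "$repo/big_file"

git -C "$repo" add big_file

# First field of ls-files -s is the mode; the staged blob is the link text.
mode=$(git -C "$repo" ls-files -s big_file | cut -d' ' -f1)
target=$(git -C "$repo" cat-file -p :big_file)
echo "mode=$mode target=$target"
```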

## renaming files

	# cd ~/annex
	# git mv big_file my_cool_big_file
	# mkdir iso
	# git mv debian.iso iso/
	# git commit -m moved

You can use any normal git operations to move files around, or even
make copies or delete them.

Notice that, since annexed files are represented by symlinks,
the symlink will break when the file is moved into a subdirectory.
But, git-annex will fix this up for you when you commit --
it has a pre-commit hook that watches for and corrects broken symlinks.
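
The breakage is ordinary relative-symlink behavior, and can be reproduced
without git-annex. In this sketch the `store/key1` layout is hypothetical,
and the repair is done by hand; it only illustrates the kind of fix the
pre-commit hook is described as making.

```shell
# Sketch: a relative symlink breaks when moved into a subdirectory,
# and works again once it is re-pointed relative to its new location.
# The "store/key1" layout is a made-up stand-in for the annex.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/store" "$work/iso"
echo "pretend this is a big ISO" > "$work/store/key1"

ln -s store/key1 "$work/debian.iso"   # valid from the top level
mv "$work/debian.iso" "$work/iso/"    # the link text is unchanged by mv

# [ -e ] follows the link, so it is false for a dangling symlink.
if [ -e "$work/iso/debian.iso" ]; then before=ok; else before=broken; fi

# Re-point the link with one more "../" to account for the new depth.
rm "$work/iso/debian.iso"
ln -s ../store/key1 "$work/iso/debian.iso"
if [ -e "$work/iso/debian.iso" ]; then after=ok; else after=broken; fi

echo "after move: $before, after fix: $after"
```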

## getting file content

A repository does not always have all annexed file contents available.
When you need the content of a file, you can use "git annex get" to
make it available.

We can use this to copy everything in the laptop's annex to the
USB drive.

	# cd /media/usb/annex
	# git pull laptop master
	# git annex get .
	get my_cool_big_file (copying from laptop...) ok
	get iso/debian.iso (copying from laptop...) ok

Notice that you had to git pull from laptop first. This lets git-annex know
what has changed in laptop, so it knows about the files present there and
can get them.

## transferring files: When things go wrong

After a while, you'll have several annexes, with different file contents.
You don't have to try to keep all that straight; git-annex does 
[[location_tracking]] for you. If you ask it to get a file and the drive
or file server is not accessible, it will let you know what it needs to get
it:

	# git annex get video/hackity_hack_and_kaxxt.mov
	get video/hackity_hack_and_kaxxt.mov (not available)
	  Unable to access these remotes: usbdrive, server
	  Try making some of these repositories available:
	  	5863d8c0-d9a9-11df-adb2-af51e6559a49  -- my home file server
	   	58d84e8a-d9ae-11df-a1aa-ab9aa8c00826  -- portable USB drive
	   	ca20064c-dbb5-11df-b2fe-002170d25c55  -- backup SATA drive
	failed
	# sudo mount /media/usb
	# git annex get video/hackity_hack_and_kaxxt.mov
	get video/hackity_hack_and_kaxxt.mov (copying from usbdrive...) ok
	# git commit -a -m "got a video I want to rewatch on the plane"

## removing files

You can always drop files safely. Git-annex checks that some other annex
has the file before removing it.

	# git annex drop iso/debian.iso
	drop iso/debian.iso ok
	# git commit -a -m "freed up space"

## removing files: When things go wrong

Before dropping a file, git-annex wants to be able to look at other
remotes, and verify that they still have a file. After all, it could
have been dropped from them too. If the remotes are not mounted/available,
you'll see something like this.

	# git annex drop important_file other.iso
	drop important_file (unsafe)
	  Could only verify the existence of 0 out of 1 necessary copies
	  Unable to access these remotes: usbdrive
	  Try making some of these repositories available:
	   	58d84e8a-d9ae-11df-a1aa-ab9aa8c00826  -- portable USB drive
	   	ca20064c-dbb5-11df-b2fe-002170d25c55  -- backup SATA drive
	  (Use --force to override this check, or adjust annex.numcopies.)
	failed
	drop other.iso (unsafe)
	  Could only verify the existence of 0 out of 1 necessary copies
	  No other repository is known to contain the file.
	  (Use --force to override this check, or adjust annex.numcopies.)
	failed

Here you might --force it to drop `important_file` if you [[trust]] your backup.
But `other.iso` looks to have never been copied to anywhere else, so if
it's something you want to hold onto, you'd need to transfer it to
some other repository before dropping it.
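
The check behind those messages can be pictured roughly like this. This is
only an illustration of the idea, not git-annex's implementation: the
`count_copies` and `safe_to_drop` helpers are hypothetical, and each
"remote" is just a directory that may or may not be mounted.

```shell
# Sketch of the "verify N copies before dropping" idea.
# Helper names and the directory-per-remote layout are assumptions.

# count_copies KEY DIR...: print how many of the directories contain KEY.
count_copies() {
    key=$1
    shift
    found=0
    for dir in "$@"; do
        if [ -e "$dir/$key" ]; then
            found=$((found + 1))
        fi
    done
    echo "$found"
}

# safe_to_drop NUMCOPIES KEY DIR...: succeed only if enough copies were seen.
safe_to_drop() {
    needed=$1
    shift
    [ "$(count_copies "$@")" -ge "$needed" ]
}

remotes=$(mktemp -d)
trap 'rm -rf "$remotes"' EXIT
usb="$remotes/usb"   # does not exist yet, like an unmounted drive

if safe_to_drop 1 important_file "$usb"; then unmounted=ok; else unmounted=unsafe; fi

# "Mount" the drive and put a copy on it, then ask again.
mkdir "$usb"
echo data > "$usb/important_file"
if safe_to_drop 1 important_file "$usb"; then mounted=ok; else mounted=unsafe; fi

echo "drive unavailable: $unmounted, drive available: $mounted"
```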

## modifying annexed files

Normally, the content of files in the annex is prevented from being modified.
That's a good thing, because it might be the only copy, and you wouldn't
want to lose it in a fumblefingered mistake.

	# echo oops > my_cool_big_file
	bash: my_cool_big_file: Permission denied

In order to modify a file, it should first be unlocked.

	# git annex unlock my_cool_big_file
	unlock my_cool_big_file (copying...) ok

That replaces the symlink that normally points at its content with a copy
of the content. You can then modify the file like any regular file. Because
it is a regular file.

(If you decide you don't need to modify the file after all, or want to discard
modifications, just use `git annex lock`.)
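
The effect of unlocking can be imitated with plain `cp` and `mv`. This is a
sketch of the idea only, not how git-annex does it, and the `store/key1`
layout is hypothetical. It shows that after the swap the file is no longer
a symlink, and that editing it leaves the stored content untouched.

```shell
# Sketch: replace a symlink with a private copy of what it points at.
# The "store/key1" layout is a made-up stand-in for the annex.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/store"
echo "original content" > "$work/store/key1"
ln -s store/key1 "$work/my_cool_big_file"

# cp follows the symlink and copies the content; mv then replaces the link.
cp "$work/my_cool_big_file" "$work/my_cool_big_file.tmp"
mv "$work/my_cool_big_file.tmp" "$work/my_cool_big_file"

if [ -L "$work/my_cool_big_file" ]; then kind=symlink; else kind=regular; fi

# Edits now land in the copy, not in the stored content.
echo "now smaller, but even cooler" > "$work/my_cool_big_file"
stored=$(cat "$work/store/key1")
echo "file is $kind; stored content is still: $stored"
```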

When you `git commit`, git-annex's pre-commit hook will automatically
notice that you are committing an unlocked file, and add its new content
to the annex. The file will be replaced with a symlink to the new content,
and this symlink is what gets committed to git in the end.

	# echo "now smaller, but even cooler" > my_cool_big_file
	# git commit my_cool_big_file -m "changed an annexed file"
	add my_cool_big_file ok
	[master 64cda67] changed an annexed file
	 2 files changed, 2 insertions(+), 1 deletions(-)
	 create mode 100644 .git-annex/WORM:1289672605:30:file.log

There is one problem with using `git commit` like this: Git wants to first
stage the entire contents of the file in its index. That can be slow for
big files (sorta why git-annex exists in the first place). So, the
automatic handling on commit is a nice safety feature, since it prevents
the file content being accidentally committed into git. But when working with
big files, it's faster to explicitly add them to the annex yourself
before committing.

	# echo "now smaller, but even cooler yet" > my_cool_big_file
	# git annex add my_cool_big_file
	add my_cool_big_file ok
	# git commit my_cool_big_file -m "changed an annexed file"

## using ssh remotes

So far in this walkthrough, git-annex has been used with a remote
repository on a USB drive. But it can also be used with a git remote
that is truly remote, a host accessed by ssh.

Say you have a desktop on the same network as your laptop and want
to clone the laptop's annex to it:

	# git clone ssh://mylaptop/home/me/annex ~/annex
	# cd ~/annex
	# git annex init "my desktop"

Now you can get files and they will be transferred (using `rsync`):

	# git annex get my_cool_big_file
	get my_cool_big_file (getting UUID for origin...) (copying from origin...)
	WORM:1285650548:2159:my_cool_big_file       100% 2159     2.1KB/s   00:00
	ok

When you drop files, git-annex will ssh over to the remote and make
sure the file's content is still there before removing it locally:

	# git annex drop my_cool_big_file
	drop my_cool_big_file (checking origin..) ok

Note that normally git-annex prefers to use non-ssh remotes, like
a USB drive, before ssh remotes. They are assumed to be faster/cheaper to
access, if available. There is an annex-cost setting you can configure in
`.git/config` to adjust which repositories it prefers. See
[[the_man_page|git-annex]] for details.
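
As a sketch, the setting can be written with `git config` instead of editing
`.git/config` by hand. This assumes the key is spelled
`remote.<name>.annex-cost` and that lower costs are preferred, which is how
the man page describes it; the remote names, URLs and numbers below are made
up, so check the man page before relying on them.

```shell
# Sketch: set per-remote annex-cost values in a scratch repository.
# Remote names, URLs and the cost numbers are hypothetical examples.
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
git init -q "$repo"

git -C "$repo" remote add usbdrive /media/usb/annex
git -C "$repo" remote add desktop ssh://mydesktop/home/me/annex

# Assumed semantics: the lower the cost, the more the remote is preferred.
git -C "$repo" config remote.usbdrive.annex-cost 50
git -C "$repo" config remote.desktop.annex-cost 300

usb_cost=$(git -C "$repo" config --get remote.usbdrive.annex-cost)
desktop_cost=$(git -C "$repo" config --get remote.desktop.annex-cost)
echo "usbdrive=$usb_cost desktop=$desktop_cost"
```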

Also, note that you need full shell access for this to work -- 
git-annex needs to be able to ssh in and run commands.

## moving file content between repositories

Often you will want to move some file contents from a repository to some
other one. For example, your laptop's disk is getting full; time to move
some files to an external disk before moving another file from a file
server to your laptop. Doing that by hand (by using `git annex get` and
`git annex drop`) is possible, but a bit of a pain. `git annex move`
makes it very easy.

	# git annex move my_cool_big_file --to usbdrive
	move my_cool_big_file (moving to usbdrive...) ok
	# git annex move video/hackity_hack_and_kaxxt.mov --from fileserver
	move video/hackity_hack_and_kaxxt.mov (moving from fileserver...)
	WORM:1274316523:86050597:hackity_hack_and_kax 100%   82MB 199.1KB/s   07:02
	ok

## using the URL backend

git-annex has multiple key-value [[backends]]. So far this walkthrough has
demonstrated the default, WORM (Write Once, Read Many) backend. 

Another handy backend is the URL backend, which can fetch a file's content
from remote URLs. Here's how to set up some files in your repository
that use this backend:

	# git annex fromkey --backend=URL --key=http://www.archive.org/somefile somefile
	fromkey somefile ok
	# git commit -m "added a file from the Internet Archive"

Now if you ask git-annex to get that file, it will download it, 
and cache it locally.

	# git annex get somefile
	get somefile (downloading)
	#########################################################################100.0%
	ok

You can always drop files downloaded by the URL backend. It is assumed
that the URL is stable; no local backup is kept.

	# git annex drop somefile
	drop somefile (ok)

## using the SHA1 backend

Another handy alternative to the default [[backend|backends]] is the
SHA1 backend. This backend provides more git-style assurance that your data
has not been damaged. And the checksum means that when you add the same
content to the annex twice, only one copy need be stored in the backend.
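
Here is a rough sketch of why checksum-based keys store identical content
only once. It is not git-annex's code: the `store` directory and the file
names are made up, and it assumes GNU `sha1sum` is installed (other systems
may ship `shasum` instead).

```shell
# Sketch: name stored content after its SHA1, so duplicates collapse.
# Assumes GNU coreutils' sha1sum; "store" and the file names are made up.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/store"
echo "same bytes" > "$work/a.iso"
echo "same bytes" > "$work/b.iso"

for f in a.iso b.iso; do
    key="SHA1:$(sha1sum < "$work/$f" | cut -d' ' -f1)"
    # The second copy lands on the same name and overwrites identical data.
    cp "$work/$f" "$work/store/$key"
done

stored=$(ls "$work/store" | wc -l | tr -d ' ')
echo "files added: 2, objects stored: $stored"
```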

The only reason it's not the default is that it needs to checksum
files when they're added to the annex, and this can slow things down
significantly for really big files. To make SHA1 the default, just
add something like this to `.gitattributes`:

	* annex.backend=SHA1
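
If you want to confirm the attribute applies to a given file, plain git can
tell you. A small sketch in a scratch repository (the file name is just an
example, and the file does not need to exist):

```shell
# Sketch: ask git which annex.backend attribute a path would get.
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
git init -q "$repo"
echo "* annex.backend=SHA1" > "$repo/.gitattributes"

# check-attr prints "<path>: <attribute>: <value>".
backend=$(git -C "$repo" check-attr annex.backend -- my_cool_big_file)
echo "$backend"
```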

## migrating data to a new backend

Maybe you started out using the WORM backend, and have now configured
git-annex to use SHA1. But files you added to the annex before still
use the WORM backend. There is a simple command that can migrate that
data:

	# git annex migrate my_cool_big_file
	migrate my_cool_big_file (checksum...) ok

You can only migrate files whose content is currently available. Other
files will be skipped.

After migrating a file to a new backend, the old content in the old backend
will still be present. That is necessary because multiple files
can point to the same content. The `git annex unused` subcommand can be
used to clear up that detritus later. Note that hard links are used,
to avoid wasting disk space.
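
A quick sketch of why hard links cost no extra space: both names refer to
the same data on disk, and removing one name leaves the other intact. The
file names are hypothetical, and the `-ef` test is a common shell extension
rather than strict POSIX.

```shell
# Sketch: two hard-linked names share one copy of the data.
# "old_key" and "new_key" are made-up names.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
echo "annexed content" > "$work/old_key"
ln "$work/old_key" "$work/new_key"

# -ef is true when both paths refer to the same file (same device and inode).
if [ "$work/old_key" -ef "$work/new_key" ]; then same=yes; else same=no; fi

# Dropping the old name later does not touch the data under the new name.
rm "$work/old_key"
survivor=$(cat "$work/new_key")
echo "same file: $same, content after cleanup: $survivor"
```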

## unused data

It's possible for data to accumulate in the annex that no files point to
anymore. One way it can happen is if you `git rm` a file without 
first calling `git annex drop`. And, when you modify an annexed file, the old
content of the file remains in the annex. Another way is when migrating
between backends.

This might be historical data you want to preserve, so git-annex defaults to
preserving it. Still, from time to time, you may want to check for such data
and eliminate it to save space.

	# git annex unused
	unused  (checking for unused data...) 
	  Some annexed data is no longer pointed to by any files in the repository.
	    NUMBER  KEY
	    1       WORM:1289672605:3:file
	    2       WORM:1289672605:14:file
	  (To see where data was previously used, try: git log --stat -S'KEY')
	  (To remove unwanted data: git-annex dropunused NUMBER)
	ok

After running `git annex unused`, you can follow the instructions to examine
the history of files that used the data, and if you decide you don't need that
data anymore, you can easily remove it:

	# git annex dropunused 1
	dropunused 1 ok

Hint: To drop a lot of unused data, use a command like this:

	# git annex dropunused `seq 1 1000`

## fsck: verifying your data

You can use the fsck subcommand to check for problems in your data.
What can be checked depends on the [[backend|backends]] you've used to store
the data. For example, when you use the SHA1 backend, fsck will verify that
the checksums of your files are good. Fsck also checks that the annex.numcopies
setting is satisfied for all files.

	# git annex fsck
	unused  (checking for unused data...) ok
	fsck my_cool_big_file (checksum...) ok
	...

You can also specify the files to check.  This is particularly useful if 
you're using SHA1 and don't want to spend a long time checksumming everything.

	# git annex fsck my_cool_big_file
	fsck my_cool_big_file (checksum...) ok

## fsck: When things go wrong

Fsck never deletes possibly bad data; instead it will be moved to
`.git/annex/bad/` for you to recover. Here is a sample of what fsck
might say about a badly messed up annex:

	# git annex fsck
	fsck my_cool_big_file (checksum...)
	git-annex: Bad file content; moved to .git/annex/bad/SHA1:7da006579dd64330eb2456001fd01948430572f2
	git-annex: ** No known copies of the file exist!
	failed
	fsck important_file
	git-annex: Only 1 of 2 copies exist. Run git annex get somewhere else to back it up.
	failed
	git-annex: 2 failed
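
The checksum part of that behavior can be sketched in a few lines of shell.
This is an illustration only, not fsck's implementation: the `store` and
`bad` directories are made up, and GNU `sha1sum` is assumed. Content whose
checksum no longer matches its key is moved aside rather than deleted.

```shell
# Sketch: verify content against its SHA1-based name; quarantine mismatches.
# Assumes GNU coreutils' sha1sum; "store" and "bad" are hypothetical dirs.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/store" "$work/bad"

printf 'precious data\n' > "$work/blob"
key="SHA1:$(sha1sum < "$work/blob" | cut -d' ' -f1)"
mv "$work/blob" "$work/store/$key"

# Simulate bit rot by overwriting the stored content.
printf 'bit rot\n' > "$work/store/$key"

status=ok
for path in "$work"/store/*; do
    name=${path##*/}
    actual="SHA1:$(sha1sum < "$path" | cut -d' ' -f1)"
    if [ "$actual" != "$name" ]; then
        mv "$path" "$work/bad/$name"   # keep the bad data for recovery
        status=failed
    fi
done
echo "check $status"
```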

## backups

git-annex can be configured to require that more than one copy of a file exist,
as a simple backup for your data. This is controlled by the "annex.numcopies"
setting, which defaults to 1 copy. Let's change that to require 2 copies,
and send a copy of every file to a USB drive.

	# echo "* annex.numcopies=2" >> .gitattributes
	# git annex copy . --to usbdrive

Now when we try to `git annex drop` a file, it will verify that it
knows of 2 other repositories that have a copy before removing its
content from the current repository.

You can also vary the number of copies needed, depending on the file name.
So, if you want 3 copies of all your flac files, but only 1 copy of oggs:

	# echo "*.ogg annex.numcopies=1" >> .gitattributes
	# echo "*.flac annex.numcopies=3" >> .gitattributes

Or, you might want to make a directory for important stuff, and configure
it so anything put in there is backed up more thoroughly:

	# mkdir important_stuff
	# echo "* annex.numcopies=3" > important_stuff/.gitattributes

For more details about the numcopies setting, see [[copies]].
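
To check which value ends up applying to a particular file, you can ask
plain git. The sketch below rebuilds the `.gitattributes` files from this
section in a scratch repository; the file names being queried are made up,
and `numcopies_for` is a hypothetical helper. It relies on git's usual
attribute rules, where a later line overrides an earlier one and a
`.gitattributes` closer to the file wins.

```shell
# Sketch: see which annex.numcopies value git resolves for various paths.
# The queried file names and the numcopies_for helper are hypothetical.
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
git init -q "$repo"
mkdir "$repo/important_stuff"

{
    echo "* annex.numcopies=2"
    echo "*.ogg annex.numcopies=1"
    echo "*.flac annex.numcopies=3"
} > "$repo/.gitattributes"
echo "* annex.numcopies=3" > "$repo/important_stuff/.gitattributes"

# check-attr prints "<path>: annex.numcopies: <value>"; keep only the value.
numcopies_for() {
    git -C "$repo" check-attr annex.numcopies -- "$1" | sed 's/.*: //'
}

ogg=$(numcopies_for song.ogg)
flac=$(numcopies_for song.flac)
other=$(numcopies_for notes.txt)
important=$(numcopies_for important_stuff/todo.txt)
echo "ogg=$ogg flac=$flac other=$other important=$important"
```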

## untrusted repositories

Suppose you have a USB thumb drive and are using it as a git annex
repository. You don't trust the drive, because you could lose it, or
accidentally run it through the laundry. Or, maybe you have a drive that
you know is dying, and you'd like to be warned if there are any files
on it not backed up somewhere else. Maybe the drive has already died
or been lost.

You can let git-annex know that you don't trust a repository, and it will
adjust its behavior to avoid relying on that repository's continued
availability.
	
	# git annex untrust usbdrive
	untrust usbdrive ok

Now when you do a fsck, you'll be warned appropriately:

	# git annex fsck .
	fsck my_big_file
	  Only these untrusted locations may have copies of this file!
	  	05e296c4-2989-11e0-bf40-bad1535567fe  -- portable USB drive
	  Back it up to trusted locations with git-annex copy.
	failed

Also, git-annex will refuse to drop a file from elsewhere just because
it can see a copy on the untrusted repository.

It's also possible to tell git-annex that you have an unusually high
level of trust for a repository. See [[trust]] for details.