[The Internet Archive](http://www.archive.org/) allows members to upload
collections using an Amazon S3 
[compatible API](http://www.archive.org/help/abouts3.txt), and this can
be used with git-annex's [[special_remotes/S3]] support. 

So, you can locally archive things with git-annex, define remotes that
correspond to "items" at the Internet Archive, and use git-annex to upload
your files there. Of course, your use of the Internet Archive must
comply with their [terms of service](http://www.archive.org/about/terms.php).

A nice added feature is that whenever git-annex sends a file to the
Internet Archive, it records its url, the same as if you'd run `git annex
addurl`. So any users who can clone your repository can download the files
from archive.org, without needing any login or password info. This makes
the Internet Archive a nice way to publish the large files associated with
a public git repository.

## webapp setup

Just go to "Add Another Repository", pick "Internet Archive",
and you're on your way.

## basic setup

Sign up for an account, and get your access keys here:
<http://www.archive.org/account/s3.php>
	
	# export AWS_ACCESS_KEY_ID=blahblah
	# export AWS_SECRET_ACCESS_KEY=xxxxxxx

Specify `host=s3.us.archive.org` when doing `initremote` to set up
a remote at the Archive. This will enable a special Internet Archive mode:
encryption is not allowed; you are required to specify a bucket name
rather than having git-annex pick a random one; and you can optionally
specify `x-archive-meta*` headers to add metadata as explained in their
[documentation](http://www.archive.org/help/abouts3.txt).

	# git annex initremote archive-panama type=S3 \
		host=s3.us.archive.org bucket=panama-canal-lock-blueprints \
		x-archive-meta-mediatype=texts x-archive-meta-language=eng \
		x-archive-meta-title="original Panama Canal lock design blueprints"
	initremote archive-panama (Internet Archive mode) ok
	# git annex describe archive-panama "a man, a plan, a canal: panama"
	describe archive-panama ok

Then you can annex files and copy them to the remote as usual:

	# git annex add photo1.jpeg --backend=SHA256E
	add photo1.jpeg (checksum...) ok
	# git annex copy photo1.jpeg --fast --to archive-panama
	copy (to archive-panama...) ok

Once a file has been stored on archive.org, it cannot be (easily) removed
from it. Also, `git annex whereis` will tell you a public url for the file
on archive.org. (It may take a while for archive.org to make the file
publicly visible.)

## exporting trees

By default, files stored in the Internet Archive will show up there named
by their git-annex key, not the original filename. If the filenames
are important, you can run `git annex initremote` with an additional
parameter `exporttree=yes`, and then use [[git-annex-export]] to publish
a tree of files to the Internet Archive.
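
The two steps above might look roughly like the following sketch. The
`run` and `publish_tree` helpers are hypothetical, the remote and bucket
names are reused from the earlier example, and `master` is only an assumed
default for the branch to publish. Setting `DRY_RUN=1` prints the commands
instead of running them, which is handy for checking what would be sent.

```shell
# Sketch only: helper names, remote name, bucket and branch are illustrative.
# With DRY_RUN=1 the commands are printed rather than executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

publish_tree() {
    remote=$1 bucket=$2 treeish=${3:-master}   # assumption: publish "master"
    # exporttree=yes asks for files to be stored under their real filenames
    run git annex initremote "$remote" type=S3 \
        host=s3.us.archive.org bucket="$bucket" exporttree=yes
    # publish the files in the given tree to the Internet Archive
    run git annex export "$treeish" --to "$remote"
}
```

For example, `DRY_RUN=1 publish_tree archive-panama panama-canal-lock-blueprints`
shows the two commands without contacting archive.org.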

Note that the Internet Archive does not support filenames containing
whitespace and some other characters. Exporting such problem filenames will
fail; you can rename the file and re-export.
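
To spot likely problem filenames before exporting, a small filter such as
the one below may help. It is a minimal sketch: `names_with_whitespace` is
a hypothetical helper, and it only checks for whitespace, not for the other
characters the Internet Archive rejects.

```shell
# Print every name read from stdin (one per line) that contains whitespace.
# Only whitespace is checked; other unsupported characters are not detected.
names_with_whitespace() {
    # "|| true" keeps the exit status zero when no name matches
    grep '[[:space:]]' || true
}
```

It could be used as `git ls-files | names_with_whitespace`; any file it
lists can then be renamed with `git mv` before running the export.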