aboutsummaryrefslogtreecommitdiff
path: root/doc/tips/Internet_Archive_via_S3.mdwn
blob: be802b5b2ec5c68def810a458d1234b189cd42be (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
[The Internet Archive](http://www.archive.org/) allows members to upload
collections using an Amazon S3 
[compatible API](http://www.archive.org/help/abouts3.txt), and this can
be used with git-annex's [[special_remotes/S3]] support. 

So, you can locally archive things with git-annex, define remotes that
correspond to "items" at the Internet Archive, and use git-annex to upload
your files to there. Of course, your use of the Internet Archive must
comply with their [terms of service](http://www.archive.org/about/terms.php).

A nice added feature is that whenever git-annex sends a file to the
Internet Archive, it records its url, the same as if you'd run `git annex
addurl`. So any users who can clone your repository can download the files
from archive.org, without needing any login or password info. 
The url to the content in the Internet Archive is also displayed by
`git annex whereis`. This makes the Internet Archive a nice way to
publish the large files associated with a public git repository.

## webapp setup

Just go to "Add Another Repository", pick "Internet Archive",
and you're on your way.

## basic setup

Sign up for an account, and get your access keys here:
<http://www.archive.org/account/s3.php>
	
	# export AWS_ACCESS_KEY_ID=blahblah
	# export AWS_SECRET_ACCESS_KEY=xxxxxxx

Specify `host=s3.us.archive.org` when doing `initremote` to set up
a remote at the Archive. This will enable a special Internet Archive mode:
Encryption is not allowed; you are required to specify a bucket name
rather than having git-annex pick a random one; and you can optionally
specify `x-archive-meta*` headers to add metadata as explained in their
[documentation](http://www.archive.org/help/abouts3.txt).

	# git annex initremote archive-panama type=S3 \
		host=s3.us.archive.org bucket=panama-canal-lock-blueprints \
		x-archive-meta-mediatype=texts x-archive-meta-language=eng \
		x-archive-meta-title="original Panama Canal lock design blueprints"
	initremote archive-panama (Internet Archive mode) ok
	# git annex describe archive-panama "a man, a plan, a canal: panama"
	describe archive-panama ok

Then you can annex files and copy them to the remote as usual:

	# git annex add photo1.jpeg --backend=SHA256E
	add photo1.jpeg (checksum...) ok
	# git annex copy photo1.jpeg --fast --to archive-panama
	copy (to archive-panama...) ok

It may take a while for archive.org to make files publically visible after
they've been uploaded.

## removing files

While files can be removed from the Internet Archive, 
[derived versions](https://archive.org/help/derivatives.php)
of some files may continued to be stored there after the originals
were removed. git-annex warns about this problem.

## exporting trees

By default, files stored in the Internet Archive will show up there named
by their git-annex key, not the original filename. If the filenames
are important, you can run `git annex initremote` with an additional
parameter "exporttree=yes", and then use [[git-annex-export]] to publish
a tree of files to the Internet Archive.

Note that the Internet Archive may not support certian characters
in filenames ([see FAQ](http://archive.org/about/faqs.php#1099)).
If exporting a filename fails due to such limitations, you would need
to rename it in your git annex repository in order to export it.