author Joey Hess <joeyh@joeyh.name> 2015-03-03 22:30:20 -0400
committer Joey Hess <joeyh@joeyh.name> 2015-03-03 22:30:20 -0400
commit 68e419ce4cbb442ddf905678421cc8a818f52117
tree 00d7f3a3c5fef804ba8e0d4f6c4be8fae723837a /doc/design
parent 0d13459e18f2a3285f8ea0fe851c2f03801cec23
update
Diffstat (limited to 'doc/design')
-rw-r--r-- doc/design/iabackup.mdwn 27
1 files changed, 25 insertions, 2 deletions
diff --git a/doc/design/iabackup.mdwn b/doc/design/iabackup.mdwn
index fe5b86649..5fe656f08 100644
--- a/doc/design/iabackup.mdwn
+++ b/doc/design/iabackup.mdwn
@@ -1,6 +1,8 @@
This is a fairly detailed design proposal for using git-annex to build
<http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK>
+[[!toc ]]
+
## sharding to scale
The IA contains some 24 million Items.
@@ -33,6 +35,10 @@ them.
* Add new shards as the IA continues to grow.
+Question: How many files are in IA across all Items? It might be better
+to use $item/$file rather than $item.tar as the unit that's stored in
+the git-annex repository. This would need more shards.
+
## the IA git repository
We're building a pyramid of git-annex repositories, and at the tip
@@ -176,6 +182,23 @@ drill.
(Remember to turn off the fire alarm by running
`setpresentkey $key $iauuid 1`)
+## shard servers
+
+A server at the IA (or otherwise with a fast pipe) is needed to serve one or
+more shards. Let's consider what this server needs to have on it:
+
+* git and git-annex
+* ssh server
+* rsync
+* The git repository for the shard. Probably a few hundred mb?
+* The git update hook to filter out bad pushes.
+* Some way to get the content of a given Item from the IA
+ when a client wants to download it. This probably means
+ generating the $item.tar file and buffering it to disk for a while.
+* So, enough disk to buffer a reasonable number of items.
+* Some way to learn when a new user has registered to access a shard,
+ so their ssh key is given access.
+
## other optional nice stuff
The user running a client can delete some or all of their files at any
@@ -226,8 +249,8 @@ this seems excessive).
There may be a thundering herd problem, where many clients end up
downloading the same Item at the same time, and more copies than necessary
result. The next `git annex sync --content` in some of the
-redundant clients will notice this and drop that item, and presumably
-download some other item. However, it might be good to rate limit the
+redundant clients will notice this and drop that Item, and presumably
+download some other Item. However, it might be good to rate limit the
number of concurrent downloads of a given item, to prevent this and perhaps
other issues. This could be done by a wrapper around git-annex shell or
perhaps a git-annex modification.
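
As a rough illustration of the rate-limiting idea in the last hunk, here is a minimal sketch of a per-Item concurrency cap that a wrapper process on the shard server might use. Nothing here comes from git-annex itself: the `ItemSlots` class, the lock directory, and the `max_concurrent` parameter are all hypothetical names, and how the wrapper would learn which key a client is requesting is left out. The sketch only shows the locking technique: each key gets a fixed number of lock files ("slots"), and a download may proceed only if it can take a non-blocking `flock` on one of them. It assumes a Unix host, since it relies on `fcntl`.

```python
import fcntl
import hashlib
import os


class ItemSlots:
    """Caps how many processes may serve the same key at once (hypothetical helper)."""

    def __init__(self, lock_dir, max_concurrent=2):
        # lock_dir and max_concurrent are illustrative; the design doc names no values.
        self.lock_dir = lock_dir
        self.max_concurrent = max_concurrent
        os.makedirs(lock_dir, exist_ok=True)

    def acquire(self, key):
        """Return an open fd holding a slot for this key, or None if all slots are busy."""
        # Hash the key so that any characters in it are safe in a file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        for slot in range(self.max_concurrent):
            path = os.path.join(self.lock_dir, f"{name}.{slot}.lock")
            fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
            try:
                # Non-blocking exclusive lock: fails at once if the slot is taken.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                os.close(fd)
                continue
            return fd
        return None

    def release(self, fd):
        # Closing the descriptor drops the flock, freeing the slot.
        os.close(fd)
```

A wrapper would call `acquire(key)` before handing the transfer off, refuse or delay the request when it returns `None`, and call `release(fd)` once the transfer ends. Because the kernel drops the lock when the holding process exits, a crashed download does not leave a slot stuck.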