author | Joey Hess <joeyh@joeyh.name> | 2015-03-03 22:30:20 -0400 |
---|---|---|
committer | Joey Hess <joeyh@joeyh.name> | 2015-03-03 22:30:20 -0400 |
commit | 68e419ce4cbb442ddf905678421cc8a818f52117 | |
tree | 00d7f3a3c5fef804ba8e0d4f6c4be8fae723837a /doc | |
parent | 0d13459e18f2a3285f8ea0fe851c2f03801cec23 | |
update
Diffstat (limited to 'doc')
-rw-r--r-- | doc/design/iabackup.mdwn | 27 |
1 files changed, 25 insertions, 2 deletions
diff --git a/doc/design/iabackup.mdwn b/doc/design/iabackup.mdwn
index fe5b86649..5fe656f08 100644
--- a/doc/design/iabackup.mdwn
+++ b/doc/design/iabackup.mdwn
@@ -1,6 +1,8 @@
 This is a fairly detailed design proposal for using git-annex to build
 <http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK>
 
+[[!toc ]]
+
 ## sharding to scale
 
 The IA contains some 24 million Items.
@@ -33,6 +35,10 @@
 them.
 * Add new shards as the IA continues to grow.
 
+Question: How many files are in IA across all Items? It might be better
+to use $item/$file rather than $item.tar as the unit that's stored in
+the git-annex repository. This would need more shards.
+
 ## the IA git repository
 
 We're building a pyramid of git-annex repositories, and at the tip
@@ -176,6 +182,23 @@
 drill.
 (Remember to turn off the fire alarm by running `setpresentkey $key $iauuid 1`)
 
+## shard servers
+
+A server at the IA (or otherwise with a fast pipe) is needed to serve one or
+more shards. Let's consider what this server needs to have on it:
+
+* git and git-annex
+* ssh server
+* rsync
+* The git repository for the shard. Probably a few hundred mb?
+* The git update hook to filter out bad pushes.
+* Some way to get the content of a given Item from the IA
+  when a client wants to download it. This probably means
+  generating the $item.tar file and buffering it to disk for a while.
+* So, enough disk to buffer a reasonable number of items.
+* Some way to learn when a new user has registered to access a shard,
+  so their ssh key is given access.
+
 ## other optional nice stuff
 
 The user running a client can delete some or all of their files at any
@@ -226,8 +249,8 @@
 this seems excessive). There may be a thundering herd problem, where many
 clients end up downloading the same Item at the same time, and more copies
 than neecessary result. The next `git annex sync --content` in some of the
-redundant clients will notice this and drop that item, and presumably
-download some other item. However, it might be good to rate limit the
+redundant clients will notice this and drop that Item, and presumably
+download some other Item. However, it might be good to rate limit the
 number of concurrent downloads of a given item, to prevent this and perhaps
 other issues. This could be done by a wrapper around git-annex shell or
 perhaps a git-annex modification.
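
The last hunk suggests rate limiting concurrent downloads of a given Item, possibly with a wrapper around git-annex shell. One way such a wrapper might do this is sketched below in Python, using a fixed number of lock files per key so that the limit also holds across separate processes on the same server. Everything here is an assumption for illustration: the names `try_acquire_slot` and `release_slot`, the lock directory, the limit of 2, and the example key are hypothetical, and how the wrapper would hook into git-annex shell is not shown. It relies on `fcntl.flock`, so it is Unix only.

```python
import fcntl
import hashlib
import os
import tempfile

# Assumed limit; the design does not pick a number.
MAX_CONCURRENT_PER_ITEM = 2


def _slot_path(lock_dir, key, slot):
    # Hash the key so that arbitrary key names map to safe file names.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return os.path.join(lock_dir, f"{digest}.{slot}.lock")


def try_acquire_slot(lock_dir, key, limit=MAX_CONCURRENT_PER_ITEM):
    """Return an fd holding one download slot for key, or None if all are busy."""
    os.makedirs(lock_dir, exist_ok=True)
    for slot in range(limit):
        fd = os.open(_slot_path(lock_dir, key, slot),
                     os.O_RDWR | os.O_CREAT, 0o600)
        try:
            # Non-blocking exclusive lock: fails at once if the slot is taken.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            os.close(fd)
    return None


def release_slot(fd):
    # Closing the descriptor drops the flock held through it.
    os.close(fd)


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()
    key = "example-item-key"  # hypothetical key name
    first = try_acquire_slot(lock_dir, key)
    second = try_acquire_slot(lock_dir, key)
    third = try_acquire_slot(lock_dir, key)  # over the limit, so refused
    print("third download allowed:", third is not None)
    release_slot(first)
    release_slot(second)
```

A wrapper built this way would refuse, or delay, a download when `try_acquire_slot` returns `None`, and the client would presumably move on to some other Item. Because a lock is tied to an open file descriptor, a slot is also freed if the serving process dies mid-transfer.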