Diffstat (limited to 'doc/design')
-rw-r--r-- | doc/design/assistant/chunks.mdwn | 50 |
-rw-r--r-- | doc/design/assistant/deltas.mdwn | 24 |
-rw-r--r-- | doc/design/roadmap.mdwn | 9 |
3 files changed, 68 insertions, 15 deletions
diff --git a/doc/design/assistant/chunks.mdwn b/doc/design/assistant/chunks.mdwn
index 454f15f9e..a9709a778 100644
--- a/doc/design/assistant/chunks.mdwn
+++ b/doc/design/assistant/chunks.mdwn
@@ -160,17 +160,11 @@ Instead of storing the chunk count in the special remote, store it
 in the git-annex branch.
 
 The location log does not record locations of individual chunk keys
-(too space-inneficient).
-Instead, look at git-annex:aaa/bbb/SHA256-s12345--xxxxxxx.log.cnk to get
-the chunk count and size for a key.
+(too space-inneficient). Instead, look at a chunk log in the
+git-annex branch to get the chunk count and size for a key.
 
-Note that a given remote uuid might have multiple chunk sizes logged, if a
-key was stored on it twice using different chunk sizes. Also note that even
-when this file exists for a key, the object may be stored non-chunked on
-the remote too.
-
-`hasKey` would check if any one (chunksize, chunkcount) is satisfied by
-the files on the remote. It would also check if the non-chunked key is
+`hasKey` would check if any of the logged sets of chunks is
+present on the remote. It would also check if the non-chunked key is
 present, as a fallback.
 
 When dropping a key from the remote, drop all logged chunk sizes.
@@ -185,6 +179,31 @@ remote doesn't know anything about chunk sizes.
 It uses a little more data in the git-annex branch, although with care
 (using the same timestamp as the location log), it can compress pretty well.
 
+## chunk log
+
+Stored in the git-annex branch, this provides a mapping `Key -> [[Key]]`.
+
+Note that a given remote uuid might have multiple sets of chunks (with
+different sizes) logged, if a key was stored on it twice using different
+chunk sizes. Also note that even when the log indicates a key is chunked,
+the object may be stored non-chunked on the remote too.
+
+For fixed size chunks, there's no need to store the list of chunk keys,
+instead the log only records the number of chunks (needed because the size
+of the parent Key may not be known), and the chunk size.
+
+Example:
+
+    1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:10240 9
+
+Later, might want to support other kinds of chunks, for example ones made
+using a rsync-style rolling checksum. It would probably not make sense to
+store the full [Key] list for such chunks in the log. Instead, it might be
+stored in a file on the remote.
+
+To support such future developments, when updating the chunk log,
+git-annex should preserve unparsable values (the part after the colon).
+
 ## chunk then encrypt
 
 Rather than encrypting the whole object 1st and then chunking, chunk and
@@ -239,3 +258,14 @@ checking hasKey.
 Note that this is safe to do only as long as the Key being transferred
 cannot possibly have 2 different contents in different repos. Notably not
 necessarily the case for the URL keys generated for quvi.
+
+Both **done**.
+
+## parallel
+
+If 2 remotes both support chunking, uploading could upload different chunks
+to them in parallel. However, the chunk log does not currently allow
+representing the state where some chunks are on one remote and others on
+another remote.
+
+Parallel downloading of chunks from different remotes is a bit more doable.
diff --git a/doc/design/assistant/deltas.mdwn b/doc/design/assistant/deltas.mdwn
index ff4185a18..0f7d308b8 100644
--- a/doc/design/assistant/deltas.mdwn
+++ b/doc/design/assistant/deltas.mdwn
@@ -4,6 +4,24 @@ One simple way is to find the key of the old version of a file that's
 being transferred, so it can be used as the basis for rsync, or any
 other similar transfer protocol.
 
-For remotes that don't use rsync, a poor man's version could be had by
-chunking each object into multiple parts. Only modified parts need be
-transferred. Sort of sub-keys to the main key being stored.
+For remotes that don't use rsync, use a rolling checksum based chunker,
+such as BuzHash. This will produce [[chunks]], which can be stored on the
+remote as regular Keys -- where unlike the fixed size chunk keys, the
+SHA256 part of these keys is the checksum of the chunk they contain.
+
+Once that's done, it's easy to avoid uploading chunks that have been sent
+to the remote before.
+
+When retriving a new version of a file, there would need to be a way to get
+the list of chunk keys that constitute the new version. Probably best to
+store this list on the remote. Then there needs to be a way to find which
+of those chunks are available in locally present files, so that the locally
+available chunks can be extracted, and combined with the chunks that need
+to be downloaded, to reconstitute the file.
+
+To find which chucks are locally available, here are 2 ideas:
+
+1. Use a single basis file, eg an old version of the file. Re-chunk it, and
+   use its chunks. Slow, but simple.
+2. Some kind of database of locally available chunks. Would need to be kept
+   up-to-date as files are added, and as files are downloaded.
diff --git a/doc/design/roadmap.mdwn b/doc/design/roadmap.mdwn
index 631280828..7a3fa06fe 100644
--- a/doc/design/roadmap.mdwn
+++ b/doc/design/roadmap.mdwn
@@ -14,5 +14,10 @@ Now in the
 * Month 8 [[!traillink git-remote-daemon]]
 * Month 9 Brazil!, [[!traillink assistant/sshpassword]]
 * Month 10 polish [[assistant/Windows]] port
-* **Month 11 [[!traillink assistant/chunks]], [[!traillink assistant/deltas]], [[!traillink assistant/gpgkeys]] (pick 2?)**
-* Month 12 [[!traillink assistant/telehash]]
+* Month 11 [[!traillink assistant/chunks]]
+* **Month 12** user-driven features and polishing
+
+Deferred until later:
+
+* Month XX [[!traillink assistant/deltas]], [[!traillink assistant/gpgkeys]]
+* Month XX [[!traillink assistant/telehash]]
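
Reading aid for the chunk log hunk in chunks.mdwn: a minimal Python sketch of how one log line of the form shown in the example might be parsed, rewritten, and used for the `hasKey` check. It assumes the three whitespace-separated fields are a timestamp, `uuid:chunksize`, and the chunk count, and that the part after the colon contains no whitespace. The names `ChunkLogEntry`, `parse_line`, `format_line` and `has_key`, and the two callbacks, are hypothetical illustrations, not git-annex's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ChunkLogEntry:
    timestamp: str              # kept as text so it round-trips unchanged
    uuid: str                   # uuid of the remote
    method: str                 # the part after the colon, kept verbatim
    chunk_size: Optional[int]   # None when the method is not a plain size
    chunk_count: int


def parse_line(line: str) -> ChunkLogEntry:
    """Parse '<timestamp> <uuid>:<method> <count>' (assumed layout)."""
    timestamp, remote, count = line.split()
    uuid, _, method = remote.partition(":")
    # Only a purely numeric method is understood as a fixed chunk size.
    size = int(method) if method.isdigit() else None
    return ChunkLogEntry(timestamp, uuid, method, size, int(count))


def format_line(entry: ChunkLogEntry) -> str:
    """Write the entry back, preserving a method we could not parse."""
    return f"{entry.timestamp} {entry.uuid}:{entry.method} {entry.chunk_count}"


def has_key(
    entries: Iterable[ChunkLogEntry],
    uuid: str,
    has_chunk_set: Callable[[int, int], bool],  # hypothetical remote probe
    has_unchunked: Callable[[], bool],          # hypothetical remote probe
) -> bool:
    """True if any logged chunk set is present, else try the whole object."""
    for entry in entries:
        if entry.uuid == uuid and entry.chunk_size is not None:
            if has_chunk_set(entry.chunk_size, entry.chunk_count):
                return True
    return has_unchunked()
```

Keeping `method` as an opaque string is what lets a line with an unknown chunking method survive a rewrite of the log unchanged, which is the behaviour the hunk asks for.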
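
Reading aid for the deltas.mdwn hunk: a rough Python sketch of content-defined chunking and of idea 1 (re-chunk a single basis file and reuse its chunks). A simple polynomial rolling hash stands in for BuzHash here, the constants are tiny and purely illustrative, and `split_chunks`, `chunk_id`, `reconstitute` and the `fetch` callback are hypothetical names. Chunks are identified by a bare SHA256 hex digest, not by a real git-annex key.

```python
import hashlib
from typing import Callable, List, Tuple

# Illustrative constants only; a real chunker would use far larger sizes.
WINDOW = 16                 # bytes covered by the rolling hash
MIN_SIZE = 16               # never cut before the window is full
MAX_SIZE = 1024             # force a cut so chunk size stays bounded
MASK = (1 << 6) - 1         # cut when the low 6 bits of the hash are zero
BASE = 257
MOD = (1 << 31) - 1         # prime modulus
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def split_chunks(data: bytes) -> List[bytes]:
    """Cut data at boundaries chosen by the content of the last WINDOW bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Window is full: remove the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_id(chunk: bytes) -> str:
    """Identify a chunk by the SHA256 of its content."""
    return hashlib.sha256(chunk).hexdigest()


def reconstitute(
    chunk_ids: List[str],
    basis: bytes,
    fetch: Callable[[str], bytes],  # hypothetical download of one chunk
) -> Tuple[bytes, int]:
    """Rebuild a file from its chunk list, downloading only missing chunks."""
    local = {chunk_id(c): c for c in split_chunks(basis)}
    parts, downloaded = [], 0
    for cid in chunk_ids:
        if cid in local:
            parts.append(local[cid])
        else:
            parts.append(fetch(cid))
            downloaded += 1
    return b"".join(parts), downloaded
```

Because a cut depends only on the bytes inside the window, an insertion early in a file changes the chunks around the edit while later boundaries tend to line up again, so most chunk ids of the old version stay reusable.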