summaryrefslogtreecommitdiff
path: root/doc/design
diff options
context:
space:
mode:
Diffstat (limited to 'doc/design')
-rw-r--r--doc/design/assistant/chunks.mdwn50
-rw-r--r--doc/design/assistant/deltas.mdwn24
-rw-r--r--doc/design/roadmap.mdwn9
3 files changed, 68 insertions, 15 deletions
diff --git a/doc/design/assistant/chunks.mdwn b/doc/design/assistant/chunks.mdwn
index 454f15f9e..a9709a778 100644
--- a/doc/design/assistant/chunks.mdwn
+++ b/doc/design/assistant/chunks.mdwn
@@ -160,17 +160,11 @@ Instead of storing the chunk count in the special remote, store it in
the git-annex branch.
The location log does not record locations of individual chunk keys
-(too space-inneficient).
-Instead, look at git-annex:aaa/bbb/SHA256-s12345--xxxxxxx.log.cnk to get
-the chunk count and size for a key.
+(too space-inneficient). Instead, look at a chunk log in the
+git-annex branch to get the chunk count and size for a key.
-Note that a given remote uuid might have multiple chunk sizes logged, if a
-key was stored on it twice using different chunk sizes. Also note that even
-when this file exists for a key, the object may be stored non-chunked on
-the remote too.
-
-`hasKey` would check if any one (chunksize, chunkcount) is satisfied by
-the files on the remote. It would also check if the non-chunked key is
+`hasKey` would check if any of the logged sets of chunks is
+present on the remote. It would also check if the non-chunked key is
present, as a fallback.
When dropping a key from the remote, drop all logged chunk sizes.
@@ -185,6 +179,31 @@ remote doesn't know anything about chunk sizes. It uses a little more
data in the git-annex branch, although with care (using the same timestamp
as the location log), it can compress pretty well.
+## chunk log
+
+Stored in the git-annex branch, this provides a mapping `Key -> [[Key]]`.
+
+Note that a given remote uuid might have multiple sets of chunks (with
+different sizes) logged, if a key was stored on it twice using different
+chunk sizes. Also note that even when the log indicates a key is chunked,
+the object may be stored non-chunked on the remote too.
+
+For fixed size chunks, there's no need to store the list of chunk keys,
+instead the log only records the number of chunks (needed because the size
+of the parent Key may not be known), and the chunk size.
+
+Example:
+
+ 1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:10240 9
+
+Later, might want to support other kinds of chunks, for example ones made
+using a rsync-style rolling checksum. It would probably not make sense to
+store the full [Key] list for such chunks in the log. Instead, it might be
+stored in a file on the remote.
+
+To support such future developments, when updating the chunk log,
+git-annex should preserve unparsable values (the part after the colon).
+
## chunk then encrypt
Rather than encrypting the whole object 1st and then chunking, chunk and
@@ -239,3 +258,14 @@ checking hasKey.
Note that this is safe to do only as long as the Key being transferred
cannot possibly have 2 different contents in different repos. Notably not
necessarily the case for the URL keys generated for quvi.
+
+Both **done**.
+
+## parallel
+
+If 2 remotes both support chunking, uploading could upload different chunks
+to them in parallel. However, the chunk log does not currently allow
+representing the state where some chunks are on one remote and others on
+another remote.
+
+Parallel downloading of chunks from different remotes is a bit more doable.
diff --git a/doc/design/assistant/deltas.mdwn b/doc/design/assistant/deltas.mdwn
index ff4185a18..0f7d308b8 100644
--- a/doc/design/assistant/deltas.mdwn
+++ b/doc/design/assistant/deltas.mdwn
@@ -4,6 +4,24 @@ One simple way is to find the key of the old version of a file that's
being transferred, so it can be used as the basis for rsync, or any
other similar transfer protocol.
-For remotes that don't use rsync, a poor man's version could be had by
-chunking each object into multiple parts. Only modified parts need be
-transferred. Sort of sub-keys to the main key being stored.
+For remotes that don't use rsync, use a rolling checksum based chunker,
+such as BuzHash. This will produce [[chunks]], which can be stored on the
+remote as regular Keys -- where unlike the fixed size chunk keys, the
+SHA256 part of these keys is the checksum of the chunk they contain.
+
+Once that's done, it's easy to avoid uploading chunks that have been sent
+to the remote before.
+
+When retriving a new version of a file, there would need to be a way to get
+the list of chunk keys that constitute the new version. Probably best to
+store this list on the remote. Then there needs to be a way to find which
+of those chunks are available in locally present files, so that the locally
+available chunks can be extracted, and combined with the chunks that need
+to be downloaded, to reconstitute the file.
+
+To find which chucks are locally available, here are 2 ideas:
+
+1. Use a single basis file, eg an old version of the file. Re-chunk it, and
+ use its chunks. Slow, but simple.
+2. Some kind of database of locally available chunks. Would need to be kept
+ up-to-date as files are added, and as files are downloaded.
diff --git a/doc/design/roadmap.mdwn b/doc/design/roadmap.mdwn
index 631280828..7a3fa06fe 100644
--- a/doc/design/roadmap.mdwn
+++ b/doc/design/roadmap.mdwn
@@ -14,5 +14,10 @@ Now in the
* Month 8 [[!traillink git-remote-daemon]]
* Month 9 Brazil!, [[!traillink assistant/sshpassword]]
* Month 10 polish [[assistant/Windows]] port
-* **Month 11 [[!traillink assistant/chunks]], [[!traillink assistant/deltas]], [[!traillink assistant/gpgkeys]] (pick 2?)**
-* Month 12 [[!traillink assistant/telehash]]
+* Month 11 [[!traillink assistant/chunks]]
+* **Month 12** user-driven features and polishing
+
+Deferred until later:
+
+* Month XX [[!traillink assistant/deltas]], [[!traillink assistant/gpgkeys]]
+* Month XX [[!traillink assistant/telehash]]