-rw-r--r--  doc/design/assistant/deltas.mdwn | 24 +++++++++++++++++++---
-rw-r--r--  doc/devblog/day_206__zap.mdwn    | 83 +++++++++++++++++++++++
2 files changed, 104 insertions(+), 3 deletions(-)
diff --git a/doc/design/assistant/deltas.mdwn b/doc/design/assistant/deltas.mdwn
index ff4185a18..0f7d308b8 100644
--- a/doc/design/assistant/deltas.mdwn
+++ b/doc/design/assistant/deltas.mdwn
@@ -4,6 +4,24 @@
 One simple way is to find the key of the old version of a file that's
 being transferred, so it can be used as the basis for rsync, or any other
 similar transfer protocol.
-For remotes that don't use rsync, a poor man's version could be had by
-chunking each object into multiple parts. Only modified parts need be
-transferred. Sort of sub-keys to the main key being stored.
+For remotes that don't use rsync, use a rolling checksum based chunker,
+such as BuzHash. This will produce [[chunks]], which can be stored on the
+remote as regular Keys -- where, unlike the fixed size chunk keys, the
+SHA256 part of these keys is the checksum of the chunk they contain.
+
+Once that's done, it's easy to avoid uploading chunks that have been sent
+to the remote before.
+
+When retrieving a new version of a file, there would need to be a way to
+get the list of chunk keys that constitute the new version. Probably best
+to store this list on the remote. Then there needs to be a way to find
+which of those chunks are available in locally present files, so that the
+locally available chunks can be extracted, and combined with the chunks
+that need to be downloaded, to reconstitute the file.
+
+To find which chunks are locally available, here are 2 ideas:
+
+1. Use a single basis file, eg an old version of the file. Re-chunk it,
+   and use its chunks. Slow, but simple.
+2. Some kind of database of locally available chunks. Would need to be
+   kept up-to-date as files are added, and as files are downloaded.
diff --git a/doc/devblog/day_206__zap.mdwn b/doc/devblog/day_206__zap.mdwn
new file mode 100644
index 000000000..eccee2464
--- /dev/null
+++ b/doc/devblog/day_206__zap.mdwn
@@ -0,0 +1,83 @@
+Zap! ...
+My internet gateway was
+[destroyed by lightning](https://identi.ca/joeyh/note/xogvXTFDR9CZaCPsmKZipA).
+Limping along regardless, and a replacement is ordered.
+
+Got resuming of uploads to chunked remotes working. Easy!
+
+----
+
+Next I want to convert the external special remotes to have these nice
+new features. But there is a wrinkle: The new chunking interface works
+entirely on ByteStrings containing the content, but the external special
+remote interface passes content around in files.
+
+I could just make it write the ByteString to a temp file, and pass the
+temp file to the external special remote to store. But then, when
+chunking is not being used, it would pointlessly read a file's content,
+only to write it back out to a temp file.
+
+Similarly, when retrieving a key, the external special remote saves it to
+a file. But we want a ByteString. Except, when not doing chunking or
+encryption, letting the external special remote save the content directly
+to a file is optimal.
+
+One approach would be to change the protocol for external special
+remotes, so that the content is sent over the protocol rather than in
+temp files. But I think this would not be ideal for some kinds of
+external special remotes, and it would probably be quite a lot slower and
+more complicated.
+
+Instead, I am playing around with some type class trickery:
+
+[[!format haskell """
+{-# LANGUAGE Rank2Types, TypeSynonymInstances, FlexibleInstances, MultiParamTypeClasses #-}
+
+type Storer p = Key -> p -> MeterUpdate -> IO Bool
+
+-- For Storers that want to be provided with a file to store.
+type FileStorer a = Storer (ContentPipe a FilePath)
+
+-- For Storers that want to be provided with a ByteString to store.
+type ByteStringStorer a = Storer (ContentPipe a L.ByteString)
+
+class ContentPipe src dest where
+	contentPipe :: src -> (dest -> IO a) -> IO a
+
+instance ContentPipe L.ByteString L.ByteString where
+	contentPipe b a = a b
+
+-- This feels a lot like I could perhaps use pipes or conduit...
+instance ContentPipe FilePath FilePath where
+	contentPipe f a = a f
+
+instance ContentPipe L.ByteString FilePath where
+	contentPipe b a = withTmpFile "tmpXXXXXX" $ \f h -> do
+		L.hPut h b
+		hClose h
+		a f
+
+instance ContentPipe FilePath L.ByteString where
+	contentPipe f a = a =<< L.readFile f
+"""]]
+
+The external special remote would be a FileStorer, so when a non-chunked,
+non-encrypted file is provided, it just runs on the FilePath with no
+extra work. When a ByteString is provided, it's written out to a temp
+file, and the temp file is provided. And many other special remotes are
+ByteStringStorers, so they will just pass a provided ByteString through,
+or read in the content of a provided file.
+
+I think that would work. Though it is not optimal for external special
+remotes that are chunked but not encrypted. For that case, it might be
+worth extending the special remote protocol with a way to say "store a
+chunk of this file from byte N to byte M".
+
+---
+
+Also, talked with ion about what would be involved in using rolling
+checksum based chunks. That would allow for rsync or zsync like behavior,
+where, when a file changes, git-annex uploads only the chunks that
+changed, and the unchanged chunks are reused.
+
+I am not ready to work on that yet, but I made some changes to the
+parsing of the chunk log, so that additional chunking schemes like this
+can be added to git-annex later without breaking backwards compatibility.
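As a sanity check of the ContentPipe trickery in the devblog entry above, here is a self-contained sketch that compiles on its own. It drops the git-annex specific pieces (`Key`, `MeterUpdate`, `Storer`, `withTmpFile`) and substitutes base's `openTempFile`, so everything beyond the `ContentPipe` class itself is a stand-in, not git-annex's actual code:

```haskell
{-# LANGUAGE TypeSynonymInstances, FlexibleInstances, MultiParamTypeClasses #-}

import qualified Data.ByteString.Lazy.Char8 as L
import System.Directory (removeFile)
import System.IO (hClose, openTempFile)

class ContentPipe src dest where
	contentPipe :: src -> (dest -> IO a) -> IO a

-- The two "already in the right form" cases just pass the value along.
instance ContentPipe L.ByteString L.ByteString where
	contentPipe b a = a b

instance ContentPipe FilePath FilePath where
	contentPipe f a = a f

-- A ByteString headed for a file-based consumer is spooled to a temp
-- file, which is cleaned up after the consumer runs.
instance ContentPipe L.ByteString FilePath where
	contentPipe b a = do
		(f, h) <- openTempFile "." "tmp"
		L.hPut h b
		hClose h
		r <- a f
		removeFile f
		return r

-- A file headed for a ByteString-based consumer is read in lazily.
instance ContentPipe FilePath L.ByteString where
	contentPipe f a = a =<< L.readFile f

main :: IO ()
main = do
	-- A file-based consumer is handed a ByteString: it transparently
	-- receives a temp file containing that content.
	n <- contentPipe (L.pack "hello chunk") $ \f -> do
		s <- L.readFile (f :: FilePath)
		let len = L.length s
		len `seq` return len
	print n
```

The key property is the one the post relies on: the FilePath-to-FilePath and ByteString-to-ByteString instances do no extra work at all, and conversion cost is only paid when the forms actually differ.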
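And for the rolling checksum chunking discussed at the end of the entry (and in the deltas design doc), here is a toy sketch of content-defined chunking. The shift-and-add hash below is a simplified stand-in for a real BuzHash, and the mask and minimum chunk size are made-up parameters; the point is just that chunk boundaries depend only on nearby bytes, so an edit early in a file leaves later boundaries, and thus later chunk keys, unchanged:

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.&.))
import Data.Word (Word32)

-- Split input at content-defined boundaries. A cut is made wherever the
-- low bits of a rolling hash are all zero (so the average chunk size is
-- roughly mask+1 bytes), but never before minlen bytes (minlen >= 1).
chunks :: Word32 -> Int -> B.ByteString -> [B.ByteString]
chunks mask minlen = go
  where
	go bs
		| B.null bs = []
		| otherwise = let (c, rest) = B.splitAt (cutpoint bs) bs
		              in c : go rest
	cutpoint bs = search 0 0
	  where
		search h i
			| i >= B.length bs = B.length bs
			| i >= minlen && h .&. mask == 0 = i
			| otherwise =
				-- shift-and-add "rolling" hash: bytes more than 32
				-- positions back have been shifted out of the Word32
				search ((h `shiftL` 1) + fromIntegral (B.index bs i)) (i + 1)

main :: IO ()
main = do
	-- 100 zero bytes: the hash stays 0, so a cut happens as soon as the
	-- minimum length allows, giving six 16-byte chunks and a 4-byte tail.
	let dat = B.replicate 100 0
	    cs  = chunks 0x3F 16 dat
	print (map B.length cs)
	-- concatenating the chunks always reconstitutes the input
	print (B.concat cs == dat)
```

This is what makes the "avoid uploading chunks that have been sent before" step in the design doc cheap: chunk keys are checksums of content-defined pieces, so unchanged regions of a file keep producing the same keys.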