May be a useful starting point for [[deltas]].

May also allow for downloading different chunks of a file concurrently
from multiple remotes.

# currently

Currently, only the webdav and directory special remotes support chunking.

The filenames used for the chunks make it easy to see which chunks
belong together, even when encryption is used. There is also a chunkcount
file, which similarly leaks information.

It is not currently possible to enable chunking on a non-chunked remote.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. For example, suppose A uses a 10 mb chunk size,
B uses 20 mb, and the upload speed is the same. If B starts first, A will
then overwrite the file B is uploading for the 1st chunk. Then A uploads
the second chunk, and once A is done, B finishes the 1st chunk and uploads
its second. We now have chunk 1 (from A) and chunk 2 (from B); since the
chunk sizes differ, the stored object is corrupt.

This needs to be supported for back-compat, so keep the chunksize= setting
to enable that mode, and add a new setting for the new mode.

# new requirements

Every special remote should support chunking. (It does not make sense
to support it for git remotes, but gcrypt remotes should support it.)

S3 remotes should chunk by default, because the current S3 backend fails
for files past a certain size. See [[bugs/]].

The size of chunks, as well as whether any chunking is done, should be
configurable on the fly without invalidating data already stored in the
remote. This seems important for usability (eg, so users can turn chunking
on in the webapp when configuring an existing remote).

Two concurrent uploaders of the same object to a remote should be safe,
even if they're using different chunk sizes.

# obscuring file sizes

Hiding all information about the sizes of files from a remote could be
another goal of chunking. At least two things are needed for this:

1. The filenames used on the remote don't indicate which chunks belong
   together.

2. The final short chunk needs to be padded with random data,
   so that a remote sees only encrypted files with uniform sizes
   and cannot make guesses about the kinds of data being stored
   (see the sketch below).

Note that encrypting the whole file and then chunking and padding it is
not good, because the remote can probably examine files and tell when a
gpg stream has been cut into pieces, even without the key. (Have not
verified this, but it seems likely; certainly gpg magic numbers can
identify gpg encrypted files, so a file that's encrypted but lacks the
magic is not the first chunk..)

Note that padding cannot completely hide all information from an attacker
who is logging puts or gets. An attacker could, for example, look at the
times of puts, and guess at when git-annex has moved on to
encrypting/decrypting the next object, and so guess at the approximate
sizes of objects. (Concurrent uploads/downloads or random delays could be
added to prevent these kinds of attacks.)

And, obviously, if someone stores 10 tb of data in a remote, they probably
have around 10 tb of files, so it's probably not a collection of recipes..
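
For illustration, a minimal sketch of the padding from point 2 above,
assuming a fixed uniform chunk size. `padChunk` is a hypothetical helper,
not existing git-annex code; the key's size field is what would allow the
padding to be stripped again after download and decryption.

    import Control.Monad (replicateM)
    import qualified Data.ByteString as B
    import System.Random (randomIO)

    -- Pad a short final chunk up to the uniform chunk size with random
    -- bytes, so the remote only ever sees files of one size.
    padChunk :: Int -> B.ByteString -> IO B.ByteString
    padChunk chunkSize b
        | B.length b >= chunkSize = return b
        | otherwise = do
            padding <- replicateM (chunkSize - B.length b) randomIO
            return (b <> B.pack padding)
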
Given its inefficiencies, and since it does not fully obscure file sizes,
padding may not be worth adding.

# design 1

Add an optional chunk field to Key. It is only present for chunk
2 and above. Ie, SHA256-s12345--xxxxxxx is the first chunk (or whole
object), while SHA256-s12345-c2--xxxxxxx is the second chunk.

On an encrypted remote, Keys are generated with the chunk field, and then
HMAC encrypted.

Note that only using it for chunks 2+ means that git-annex can start by
requesting the regular key, so an observer sees the same request whether
chunked or not, and does not see, eg, a pattern of failed requests for
a non-chunked key followed by successful requests for the chunked keys.
(Both more efficient and perhaps more secure.)

Problem: This makes putting chunks easy. But there is a problem when
getting an object that has been chunked. If the key size is not known, we
cannot tell when we've gotten the last chunk. (Also, we cannot strip
padding.) Note that `addurl` sometimes generates keys without size info
(particularly, it does so by design when using quvi).

Problem: Also, this makes `hasKey` hard to implement: How can it know if
all the chunks are present, if the key size is not known?

Problem: Also, this makes it difficult to download encrypted keys, because
we only know the decrypted size, not the encrypted size, so we can't
be sure how many chunks to get, and all chunks need to be downloaded before
we can decrypt any of them.

Problem: Does not solve concurrent uploads with different chunk sizes.

# design 2

When chunking is enabled, always put a chunk number in the Key,
along with the chunk size.
So, SHA256-s10000-c1--xxxxxxx for the first chunk of 1 megabyte.

Before any chunks are stored, write a chunkcount file, eg
SHA256-s12345-c0--xxxxxxx. Note that this key is the same as the original
object's key, except with chunk number set to 0. This file contains both
the number of chunks, and also the chunk size used. `hasKey` downloads this
file, and then verifies that each chunk is present, looking for keys with
the expected chunk numbers and chunk size.

This avoids problems with multiple writers using different chunk sizes,
since they will be uploading to different files.

Problem: In such a situation, some duplicate data might be stored, not
referenced by the last chunkcount file to be written. It would not be
dropped when the key was removed from the remote.

Note: This design lets an attacker with logs tell the (approximate) size
of objects, by finding the small files that contain a chunk count, and
correlating when that is written/read and when other files are
written/read. That could be solved by padding the chunkcount key up to the
size of the rest of the keys, but that's very inefficient; `hasKey` is not
designed to need to download large files.
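
To make design 2 concrete, here is a minimal sketch of the key naming and
the `hasKey` check, using a simplified stand-in for git-annex's Key type.
`checkPresent` and `readChunkCount` are assumed remote primitives, not
existing functions, and the encoding of the chunk size in each chunk key
is elided.

    -- Simplified stand-in for the Key type, with design 2's chunk
    -- number field added.
    data Key = Key
        { keyBackend :: String
        , keySize    :: Integer
        , keyChunk   :: Maybe Integer  -- Nothing = unchunked; Just 0 = chunkcount file
        , keyHash    :: String
        }

    -- Render a key name like SHA256-s10000-c1--xxxxxxx.
    formatKey :: Key -> String
    formatKey (Key b s mc h) =
        b ++ "-s" ++ show s ++ maybe "" (\c -> "-c" ++ show c) mc ++ "--" ++ h

    -- Download the chunkcount file (chunk 0), then verify that every
    -- numbered chunk is present on the remote.
    hasKey :: (Key -> IO Bool)     -- does a single file exist on the remote?
           -> (Key -> IO Integer)  -- fetch and parse the chunkcount file
           -> Key -> IO Bool
    hasKey checkPresent readChunkCount k = do
        let countkey = k { keyChunk = Just 0 }
        havecount <- checkPresent countkey
        if not havecount
            then return False
            else do
                n <- readChunkCount countkey
                and <$> mapM (\i -> checkPresent k { keyChunk = Just i }) [1 .. n]
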
# design 3

Like design 1, but add an encrypted chunk count prefix to the first object.
This needs to be done in a way that does not let an attacker tell if the
object has an encrypted chunk count prefix or not.

This seems difficult; an attacker could probably tell where the first
encrypted part stops and the next encrypted part starts by looking for gpg
headers, and so tell which files are the first chunks.

Also, `hasKey` would need to download some or all of the first file.
If all, that's a lot more expensive. If only some is downloaded, an
attacker can guess that the file that was partially downloaded is the
first chunk in a series, and wait for a time when it's fully downloaded to
determine which are the other chunks.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. (Same as in design 2.)

# design 4

Instead of storing the chunk count in the special remote, store it in
the git-annex branch.

So, use key SHA256-s10000-c1--xxxxxxx for the first chunk of 1 megabyte.

And look at git-annex:aaa/bbb/SHA256-s12345--xxxxxxx.log.cnk to get the
chunk count and size. File format would be:

    ts uuid chunksize chunkcount

Note that a given remote uuid might have multiple lines, if a key was
stored on it twice using different chunk sizes. Also note that even when
this file exists for a key, the object may be stored non-chunked on the
remote too.

`hasKey` would check if any one (chunksize, chunkcount) is satisfied by
the files on the remote. It would also check if the non-chunked key is
present.

When dropping a key from the remote, drop all logged chunk sizes.
As long as the location log and the new log are committed atomically,
this guarantees that no orphaned chunks end up on a remote
(except any that might be left by interrupted uploads).
(Also drop any non-chunked key.)

This has the best security of the designs so far, because the special
remote doesn't know anything about chunk sizes. It uses a little more
data in the git-annex branch, although with care (using the same timestamp
as the location log), it can compress pretty well.
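
A small sketch of parsing the proposed .log.cnk lines and of the `hasKey`
check described above. All names here are hypothetical; `checkPresent`
and `chunkKey` stand in for the remote's primitives, and `logs` is assumed
to hold the lines already filtered to the remote's uuid.

    import Text.Read (readMaybe)

    -- One line of the proposed .log.cnk file: "ts uuid chunksize chunkcount".
    data ChunkLog = ChunkLog
        { ts         :: Double   -- timestamp, as in other branch logs
        , remoteUUID :: String
        , chunkSize  :: Integer
        , chunkCount :: Integer
        }

    parseChunkLine :: String -> Maybe ChunkLog
    parseChunkLine l = case words l of
        [t, u, sz, n] ->
            ChunkLog <$> readMaybe t <*> Just u <*> readMaybe sz <*> readMaybe n
        _ -> Nothing

    -- The key is present if the non-chunked key is on the remote, or if
    -- all chunks for any one logged (chunksize, chunkcount) pair are.
    hasKey :: (String -> IO Bool)             -- checkPresent on the remote
           -> (Integer -> Integer -> String)  -- chunkKey chunksize n
           -> String                          -- non-chunked key name
           -> [ChunkLog] -> IO Bool
    hasKey checkPresent chunkKey plain logs = do
        whole <- checkPresent plain
        if whole
            then return True
            else anyPresent logs
      where
        anyPresent [] = return False
        anyPresent (cl:rest) = do
            ok <- mapM (checkPresent . chunkKey (chunkSize cl))
                       [1 .. chunkCount cl]
            if and ok then return True else anyPresent rest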