To avoid leaking even the size of your encrypted files to cloud storage providers, add a mode that stores fixed-size chunks.

May be a useful starting point for [[deltas]].

May also allow for downloading different chunks of a file concurrently from multiple remotes.

# currently

Currently, only the webdav and directory special remotes support chunking.

The filenames used for the chunks make it easy to see which chunks belong together, even when encryption is used. There is also a chunkcount file, which similarly leaks information.

It is not currently possible to enable chunking on a non-chunked remote.

Problem: Two uploads of the same key from repos with different chunk sizes could lead to data loss. For example, suppose A uses a 10 MB chunk size and B uses 20 MB, and the upload speed is the same. If B starts first, then A will overwrite the file B is uploading for the 1st chunk. Then A uploads the second chunk, and once A is done, B finishes the 1st chunk and uploads its second. We now have [chunk 1(from A), chunk 2(from B)].

# new requirements

Every special remote should support chunking. (It does not make sense to support it for git remotes, but gcrypt remotes should support it.)

S3 remotes should chunk by default, because the current S3 backend fails for files past a certain size. See [[bugs/Truncated_file_transferred_via_S3]].

The size of chunks, as well as whether any chunking is done, should be configurable on the fly without invalidating data already stored in the remote. This seems important for usability (eg, so users can turn chunking on in the webapp when configuring an existing remote).

Two concurrent uploaders of the same object to a remote should be safe, even if they're using different chunk sizes.

The old chunk method needs to be supported for back-compat, so keep the chunksize= setting to enable that mode, and add a new setting for the new mode.
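
To make the concurrent-upload hazard from the "currently" section concrete, here is a minimal Python sketch. It is a toy model, not git-annex code: the remote is a dict keyed by chunk number where the last completed write wins, the object is 30 bytes standing in for 30 MB, and the order of completed writes is an assumption chosen to end in the mixed state described there. The names `split_chunks`, `apply_writes` and `reassemble` are hypothetical.

```python
def split_chunks(data, chunksize):
    """Split data into consecutive chunks of at most chunksize bytes."""
    return [data[i:i + chunksize] for i in range(0, len(data), chunksize)]


def apply_writes(writes):
    """Toy remote: maps chunk number to content; the last completed write wins."""
    remote = {}
    for chunks, n in writes:
        remote[n] = chunks[n - 1]
    return remote


def reassemble(remote, chunkcount):
    """Concatenate chunks 1..chunkcount as a downloader would."""
    return b"".join(remote[n] for n in range(1, chunkcount + 1))


DATA = bytes(range(30))       # stand-in for a 30 MB object
A = split_chunks(DATA, 10)    # repo A: three chunks of 10
B = split_chunks(DATA, 20)    # repo B: one chunk of 20, one of 10

# Assumed order of completed writes. B's first chunk never lands,
# because A replaced the file B was still writing.
remote = apply_writes([(A, 1), (A, 2), (A, 3), (B, 2)])

as_b = reassemble(remote, len(B))   # reader trusting B's chunk count
as_a = reassemble(remote, len(A))   # reader trusting A's chunk count
print(as_b == DATA, as_a == DATA)   # False False
```

Whichever chunk count a downloader trusts, the reassembled bytes differ from the original object, which is why the new design has to keep uploads with different chunk sizes from sharing names.
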

# obscuring file sizes

Another goal of chunking could be to hide from a remote any information about the sizes of files. At least two things are needed for this:

1. The filenames used on the remote don't indicate which chunks belong together.

2. The final short chunk needs to be padded with random data, so that a remote sees only encrypted files with uniform sizes and cannot make guesses about the kinds of data being stored.

Note that padding cannot completely hide all information from an attacker who is logging puts or gets. An attacker could, for example, look at the times of puts, and guess at when git-annex has moved on to encrypting/decrypting the next object, and so guess at the approximate sizes of objects. (Concurrent uploads/downloads or random delays could be added to prevent these kinds of attacks.)

And, obviously, if someone stores 10 TB of data in a remote, they probably have around 10 TB of files, so it's probably not a collection of recipes.

Given its inefficiencies and lack of fully obscuring file sizes, padding may not be worth adding, but is considered in the designs below.

# design 1

Add an optional chunk field to Key. It is only present for chunk 2 and above. Ie, SHA256-s12345--xxxxxxx is the first chunk (or whole object), while SHA256-s12345-c2--xxxxxxx is the second chunk.

On an encrypted remote, Keys are generated with the chunk field, and then HMAC encrypted.

Note that only using it for chunks 2+ means that git-annex can start by requesting the regular key, so an observer sees the same request whether chunked or not, and does not see eg, a pattern of failed requests for a non-chunked key, followed by successful requests for the chunked keys. (Both more efficient and perhaps more secure.)

Problem: This makes putting chunks easy. But there is a problem when getting an object that has been chunked. If the key size is not known, we cannot tell when we've gotten the last chunk. (Also, we cannot strip padding.)
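
The design 1 naming rule and the "unknown size" problem can be sketched in a few lines of Python. This is only an illustration: `chunk_key` and `expected_chunks` are hypothetical helpers, and the sketch assumes the chunk field is inserted just before the `--` separator, as in the two example keys above.

```python
def chunk_key(key, n):
    """Design 1 naming: chunk 1 keeps the plain key, chunk n >= 2 gains a -cN field."""
    if n < 2:
        return key
    prefix, sep, rest = key.partition("--")
    return f"{prefix}-c{n}{sep}{rest}"


def expected_chunks(size, chunksize):
    """Number of chunks to fetch, or None when the key carries no size."""
    if size is None:
        return None  # cannot tell which chunk is the last one
    # Integer ceiling division; an empty object still occupies one chunk.
    return max(1, -(-size // chunksize))


print(chunk_key("SHA256-s12345--xxxxxxx", 1))  # SHA256-s12345--xxxxxxx
print(chunk_key("SHA256-s12345--xxxxxxx", 2))  # SHA256-s12345-c2--xxxxxxx
print(expected_chunks(12345, 5000))            # 3
print(expected_chunks(None, 5000))             # None
```

With a known size the downloader can stop after the computed number of chunks. With no size it has nothing to compute from, which is the gap described above.
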
Note that `addurl` sometimes generates keys w/o size info (particularly, it does so by design when using quvi).

Problem: Also, this makes `hasKey` hard to implement: How can it know if all the chunks are present, if the key size is not known?

Problem: Also, this makes it difficult to download encrypted keys, because we only know the decrypted size, not the encrypted size, so we can't be sure how many chunks to get, and all chunks need to be downloaded before we can decrypt any of them. (Assuming we encrypt first; chunking first avoids this problem.)

Problem: Does not solve concurrent uploads with different chunk sizes.

# design 2

When chunking is enabled, always put a chunk number in the Key, along with the chunk size. So, SHA256-s10000-c1--xxxxxxx for the first chunk of 1 megabyte.

Before any chunks are stored, write a chunkcount file, eg SHA256-s12345-c0--xxxxxxx. Note that this key is the same as the original object's key, except with chunk number set to 0. This file contains both the number of chunks, and also the chunk size used.

`hasKey` downloads this file, and then verifies that each chunk is present, looking for keys with the expected chunk numbers and chunk size.

This avoids problems with multiple writers using different chunk sizes, since they will be uploading to different files.

Problem: In such a situation, some duplicate data might be stored, not referenced by the last chunkcount file to be written. It would not be dropped when the key was removed from the remote.

Note: This design lets an attacker with logs tell the (approximate) size of objects, by finding the small files that contain a chunk count, and correlating when that is written/read and when other files are written/read. That could be solved by padding the chunkcount key up to the size of the rest of the keys, but that's very inefficient; `hasKey` is not designed to need to download large files.

# design 3

Like design 1, but add an encrypted chunk count prefix to the first object.
This needs to be done in a way that does not let an attacker tell if the object has an encrypted chunk count prefix or not.

This seems difficult; an attacker could probably tell where the first encrypted part stops and the next encrypted part starts by looking for gpg headers, and so tell which files are the first chunks.

Also, `hasKey` would need to download some or all of the first file. If all, that's a lot more expensive. If only some is downloaded, an attacker can guess that the file that was partially downloaded is the first chunk in a series, and wait for a time when it's fully downloaded to determine which are the other chunks.

Problem: Two uploads of the same key from repos with different chunk sizes could lead to data loss. (Same as in design 1.)

# design 4

Use key SHA256-s10000-c1--xxxxxxx for the first chunk of 1 megabyte.

Instead of storing the chunk count in the special remote, store it in the git-annex branch.

Look at git-annex:aaa/bbb/SHA256-s12345--xxxxxxx.log.cnk to get the chunk count and size. File format would be:

	ts uuid chunksize chunkcount 0|1

Where a trailing 0 means that chunk size is no longer present on the remote, and a trailing 1 means it is. For future expansion, any value other than "0" is also accepted, meaning the chunk is present. For example, this could be used for [[deltas]], storing the checksums of the chunks.

Note that a given remote uuid might have multiple lines, if a key was stored on it twice using different chunk sizes. Also note that even when this file exists for a key, the object may be stored non-chunked on the remote too.

`hasKey` would check if any one (chunksize, chunkcount) is satisfied by the files on the remote. It would also check if the non-chunked key is present, as a fallback.

When dropping a key from the remote, drop all logged chunk sizes. (Also drop any non-chunked key.)
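
A rough Python sketch of how the chunk log and the `hasKey` check might fit together follows. It is not git-annex code, and it leans on several assumptions: fields are whitespace separated, lines appear in chronological order so a later line for the same (uuid, chunksize, chunkcount) replaces an earlier one, and the remote is probed through two hypothetical callbacks, `chunk_exists(chunksize, n)` and `whole_exists()`, so that no particular chunk key naming is implied.

```python
from typing import NamedTuple


class ChunkLogLine(NamedTuple):
    ts: str
    uuid: str
    chunksize: int
    chunkcount: int
    present: bool


def parse_chunk_log(text):
    """Parse 'ts uuid chunksize chunkcount status' lines; malformed lines are skipped."""
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) != 5:
            continue
        ts, uuid, chunksize, chunkcount, status = fields
        # Any status other than "0" counts as present (future expansion).
        entries.append(ChunkLogLine(ts, uuid, int(chunksize),
                                    int(chunkcount), status.strip() != "0"))
    return entries


def has_key(entries, uuid, chunk_exists, whole_exists):
    """True if any logged (chunksize, chunkcount) is fully on the remote,
    or if the non-chunked object is there as a fallback."""
    latest = {}
    for e in entries:
        if e.uuid == uuid:
            # Assumption: later lines supersede earlier ones.
            latest[(e.chunksize, e.chunkcount)] = e
    for e in latest.values():
        if e.present and e.chunkcount > 0 and all(
                chunk_exists(e.chunksize, n)
                for n in range(1, e.chunkcount + 1)):
            return True
    return whole_exists()
```

For example, with a log recording chunk sizes 10 (present) and 20 (later marked 0) for one uuid, `has_key` only probes the three chunks of size 10, and falls back to the whole-object check if any of them is missing.
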
As long as the location log and the chunk log are committed atomically, dropping every logged chunk size guarantees that no orphaned chunks end up on a remote (except any that might be left by interrupted uploads).

This has the best security of the designs so far, because the special remote doesn't know anything about chunk sizes. It uses a little more data in the git-annex branch, although with care (using the same timestamp as the location log), it can compress pretty well.

## chunk then encrypt

Rather than encrypting the whole object 1st and then chunking, chunk and then encrypt.

Reasons:

1. If 2 repos are uploading the same key to a remote concurrently, this allows some chunks to come from one and some from another, and be reassembled without problems.

2. Also allows chunks of the same object to be downloaded from different remotes, perhaps concurrently, and again be reassembled without problems.

3. Prevents an attacker from re-assembling the chunked file using details of the gpg output, which would expose approximate file size even if padding is being used to obscure it.

Note that this means that the chunks won't exactly match the configured chunk size. gpg does compression, which might make them a lot smaller. Or gpg overhead could make them slightly larger. So `hasKey` cannot check exact file sizes.

If padding is enabled, gpg compression should be disabled, to not leak clues about how well the files compress and so what kind of files they are.