-rw-r--r-- | doc/bugs/S3_bucket_uses_the_same_key_for_encryption_and_hashing.mdwn | 3
-rw-r--r-- | doc/design.mdwn | 4
-rw-r--r-- | doc/design/encryption.mdwn | 108
-rw-r--r-- | doc/special_remotes/Amazon_S3.mdwn | 17
4 files changed, 116 insertions, 16 deletions
diff --git a/doc/bugs/S3_bucket_uses_the_same_key_for_encryption_and_hashing.mdwn b/doc/bugs/S3_bucket_uses_the_same_key_for_encryption_and_hashing.mdwn
index 0ec66652e..1980a8f44 100644
--- a/doc/bugs/S3_bucket_uses_the_same_key_for_encryption_and_hashing.mdwn
+++ b/doc/bugs/S3_bucket_uses_the_same_key_for_encryption_and_hashing.mdwn
@@ -3,3 +3,6 @@ While using HMAC instead of "plain" hash functions is inherently more secure, it
 Also, ttbomk, HMAC needs two keys, not one. Are you re-using the same key twice?
 
 Compatibility for old buckets and support for different ones can be maintained by introducing a new option and simply copying over the encryption key's identifier into this new option should it be missing.
+
+> See [[design/encryption]]. I don't think this bug needs to be kept
+> open. [[done]] --[[Joey]]
diff --git a/doc/design.mdwn b/doc/design.mdwn
new file mode 100644
index 000000000..dc66d5c80
--- /dev/null
+++ b/doc/design.mdwn
@@ -0,0 +1,4 @@
+git-annex's high-level design is mostly inherent in the data that it
+stores in git, and alongside git. See [[internals]] for details.
+
+See [[encryption]] for design of encryption elements.
diff --git a/doc/design/encryption.mdwn b/doc/design/encryption.mdwn
new file mode 100644
index 000000000..003336dd3
--- /dev/null
+++ b/doc/design/encryption.mdwn
@@ -0,0 +1,108 @@
+git-annex mostly does not use encryption. Anyone with access to a git
+repository can see all the filenames in it, its history, and can access
+any annexed file contents.
+
+Encryption is needed when using [[special_remotes]] like Amazon S3, where
+file content is sent to an untrusted party who does not have access to the
+git repository.
+
+Such an encrypted remote uses strong encryption on the contents of files,
+as well as the filenames. The size of the encrypted files, and access
+patterns of the data, should be the only clues to what type of data is stored in
+such a remote.
+
+## encryption backends
+
+It makes sense to support multiple encryption backends. So, there
+should be a way to tell what backend is responsible for a given filename
+in an encrypted remote. (And since special remotes can also store files
+unencrypted, differentiate from those as well.)
+
+At a high level, an encryption backend needs to support these operations:
+
+* Given a key/value backend key, produce and return an encrypted key.
+
+  The same naming scheme git-annex uses for keys in regular key/value
+  [[backends]] can be used. So a filename for a key might be
+  "GPG-s12345--armoureddatahere"
+
+* Given a streaming source of file content, encrypt it, and send it in
+  a stream to an action that consumes the encrypted content.
+
+* Given a streaming source of encrypted content, decrypt it, and send
+  it in a stream to an action that consumes the decrypted content.
+
+* Initialize itself.
+
+* Clean up.
+
+* Configure an encryption key to use.
+
+The rest of this page will describe a single encryption backend using GPG.
+Probably only one will be needed, but who knows? Maybe that backend will
+turn out badly designed, or some other encryptor needed. Designing
+with more than one encryption backend in mind helps future-proofing.
+
+## encryption key management
+
+[[!template id=note text="""
+The basis of this scheme was originally developed by Lars Wirzenius et al
+[for Obnam](http://braawi.org/obnam/encryption/).
+"""]]
+
+Data is encrypted by gpg, using a symmetric cipher. The passphrase of the
+cipher is itself checked into your git repository, encrypted using one or
+more gpg public keys. This scheme allows new gpg private keys to be given
+access to content that has already been stored in the remote.
+
+Different encrypted remotes need to be able to each use different ciphers.
+There does not seem to be a benefit to allowing multiple ciphers to be
+used within a single remote, and it would add a lot of complexity.
+Instead, if you want a new cipher, create a new S3 bucket, or whatever.
+There does not seem to be much benefit to using the same cipher for
+two different encrypted remotes.
+
+So, the encrypted cipher could just be stored with the rest of a remote's
+configuration in `.git-annex/remotes.log` (see [[internals]]). When `git
+annex initremote` makes a remote, it can generate a random symmetric
+cipher, and encrypt it with the specified gpg key. To allow another gpg
+public key access, update the encrypted cipher to be encrypted to both gpg
+keys.
+
+## filename enumeration
+
+If the names of files are encrypted, this makes it harder for
+git-annex (let alone untrusted third parties!) to get a list
+of the files that are stored on a given encrypted remote. This has been
+a concern, and it has been considered to use a hash like HMAC, rather
+than gpg encrypting filenames, to make it easier. (For git-annex, but
+possibly also for attackers!) But, does git-annex really ever need to do
+such an enumeration?
+
+Apparently not. `git annex unused --from remote` can now check for
+unused data that is stored on a remote, and it does so based only on
+location log data for the remote. This assumes that the location log is
+kept accurately.
+
+What about `git annex fsck --from remote`? Such a command should be able to,
+for each file in the repository, contact the encrypted remote to check
+if it has the file. This can be done without enumeration, although it will
+mean running gpg once per file fscked, to get the encrypted filename.
+
+### risks
+
+A risk of this scheme is that, once the symmetric cipher has been obtained, it
+allows full access to all the encrypted content. This scheme does not allow
+revoking a given gpg key access to the cipher, since anyone with such a key
+could have already decrypted the cipher and stored a copy.
+
+If git-annex stores the decrypted symmetric cipher in memory, then there
+is a risk that it could be intercepted from there by an attacker. Gpg
+ameliorates these types of risks by using locked memory.
+
+This design does not support obfuscating the size of files by chunking
+them, as that would have added a lot of complexity, for dubious benefits.
+If the untrusted party running the encrypted remote wants to know file sizes,
+they could correlate chunks that are accessed together. Encrypting data
+changes the original file size enough to avoid it being used as a direct
+fingerprint at least.
diff --git a/doc/special_remotes/Amazon_S3.mdwn b/doc/special_remotes/Amazon_S3.mdwn
index 384110d1d..2cf23187d 100644
--- a/doc/special_remotes/Amazon_S3.mdwn
+++ b/doc/special_remotes/Amazon_S3.mdwn
@@ -37,19 +37,4 @@ only be used for public data.
 
 ** Encryption is not yet supported. **
 
-When encryption is enabled, all files stored in the bucket are
-encrypted with gpg. Additionally, the filenames themselves are encrypted
-(using HMAC). The size of the encrypted files, and
-access patterns of the data, should be the only clues to what type of
-data you are storing in S3.
-
-[[!template id=note text="""
-This scheme was originally developed by Lars Wirzenius et al
-[for Obnam](http://braawi.org/obnam/encryption/).
-"""]]
-The data stored in S3 is encrypted by gpg with a symmetric cipher. The
-passphrase of the cipher is itself checked into your git repository,
-encrypted using one or more gpg public keys. This scheme allows new private
-keys to be given access to a bucket's content, after the bucket is created
-and is in use. The symmetric cipher is also hashed together with filenames
-used in the bucket, in order to obfuscate the filenames.
+See [[design/encryption]].