Diffstat (limited to 'doc/bugs/Glacier_remote_uploads_duplicates.mdwn')
-rw-r--r-- | doc/bugs/Glacier_remote_uploads_duplicates.mdwn | 31 |
1 files changed, 31 insertions, 0 deletions
diff --git a/doc/bugs/Glacier_remote_uploads_duplicates.mdwn b/doc/bugs/Glacier_remote_uploads_duplicates.mdwn
new file mode 100644
index 000000000..bcbd94815
--- /dev/null
+++ b/doc/bugs/Glacier_remote_uploads_duplicates.mdwn
@@ -0,0 +1,31 @@
+### Please describe the problem.
+
+Other references:
+
+https://github.com/basak/glacier-cli/pull/19
+http://git-annex.branchable.com/special_remotes/glacier/#comment-a2b05b8dc2d640ee498d90398f02931c
+
+#### Background
+
+ * Glacier doesn't support keys that the client selects, unlike S3. If you upload to Glacier, Glacier assigns a unique ID, not the client.
+ * Glacier does support an "archive description" which is immutable. It also provides this "archive description" in an inventory listing, together with the unique IDs.
+ * An "archive description" is not a unique key. It's perfectly possible to upload multiple archives to Glacier with the same "archive description".
+ * glacier-cli uses the "archive description" field as an upload identifier, since the unique IDs are unfriendly to users. However, since they are potentially ambiguous identifiers, it also supports disambiguation using the ID itself. See "Addressing Archives" in README.md for details.
+
+#### The Problem
+
+This is what I believe is happening in the two reports referenced above. When git-annex is used without `--trust-glacier`, it can end up uploading the same data multiple times. From git-annex's point of view, it cannot verify that the data is already in Glacier, so it uploads again, expecting an overwrite operation if the key is already in Glacier. Since glacier-cli maps the key to an "archive description" that can be duplicated, this is not what happens. Instead, a second archive is uploaded.
+
+When git-annex later does a "checkpresent" operation, glacier-cli fails. This is because the request is ambiguous, since there are two archives in Glacier with the same "key". The error message could be better here, but I believe that the behaviour is correct.
+
+#### Discussion
+
+glacier-cli can find out what data Glacier claims to have using an inventory retrieval. However, this retrieval takes about four hours and can be out of date (e.g. if someone else recently deleted the archive from another client). Thus, I can understand git-annex's desire not to trust this data or a cache of it.
+
+However, whatever we do, it is impossible to map an "upload or overwrite on key X" type command to Glacier. We'll always end up with duplicates. Even if git-annex stored the Glacier archive IDs, there is no API to replace an existing archive with the same ID, and inventories are out of date even before we retrieve them.
+
+#### Workaround
+
+If the problem is as I think it is, always applying `--trust-glacier` should prevent the problem from occurring in most cases, since git-annex will run "checkpresent" and glacier-cli will confirm that the archive exists.
+
+To fix the problem after it has occurred, it should be sufficient to delete duplicates using glacier-cli, since they _should_ be identical to each other. Some enhancement of the `glacier-cli archive list` command would help here.
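
The failure mode described in the report can be reproduced with a toy model. The sketch below is not glacier-cli's or git-annex's code and does not call the Glacier API. `FakeVault`, `AmbiguousArchive`, the `id-N` archive IDs and the example key are all hypothetical names invented for illustration. It assumes only what the report states: the service assigns the archive ID, descriptions are not unique, and a lookup by description that matches more than one archive is treated as an error.

```python
import itertools
from collections import defaultdict


class AmbiguousArchive(Exception):
    """Raised when one description matches more than one archive."""


class FakeVault:
    """Toy in-memory stand-in for a Glacier vault (hypothetical, not the real API)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.archives = {}  # archive ID -> (description, data)

    def upload(self, description, data):
        # The service, not the client, picks the ID, so nothing is ever overwritten.
        archive_id = f"id-{next(self._ids)}"
        self.archives[archive_id] = (description, data)
        return archive_id

    def ids_for(self, description):
        # All archive IDs whose description equals the given one.
        return [i for i, (d, _) in self.archives.items() if d == description]

    def check_present(self, description):
        # Mirrors the behaviour described in the report: an ambiguous name is an error.
        ids = self.ids_for(description)
        if len(ids) > 1:
            raise AmbiguousArchive(f"{description!r} matches {ids}")
        return len(ids) == 1

    def duplicates(self):
        # Group archive IDs by description and keep only the ambiguous groups.
        groups = defaultdict(list)
        for archive_id, (description, _) in self.archives.items():
            groups[description].append(archive_id)
        return {d: ids for d, ids in groups.items() if len(ids) > 1}


vault = FakeVault()
key = "SHA256E-s3--example"  # hypothetical git-annex key used as the description
vault.upload(key, b"abc")
vault.upload(key, b"abc")  # intended as an overwrite, but creates a second archive

try:
    vault.check_present(key)
except AmbiguousArchive as err:
    print("checkpresent failed:", err)

print(vault.duplicates())
```

The `duplicates()` grouping is the kind of listing that would make the manual cleanup in the Workaround section easier: every ID after the first in each group is a candidate for deletion, provided the archives really are identical.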