author: Joey Hess <joeyh@joeyh.name> 2017-09-07 15:53:34 -0400
committer: Joey Hess <joeyh@joeyh.name> 2017-09-07 15:53:34 -0400
commit: a39d0a51b48a02afd7a9cd50725b753d96e9e145
tree: 33909c5a6920d1ac0ab7aa22452a7d15e3338f9f /doc
parent: d928368d8a698abcaa06b4ee17764f5514521ffc
parent: 6b62556049481d8ed9b75a1642f3422a79c55133
Merge branch 'export'
Diffstat (limited to 'doc')
-rw-r--r-- doc/design/exporting_trees_to_special_remotes.mdwn | 169
-rw-r--r-- doc/design/external_special_remote_protocol.mdwn | 54
-rw-r--r-- doc/git-annex-export.mdwn | 64
-rw-r--r-- doc/git-annex-import.mdwn | 2
-rw-r--r-- doc/git-annex.mdwn | 6
-rw-r--r-- doc/internals.mdwn | 19
-rw-r--r-- doc/special_remotes/directory.mdwn | 4
-rw-r--r-- doc/todo/export.mdwn | 12
8 files changed, 273 insertions, 57 deletions
diff --git a/doc/design/exporting_trees_to_special_remotes.mdwn b/doc/design/exporting_trees_to_special_remotes.mdwn
index ce7431141..6e7cc68db 100644
--- a/doc/design/exporting_trees_to_special_remotes.mdwn
+++ b/doc/design/exporting_trees_to_special_remotes.mdwn
@@ -15,13 +15,13 @@ when they want to export a tree. (It would also be possible to drop all content
from an existing special remote and reuse it, but there does not seem much
benefit in doing so.)
-Add a new `initremote` configuration `exporttree=true`, that cannot be
+Add a new `initremote` configuration `exporttree=yes`, that cannot be
changed by `enableremote`:
- git annex initremote myexport type=... exporttree=true
+ git annex initremote myexport type=... exporttree=yes
-It does not make sense to encrypt an export, so exporttree=true requires
-(and can even imply) encryption=false.
+It does not make sense to encrypt an export, so exporttree=yes requires
+encryption=none.
Note that the particular tree to export is not specified yet. This is
because the tree that is exported to a special remote may change.
@@ -69,11 +69,6 @@ To efficiently update an export, git-annex can diff the tree
that was exported with the new tree. The naive approach is to upload
new and modified files and remove deleted files.
-Note that a file may have been partially uploaded to an export, and then
-the export updated to a tree without that file. So, need to try to delete
-all removed files, even if location tracking does not say that the special
-remote contains them.
-
With rename detection, if the special remote supports moving files,
more efficient updates can be done. It gets complicated; consider two files
that swap names.
@@ -81,33 +76,6 @@ that swap names.
If the special remote supports copying files, that would also make some
updates more efficient.
-## resuming exports
-
-Resuming an interrupted export needs to work well.
-
-There are two cases here:
-
-1. Some of the files in the tree have been uploaded; others have not.
-2. A file has been partially uploaded.
-
-These two cases need to be disentangled somehow in order to handle
-them. One way is to use the location log as follows:
-
-* Before a file is uploaded, look up what key is currently exported
- using that filename. If there is one, update the location log,
- saying it's not present in the special remote.
-* Upload the file.
-* Update the location log for the newly exported key.
-
-Note that this method does not allow resuming a partial upload by appending to
-a file, because we don't know if the file actually started to be uploaded, or
-if the file instead still has the old key's content. Instead, the whole
-file needs to be re-uploaded.
-
-Alternative: Keep an index file that's the current state of the export.
-See comment #4 of [[todo/export]]. Not sure if that works? Perhaps it
-would be overkill if it's only used to support resuming partial uploads.
-
## changes to special remote interface
This needs some additional methods added to special remotes, and to
@@ -115,6 +83,10 @@ the [[external_special_remote_protocol]].
Here's the changes to the latter:
+* `EXPORTSUPPORTED`
+ Used to check if a special remote supports exports. The remote
+ responds with either `EXPORTSUPPORTED-SUCCESS` or
+ `EXPORTSUPPORTED-FAILURE`
* `EXPORT Name`
Comes immediately before each of the following requests,
specifying the name of the exported file. It will be in the form
@@ -123,6 +95,9 @@ Here's the changes to the latter:
* `TRANSFEREXPORT STORE|RETRIEVE Key File`
Requests the transfer of a File on local disk to or from the previously
provided Name on the special remote.
+ Note that it's important that, while a file is being stored,
+ CHECKPRESENTEXPORT not indicate it's present until all the data has
+ been transferred.
The remote responds with either `TRANSFER-SUCCESS` or
`TRANSFER-FAILURE`, and a remote where exports do not make sense
may always fail.
@@ -139,9 +114,8 @@ Here's the changes to the latter:
* `RENAMEEXPORT Key NewName`
Requests the remote rename a file stored on it from the previously
provided Name to the NewName.
- The remote responds with `RENAMEEXPORT-SUCCESS`,
- `RENAMEEXPORT-FAILURE`, or with `RENAMEEXPORT-UNSUPPORTED` if an efficient
- rename cannot be done.
+ The remote responds with `RENAMEEXPORT-SUCCESS` or with
+ `RENAMEEXPORT-FAILURE` if an efficient rename cannot be done.
To support old external special remote programs that have not been updated
to support exports, git-annex will need to handle an `ERROR` response
@@ -162,19 +136,19 @@ key/value stores. The content of a file can change, and if multiple
repositories can export a special remote, they can be out of sync about
what files are exported to it.
-To avoid such problems, when updating an exported file on a special remote,
-the key could be recorded there too. But, this would have to be done
-atomically, and checked atomically when downloading the file. Special
-remotes lack atomicity guarantees for file storage, let alone for file
-retrieval.
-
-Possible solution: Make exporttree=true cause the special remote to
+Possible solution: Make exporttree=yes cause the special remote to
be untrusted, and rely on annex.verify to catch cases where the content
of a file on a special remote has changed. This would work well enough
except for when the WORM or URL backend is used. So, prevent the user
from exporting such keys. Also, force verification on for such special
remotes, don't let it be turned off.
+The same file contents may be in a treeish multiple times under different
+filenames. That complicates using location tracking. One file may have been
+exported and the other not, and location tracking says that the content
+is present in the export. A sqlite database is needed to keep track of
+this.
+
## recording exported filenames in git-annex branch
In order to download the content of a key from a file exported
@@ -229,10 +203,101 @@ In this case, git-annex knows both exported trees. Have the user provide
a tree that resolves the conflict as they desire (it could be the same as
one of the exported trees, or some merge of them or an entirely new tree).
The UI to do this can just be another `git annex export $tree --to remote`.
-To resolve, diff each exported tree in turn against the resolving tree. If a
-file differs, re-export that file. In some cases this will do unncessary
-re-uploads, but it's reasonably efficient.
+To resolve, diff each exported tree in turn against the resolving tree
+and delete all files that differ. Then, upload all missing files.
+
+## when to update export.log for efficient resuming of exports
+
+When should `export.log` be updated? Possibilities:
+
+* Before performing any work, to set the goal.
+* After the export is fully successful, to record the current state.
+* After some mid-point.
+
+Lots of things could go wrong during an export. A file might fail to be
+transferred or only part of it be transferred; a file's content might not
+be present to transfer at all. The export could be interrupted part way.
+Updating the export.log at the right point in time is important to handle
+these cases efficiently.
+
+If the export.log is updated first, then it's only a goal and does not tell
+us what's been done already.
+
+If the export.log is updated only after complete success, then the common
+case of some files not having content locally present will prevent it from
+being updated. When we resume, we again don't know what's been done
+already.
+
+If the export.log is updated after deleting any files from the
+remote that are not the same in the new treeish as in the old treeish,
+and as long as TRANSFEREXPORT STORE is atomic, then when resuming we can
+trust CHECKPRESENTEXPORT to only find files that have the correct content
+for the current treeish. (Unless a conflicting export was made from
+elsewhere, but in that case, the conflict resolution will have to fix up
+later.)
+
+## handling renames efficiently
+
+To handle two files that swap names, a temp name is required.
+
+Difficulty with a temp name is picking a name that won't ever be used by
+any exported file.
+
+Interrupted exports also complicate this. While a name could be picked that
+is in neither the old nor the new tree, an export could be interrupted,
+leaving the file at the temp name. There needs to be something to clean
+that up when the export is resumed, even if it's resumed with a different
+tree.
-The documentation should suggest strongly only exporting to a given special
-remote from a single repository, or having some other rule that avoids
-export conflicts.
+Could use something like ".git-annex-tmp-content-$key" as the temp name.
+This hides it from casual view, which is good, and it's not dependent on the
+tree, so no state needs to be maintained to clean it up. Also, using the
+key in the name simplifies calculation of complicated renames (eg, renaming
+A to B, B to C, C to A)
+
+Export can first try to rename all files that are deleted/modified
+to their key's temp name (falling back to deleting since not all
+special remotes support rename), and then, in a second pass, rename
+from the temp name to the new name. Followed by deleting the temp name
+of all keys whose files are deleted in the diff. That is more renames and
+deletes than strictly necessary, but it will statelessly clean up
+an interrupted export as long as it's run again with the same new tree.
+
+But, an export of tree B should clean up after
+an interrupted export of tree A. Some state is needed to handle this.
+Before starting the export of tree A, record it somewhere. Then when
+resuming, diff A..B, and delete the temp names of the keys in the
+diff. (Can't rename here, because we don't know what was the content
+of a file when an export was interrupted.)
+
+So, before an export does anything, need to record the tree that's about
+to be exported to export.log, not as an exported tree, but as a goal.
+Then on resume, the temp files for that can be cleaned up.
+
+## renames and export conflicts
+
+What if there's an export conflict going on at the same time that a file
+in the export gets renamed?
+
+Suppose that there are two git repos A and B, each exporting to the same
+remote. A and B are not currently communicating. A exports T1 which
+contains F. B exports T2, which has a different content for F.
+
+Then A exports T3, which renames F to G. If that rename is done
+on the remote, then A will think it's successfully exported T3,
+but G will have F's content from T2, not from T1.
+
+When A and B reconnect, the export conflict will be detected.
+To resolve the export conflict, it says above to:
+
+> To resolve, diff each exported tree in turn against the resolving tree
+> and delete all files that differ. Then, upload all missing files.
+
+Assume that the resolving tree is T3. So B's export of T2 is diffed against
+T3. F differs and is deleted (no change). G differs and is deleted,
+which fixes up the problem that the wrong content was renamed to G.
+G is missing so gets uploaded.
+
+So, this works, as long as "delete all files that differ" means it
+deletes both old and new files. And as long as conflict resolution does not
+itself stash away files in the temp name for later renaming.
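
The two-pass rename scheme described in "handling renames efficiently" can be sketched as a small planner. This is only an illustration under stated assumptions: trees are modelled as plain dicts from filename to key, and `temp_name`, `plan_export` and `apply_ops` are hypothetical helpers, not part of git-annex. The fallback to deleting when a remote cannot rename is left out.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, ...]


def temp_name(key: str) -> str:
    # Temp name suggested in the design text; it depends only on the key.
    return f".git-annex-tmp-content-{key}"


def plan_export(old: Dict[str, str], new: Dict[str, str]) -> List[Op]:
    """Plan operations that turn an export of `old` into an export of `new`."""
    ops: List[Op] = []
    stashed = set()
    # Pass 1: move every deleted or modified file aside to its key's temp name.
    for name in sorted(old):
        if new.get(name) == old[name]:
            continue
        key = old[name]
        if key in stashed:
            ops.append(("remove", name))  # same content is already stashed
        else:
            ops.append(("rename", name, temp_name(key)))
            stashed.add(key)
    # Pass 2: fill in each new or changed name, reusing stashed content if any.
    for name in sorted(new):
        if old.get(name) == new[name]:
            continue
        key = new[name]
        if key in stashed:
            ops.append(("rename", temp_name(key), name))
            stashed.discard(key)
        else:
            ops.append(("upload", key, name))
    # Pass 3: delete temp names of keys that are no longer needed.
    for key in sorted(stashed):
        ops.append(("remove", temp_name(key)))
    return ops


def apply_ops(remote: Dict[str, str], ops: List[Op]) -> Dict[str, str]:
    """Simulate the planned operations on an in-memory model of the remote."""
    remote = dict(remote)
    for op in ops:
        if op[0] == "rename":
            remote[op[2]] = remote.pop(op[1])
        elif op[0] == "remove":
            remote.pop(op[1], None)
        elif op[0] == "upload":
            remote[op[2]] = op[1]
    return remote
```

Because the temp name is derived from the key rather than from either tree, two files that swap names never collide: both are moved aside first and then renamed into place.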
diff --git a/doc/design/external_special_remote_protocol.mdwn b/doc/design/external_special_remote_protocol.mdwn
index 87a838bd4..8a34bb2d7 100644
--- a/doc/design/external_special_remote_protocol.mdwn
+++ b/doc/design/external_special_remote_protocol.mdwn
@@ -43,7 +43,8 @@ the version of the protocol it is using.
Once it knows the version, git-annex will generally
send a message telling the special remote to start up.
-(Or it might send a INITREMOTE, so don't hardcode this order.)
+(Or it might send an INITREMOTE or EXPORTSUPPORTED,
+so don't hardcode this order.)
PREPARE
@@ -102,7 +103,7 @@ The following requests *must* all be supported by the special remote.
So any one-time setup tasks should be done idempotently.
* `PREPARE`
Tells the remote that it's time to prepare itself to be used.
- Only INITREMOTE can come before this.
+ Only INITREMOTE or EXPORTSUPPORTED can come before this.
* `TRANSFER STORE|RETRIEVE Key File`
Requests the transfer of a key. For STORE, the File is the file to upload;
for RETRIEVE the File is where to store the download.
@@ -143,6 +144,46 @@ replying with `UNSUPPORTED-REQUEST` is acceptable.
network access.
This is not needed when `SETURIPRESENT` is used, since such uris are
automatically displayed by `git annex whereis`.
+* `EXPORTSUPPORTED`
+ Used to check if a special remote supports exports. The remote
+ responds with either `EXPORTSUPPORTED-SUCCESS` or
+ `EXPORTSUPPORTED-FAILURE`. Note that this request may be made before
+ or after `PREPARE`.
+* `EXPORT Name`
+ Comes immediately before each of the following export-related requests,
+ specifying the name of the exported file. It will be in the form
+ of a relative path, and may contain path separators, whitespace,
+ and other special characters.
+* `TRANSFEREXPORT STORE|RETRIEVE Key File`
+ Requests the transfer of a File on local disk to or from the previously
+ provided Name on the special remote.
+ Note that it's important that, while a file is being stored,
+ CHECKPRESENTEXPORT not indicate it's present until all the data has
+ been transferred.
+ The remote responds with either `TRANSFER-SUCCESS` or
+ `TRANSFER-FAILURE`, and a remote where exports do not make sense
+ may always fail.
+* `CHECKPRESENTEXPORT Key`
+ Requests the remote to check if the previously provided Name is present
+ in it.
+ The remote responds with `CHECKPRESENT-SUCCESS`, `CHECKPRESENT-FAILURE`,
+ or `CHECKPRESENT-UNKNOWN`.
+* `REMOVEEXPORT Key`
+ Requests the remote to remove content stored by `TRANSFEREXPORT`
+ with the previously provided Name.
+ The remote responds with either `REMOVE-SUCCESS` or
+ `REMOVE-FAILURE`.
+ If the content was already not present in the remote, it should
+ respond with `REMOVE-SUCCESS`.
+* `RENAMEEXPORT Key NewName`
+ Requests the remote rename a file stored on it from the previously
+ provided Name to the NewName.
+ The remote responds with `RENAMEEXPORT-SUCCESS` or
+ `RENAMEEXPORT-FAILURE`.
+
+To support old external special remote programs that have not been updated
+to support exports, git-annex will need to handle an `ERROR` response
+when using any of the above.
More optional requests may be added, without changing the protocol version,
so if an unknown request is seen, reply with `UNSUPPORTED-REQUEST`.
@@ -210,6 +251,15 @@ while it's handling a request.
stored in the special remote.
* `WHEREIS-FAILURE`
Indicates that no location is known for a key.
+* `EXPORTSUPPORTED-SUCCESS`
+ Indicates that it makes sense to use this special remote as an export.
+* `EXPORTSUPPORTED-FAILURE`
+ Indicates that it does not make sense to use this special remote as an
+ export.
+* `RENAMEEXPORT-SUCCESS`
+ Indicates that a `RENAMEEXPORT` was done successfully.
+* `RENAMEEXPORT-FAILURE`
+ Indicates that a `RENAMEEXPORT` failed for whatever reason.
* `UNSUPPORTED-REQUEST`
Indicates that the special remote does not know how to handle a request.
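
The note on `TRANSFEREXPORT STORE` requires that `CHECKPRESENTEXPORT` not report a file as present until all of its data has been transferred. For a remote backed by a local directory, one way to get that property is to write to a temporary file and rename it into place. The sketch below is an assumption about how an external special remote might do this; `store_export` and `check_present_export` are hypothetical helper names, not part of the protocol.

```python
import os
import shutil
import tempfile


def store_export(export_root: str, name: str, src: str) -> None:
    """Copy src to export_root/name so the final name appears only when complete."""
    dest = os.path.join(export_root, name)
    dest_dir = os.path.dirname(dest)
    os.makedirs(dest_dir, exist_ok=True)
    # Write next to the destination so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
        os.replace(tmp, dest)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def check_present_export(export_root: str, name: str) -> bool:
    """Report whether the exported name exists, as CHECKPRESENTEXPORT would."""
    return os.path.isfile(os.path.join(export_root, name))
```

An interrupted transfer leaves at most a stray `.tmp-` file behind and never a truncated file under the exported name.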
diff --git a/doc/git-annex-export.mdwn b/doc/git-annex-export.mdwn
new file mode 100644
index 000000000..72319a8fc
--- /dev/null
+++ b/doc/git-annex-export.mdwn
@@ -0,0 +1,64 @@
+# NAME
+
+git-annex export - export content to a remote
+
+# SYNOPSIS
+
+git annex export `treeish --to remote`
+
+# DESCRIPTION
+
+Use this command to export a tree of files from a git-annex repository.
+
+Normally files are stored on a git-annex special remote named by their
+keys. That is great for reliable data storage, but your filenames are
+obscured. Exporting replicates the tree to the special remote as-is.
+
+Mixing key/value storage and exports in the same remote would be a mess and
+so is not allowed. You have to configure a special remote with
+`exporttree=yes` when initially setting it up with
+[[git-annex-initremote]](1).
+
+Repeated exports are done efficiently, by diffing the old and new tree,
+and transferring only the changed files.
+
+Exports can be interrupted and resumed. However, partially uploaded files
+will be re-started from the beginning.
+
+Once content has been exported to a remote, commands like `git annex get`
+can download content from there the same as from other remotes. However,
+since an export is not a key/value store, git-annex has to do more
+verification of content downloaded from an export. Some types of keys,
+that are not based on checksums, cannot be downloaded from an export.
+And, git-annex will never trust an export to retain the content of a key.
+
+# EXPORT CONFLICTS
+
+If two different git-annex repositories are both exporting different trees
+to the same special remote, it's possible for an export conflict to occur.
+This leaves the special remote with some files from one tree, and some
+files from the other. Files in the special remote may have entirely the
+wrong content as well.
+
+It's not possible for git-annex to detect when making an export will result
+in an export conflict. The best way to avoid export conflicts is to either
+only ever export to a special remote from a single repository, or to have a
+rule about the tree that you export to the special remote. For example, if
+you always export origin/master after pushing to origin, then an export
+conflict can't happen.
+
+An export conflict can only be detected after the two git repositories
+that produced it get back in sync. Then the next time you run `git annex
+export`, it will detect the export conflict, and resolve it.
+
+# SEE ALSO
+
+[[git-annex]](1)
+
+[[git-annex-initremote]](1)
+
+# AUTHOR
+
+Joey Hess <id@joeyh.name>
+
+Warning: Automatically converted into a man page by mdwn2man. Edit with care.
diff --git a/doc/git-annex-import.mdwn b/doc/git-annex-import.mdwn
index 22b3c3941..3684505b6 100644
--- a/doc/git-annex-import.mdwn
+++ b/doc/git-annex-import.mdwn
@@ -96,6 +96,8 @@ instead of to the annex.
[[git-annex-add]](1)
+[[git-annex-export]](1)
+
# AUTHOR
Joey Hess <id@joeyh.name>
diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn
index 14a787219..544baafa1 100644
--- a/doc/git-annex.mdwn
+++ b/doc/git-annex.mdwn
@@ -158,6 +158,12 @@ subdirectories).
See [[git-annex-importfeed]](1) for details.
+* `export treeish --to remote`
+
+ Export content to a remote.
+
+ See [[git-annex-export]](1) for details.
+
* `undo [filename|directory] ...`
Undo last change to a file or directory.
diff --git a/doc/internals.mdwn b/doc/internals.mdwn
index 7d39b1068..ccf1e09b6 100644
--- a/doc/internals.mdwn
+++ b/doc/internals.mdwn
@@ -185,10 +185,23 @@ content expression.
Tracks what trees have been exported to special remotes by
[[git-annex-export]](1).
-Each line starts with a timestamp, then the uuid of the special remote,
-followed by the sha1 of the tree that was exported to that special remote.
+Each line starts with a timestamp, then the uuid of the repository
+that exported to the special remote, followed by the sha1 of the tree
+that was exported, and then by the uuid of the special remote.
-(The exported tree is also grafted into the git-annex branch, at
+There can also be subsequent sha1s, of trees that have started to be
+exported but whose export is not yet complete. The sha1 of the exported
+tree can be the empty tree (4b825dc642cb6eb9a060e54bf8d69288fbee4904)
+in order to record the beginning of the first export.
+
+For example:
+
+ 1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55 4b825dc642cb6eb9a060e54bf8d69288fbee4904 26339d22-446b-11e0-9101-002170d25c55 bb08b1abd207aeecccbc7060e523b011d80cb35b
+ 1317929100.012345s e605dca6-446a-11e0-8b2a-002170d25c55 bb08b1abd207aeecccbc7060e523b011d80cb35b 26339d22-446b-11e0-9101-002170d25c55
+ 1317929189.157237s e605dca6-446a-11e0-8b2a-002170d25c55 bb08b1abd207aeecccbc7060e523b011d80cb35b 26339d22-446b-11e0-9101-002170d25c55 7c7af825782b7c8706039b855c72709993542be4
+ 1317923000.251111s e605dca6-446a-11e0-8b2a-002170d25c55 7c7af825782b7c8706039b855c72709993542be4 26339d22-446b-11e0-9101-002170d25c55
+
+(The trees are also grafted into the git-annex branch, at
`export.tree`, to prevent git from garbage collecting it. However, the head
of the git-annex branch should never contain such a grafted in tree;
the grafted tree is removed in the same commit that updates `export.log`.)
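
The `export.log` line format described in this hunk can be read with a few lines of code. This is a minimal sketch that follows the field order given in the text above (timestamp, exporting repository uuid, exported tree, special remote uuid, then any incomplete trees). `ExportLogLine` and `parse_export_log_line` are hypothetical names, and git-annex's own parser may differ.

```python
from dataclasses import dataclass
from typing import List

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


@dataclass
class ExportLogLine:
    timestamp: float
    exporter_uuid: str
    exported_tree: str
    remote_uuid: str
    incomplete_trees: List[str]


def parse_export_log_line(line: str) -> ExportLogLine:
    """Split one export.log line into its whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"malformed export.log line: {line!r}")
    ts, exporter, tree, remote, *incomplete = fields
    # Timestamps in the example carry a trailing "s" for seconds.
    return ExportLogLine(float(ts.rstrip("s")), exporter, tree, remote, incomplete)
```

A line whose `exported_tree` equals `EMPTY_TREE` and whose `incomplete_trees` is non-empty records the start of the first export to that remote.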
diff --git a/doc/special_remotes/directory.mdwn b/doc/special_remotes/directory.mdwn
index 5584f31f3..70610c66d 100644
--- a/doc/special_remotes/directory.mdwn
+++ b/doc/special_remotes/directory.mdwn
@@ -31,6 +31,10 @@ remote:
Do not use for new remotes. It is not safe to change the chunksize
setting of an existing remote.
+* `exporttree` - Set to "yes" to make this special remote usable
+ by [[git-annex-export]]. It will not be usable as a general-purpose
+ special remote.
+
Setup example:
# git annex initremote usbdrive type=directory directory=/media/usbdrive/ encryption=none
diff --git a/doc/todo/export.mdwn b/doc/todo/export.mdwn
index e729b0cf1..c4e57bd1c 100644
--- a/doc/todo/export.mdwn
+++ b/doc/todo/export.mdwn
@@ -14,3 +14,15 @@ Would this be able to reuse the existing `storeKey` interface, or would
there need to be a new interface in supported remotes?
--[[Joey]]
+
+Work is in progress. Todo list:
+
+* `git annex get --from export` works in the repo that exported to it,
+ but in another repo, the export db won't be populated, so it won't work.
+ Maybe just show a useful error message in this case?
+ However, exporting from one repository and then trying to update the
+ export from another repository also doesn't work right, because the
+ export database is not populated. So, seems that the export database needs
+ to get populated based on the export log in these cases.
+* Support export to additional special remotes (S3 etc)
+* Support export to external special remotes.