author Joey Hess <joey@kitenet.net> 2014-01-22 15:55:44 -0400
committer Joey Hess <joey@kitenet.net> 2014-01-22 15:55:44 -0400
commit 9a5de318d15f0234080a6f0bd802fe073cf57334
tree df77d2f8474a4bc36b316d0ac28c5af886b9aed4
parent cc366b8241cfc3e41252ecd2624332c15da03377
preferred content stability analysis
-rw-r--r-- doc/design/preferred_content.mdwn 21
-rw-r--r-- doc/todo/Limit_file_revision_history.mdwn 30
2 files changed, 49 insertions, 2 deletions
diff --git a/doc/design/preferred_content.mdwn b/doc/design/preferred_content.mdwn
new file mode 100644
index 000000000..3972b8b58
--- /dev/null
+++ b/doc/design/preferred_content.mdwn
@@ -0,0 +1,21 @@
+The [[preferred_content]] expressions didn't have a design document, but
+they form a small, non-Turing-complete DSL for expressing which objects a
+repository prefers to contain.
+
+One thing that needs to be written down though is the stability analysis
+that must be done of preferred content expressions.
+
+It's important that when a set of repositories all look at one another's
+preferred content expressions, and copy/move/drop objects to satisfy them,
+they end up at a steady state. So, a given preferred content expression
+should ideally evaluate to the same answer for each key, from the
+perspective of each repository.
+
+The best way to ensure that is the case is to only use terms in preferred
+content expressions that rely on state that is shared between all
+repositories. So, state in the git-annex branch, or the master branch
+(assuming all repositories have master checked out).
+
+Since git is eventually consistent, there might be disagreements about
+which object belongs where, but once consistency is reached, things will
+settle down.
diff --git a/doc/todo/Limit_file_revision_history.mdwn b/doc/todo/Limit_file_revision_history.mdwn
index 593e93013..9cdfe5e9b 100644
--- a/doc/todo/Limit_file_revision_history.mdwn
+++ b/doc/todo/Limit_file_revision_history.mdwn
@@ -42,7 +42,8 @@ Finally, how to specify a feature request for git-annex?
> to hang on to unused content.
> Something like "unused=true" I suppose, because not having a parameter
> would complicate preferred content parsing, and I cannot think
-> of a useful parameter.
+> of a useful parameter. (It cannot be a timestamp, because there's
+> no way repos can agree on when a key became unused.)
> * In order to quickly match that terminal, the Annex monad will need
> to keep a Set of unused Keys. This should only be loaded on demand.
> NB: There is some potential for a great many unused Keys to cause
@@ -57,7 +58,7 @@ Finally, how to specify a feature request for git-annex?
> for most repos. Note that the assistant could also notice on the
> fly when files are removed and mark their keys as unused if that was
> the last associated file. (Only currently possible in direct mode.)
-> * It makes sense for the
+> * After scanning for unused files, it makes sense for the
> assistant to queue transfers of unused files to any remotes that
> do want them (eg, backup remotes). If the files can successfully be
> sent to a remote, that will lead to them being dropped locally as
@@ -70,6 +71,7 @@ Finally, how to specify a feature request for git-annex?
> time stamp of the object; we could use the mtime of the .map file,
> but that's direct mode only and may be replaced with a database
> later. Seems best to just keep an unused log file with timestamps.
+> **done**
> * After the assistant scans for unused files, if annex.expireunused
> is not set, and there is some significant quantity of unused files
> (eg, more than 1000, or more than 1 gb, or more than the amount of
@@ -87,3 +89,27 @@ Finally, how to specify a feature request for git-annex?
> might be. For example, if a file is replicated to 2 clients, and one
> client directly edits it, or deletes it, it loses the old version,
> but the other client will still be storing that old version.
+>
+> ## Stability analysis for unused= in preferred content expressions
+>
+> This is tricky, because two repos that are otherwise entirely
+> in sync may have differing opinions about whether a key is unused,
+> depending on when each last scanned for unused keys.
+>
+> So, this preferred content terminal is *not stable*.
+> It may be possible to write preferred content expressions
+> that constantly move such keys around without reaching a steady state.
+>
+> Example:
+>
+> A and B are clients directly connected, and both also connected
+> to BACKUP.
+>
+> A deletes F. B syncs with A, and runs unused check; decides F
+> is unused. B sends F to BACKUP. B will then think A doesn't want F,
+> and will drop F from A. Next time A runs a full transfer scan, it will
+> *not* find F (because the file was deleted!). So it won't get F back from
+> BACKUP.
+>
+> So, the fact that unused files are not going to be
+> looked for on the full transfer scan seems to make this work out ok.