From 9a5de318d15f0234080a6f0bd802fe073cf57334 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 22 Jan 2014 15:55:44 -0400
Subject: preferred content stability analysis

---
 doc/design/preferred_content.mdwn         | 21 +++++++++++++++++++++
 doc/todo/Limit_file_revision_history.mdwn | 30 ++++++++++++++++++++++++++++--
 2 files changed, 49 insertions(+), 2 deletions(-)
 create mode 100644 doc/design/preferred_content.mdwn

diff --git a/doc/design/preferred_content.mdwn b/doc/design/preferred_content.mdwn
new file mode 100644
index 000000000..3972b8b58
--- /dev/null
+++ b/doc/design/preferred_content.mdwn
@@ -0,0 +1,21 @@
+The [[preferred_content]] expressions didn't have a design document, but
+it's a small non-Turing-complete DSL for expressing which objects a
+repository prefers to contain.
+
+One thing that needs to be written down though is the stability analysis
+that must be done of preferred content expressions.
+
+It's important that when a set of repositories all look at one another's
+preferred content expressions, and copy/move/drop objects to satisfy them,
+they end up at a steady state. So, a given preferred content expression
+should ideally evaluate to the same answer for each key, from the
+perspective of each repository.
+
+The best way to ensure that is the case is to only use terms in preferred
+content expressions that rely on state that is shared between all
+repositories. So, state in the git-annex branch, or the master branch
+(assuming all repositories have master checked out).
+
+Since git is eventually consistent, there might be disagreements about
+which object belongs where, but once consistency is reached, things will
+settle down.
diff --git a/doc/todo/Limit_file_revision_history.mdwn b/doc/todo/Limit_file_revision_history.mdwn
index 593e93013..9cdfe5e9b 100644
--- a/doc/todo/Limit_file_revision_history.mdwn
+++ b/doc/todo/Limit_file_revision_history.mdwn
@@ -42,7 +42,8 @@ Finally, how to specify a feature request for git-annex?
 > to hang on to unused content.
 > Something like "unused=true" I suppose, because not having a parameter
 > would complicate preferred content parsing, and I cannot think
-> of a useful parameter.
+> of a useful parameter. (It cannot be a timestamp, because there's
+> no way repos can agree about when a key became unused.)
 > * In order to quickly match that terminal, the Annex monad will need
 > to keep a Set of unused Keys. This should only be loaded on demand.
 > NB: There is some potential for a great many unused Keys to cause
@@ -57,7 +58,7 @@ Finally, how to specify a feature request for git-annex?
 > for most repos. Note that the assistant could also notice on the
 > fly when files are removed and mark their keys as unused if that was
 > the last associated file. (Only currently possible in direct mode.)
-> * It makes sense for the
+> * After scanning for unused files, it makes sense for the
 > assistant to queue transfers of unused files to any remotes that
 > do want them (eg, backup remotes). If the files can successfully be
 > sent to a remote, that will lead to them being dropped locally as
@@ -70,6 +71,7 @@ Finally, how to specify a feature request for git-annex?
 > time stamp of the object; we could use the mtime of the .map file,
 > that that's direct mode only and may be replaced with a database
 > later. Seems best to just keep a unused log file with timestamps.
+> **done**
 > * After the assistant scans for unused files, if annex.expireunused
 > is not set, and there is some significant quantity of unused files
 > (eg, more than 1000, or more than 1 gb, or more than the amount of
@@ -87,3 +89,27 @@ Finally, how to specify a feature request for git-annex?
 > might be. For example, if a file is replicated to 2 clients, and one
 > client directly edits it, or deletes it, it loses the old version,
 > but the other client will still be storing that old version.
+>
+> ## Stability analysis for unused= in preferred content expressions
+>
+> This is tricky, because two repos that are otherwise entirely
+> in sync may have differing opinions about whether a key is unused,
+> depending on when each last scanned for unused keys.
+>
+> So, this preferred content terminal is *not stable*.
+> It may be possible to write preferred content expressions
+> that constantly move such keys around without reaching a steady state.
+>
+> Example:
+>
+> A and B are clients directly connected, and both also connected
+> to BACKUP.
+>
+> A deletes F. B syncs with A, and runs an unused check; decides F
+> is unused. B sends F to BACKUP. B will then think A doesn't want F,
+> and will drop F from A. Next time A runs a full transfer scan, it will
+> *not* find F (because the file was deleted!). So it won't get F back from
+> BACKUP.
+>
+> So, the fact that unused files are not going to be
+> looked for on the full transfer scan seems to make this work out ok.
-- 
cgit v1.2.3
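
The A/B/BACKUP example in the stability analysis can be walked through with a toy model. This is only a sketch under stated assumptions, not git-annex code: the `simulate` and `full_transfer_scan` functions, the second key `G`, and the rule that B also drops F from itself once BACKUP has it are all hypothetical and exist purely for illustration. It shows why the scenario appears to settle: a full transfer scan only considers keys that a file in the working tree still points at, so the deleted file's key F is never pulled back from BACKUP.

```python
# Toy model only: every name and rule here is an illustrative assumption,
# not part of git-annex.

def full_transfer_scan(tree, local, remote):
    """Fetch content only for keys that a working tree file points at."""
    fetched = set()
    for key in tree.values():
        if key not in local and key in remote:
            local.add(key)
            fetched.add(key)
    return fetched


def simulate(rounds=3):
    tree = {"f": "F", "g": "G"}  # shared working tree: filename -> key
    a = {"F", "G"}               # keys whose content client A holds
    b = {"F", "G"}               # keys whose content client B holds
    backup = set()               # keys whose content BACKUP holds

    # A deletes file f; after B syncs, the shared tree no longer mentions F.
    del tree["f"]

    # B runs its unused check: keys it holds that no file refers to.
    unused_b = b - set(tree.values())

    # B sends the unused keys to BACKUP, which is assumed to want them.
    backup |= unused_b

    # B then drops them from A, and (an assumption) from itself as well.
    a -= unused_b
    b -= unused_b

    # Repeated full transfer scans on both clients should move nothing.
    history = []
    for _ in range(rounds):
        moved = full_transfer_scan(tree, a, backup)
        moved |= full_transfer_scan(tree, b, backup)
        history.append(moved)
    return a, b, backup, history


if __name__ == "__main__":
    a, b, backup, history = simulate()
    print("A:", a, "B:", b, "BACKUP:", backup)
    print("keys moved per scan round:", history)
```

If the scan were changed to also look at unused keys, the same model could be used to check whether F starts bouncing between the clients and BACKUP, which is the instability the analysis warns about.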