every idea that came to me in my sleep. there were rather a lot of them

author: Joey Hess <joey@kitenet.net> 2014-02-11 11:37:53 -0400
committer: Joey Hess <joey@kitenet.net> 2014-02-11 11:37:53 -0400
commit: b7cce8dd825b39ea56395c06858b50112459bf1f (patch)
tree: 8a3f088c50df1fd6a2824ec3cf2cf03479819c4c
parent: dbd2093acce3eaabaf678286e634bc7f37798d77 (diff)
1 files changed, 100 insertions, 21 deletions
diff --git a/doc/design/metadata.mdwn b/doc/design/metadata.mdwn
index 8e409f7d6..be3627d5f 100644
--- a/doc/design/metadata.mdwn
+++ b/doc/design/metadata.mdwn
@@ -4,11 +4,12 @@
 
 Attach an arbitrary set of metadata to a key.
 
-Metadata can be tags, but it can also be fields with values (ie, date=xxx,
-conference=yyy).
-
 Store in git-annex branch, next to location log files.
 
+Metadata can be tags, but it can also be fields with values (ie, date=xxx,
+conference=yyy). Fields can have multiple values, for example
+multiple authors.
+
 Storage needs to support union merging, including removing tags, and
 changing values.
 
@@ -20,6 +21,7 @@ when adding it.
 Could also automatically attach permissions.
 
 A git hook could be run by git annex add to gather more metadata.
+For example, by examining MP3 metadata.
 
 Also auto adds metadata when adding files to filter branches. See below.
 
@@ -28,40 +30,62 @@ Also auto adds metadata when adding files to filter branches. See below.
 From the ctime, some additional 
 metadata is derived, at least year=yyyy and probably also month, etc.
 
-Should be a general mechanism for this.
+This is probably not stored anywhere. It's computed on demand by a pure
+function from the other metadata.
+
+From the set of tags a file has, a "tag" field is derived, which has the
+value of each tag. See example below.
+
+Should be a general mechanism for this. (It probably generalizes to
+sql queries if we want to go that far.)
 
 # filtered branches
 
 `git annex filter year=2014 talk` should create a new branch
-filtered/talk/year=2014 containing only files tagged with that, and
+filtered/year=2014/talk containing only files tagged with that, and
 have git check it out. In this example, all files appear in top level
 directory of repo; no subdirs.
 
 `git annex fadd haskell` switches to branch
-filtered/haskell/talk/year=2014 with only the haskell talks.
+filtered/year=2014/talk/haskell with only the haskell talks.
 
 `git annex fadd year=2013 year=2012` switches to branch
-filtered/haskell/talk/year=2012,2013,2014. This has subdirectories 2012,
+filtered/year=2012,2013,2014/talk/haskell. This has subdirectories 2012,
 2013 and 2014 with the matching talks.
 
+Patterns can be used in both the values of fields, and in matching tags.
+So, `year=20*` could be used to match years, and `foo/*` matches any
+tag in the foo namespace. Or even `*` to match *all* tags.
+
 `git annex frm haskell` switches to
-filtered/talk/year=2012,2013,2014, which has all available talks in it.
+filtered/year=2012,2013,2014/talk, which has all available talks in it.
 
-`git annex filteradd conference=fosdem conference=icfp` switches to branch
-filtered/conference=fosdem,icfp/talk/year=2012,2013,2014. Now we need
-to either nest the subdirectories, or make fosdem-2014, icfp-2013, etc.
-May need an option to choose this. Note that user may prefer to have year
-first or conference first, so may need an option for that as well.
+`git annex fadd conference=fosdem conference=icfp` switches to branch
+filtered/year=2012,2013,2014/talk/conference=fosdem,icfp. Now there
+are nested subdirectories. They follow the format of the branch,
+so 2013/icfp, 2014/fosdem, etc.
 
-Note that old filter branches can be deleted when switching to a new one.
-There is no need to retain them. Unless the user has committed non
-git-annexed files to them, In which case, urk.
+`git annex filter tag=haskell,debian` uses the "tag" field that is
+automatically derived from the set of tags. So this yields a branch
+with haskell and debian subdirectories, containing the files tagged with
+either. 
 
-These command should probably refuse to do anything if run from within a
-subdir of the work tree that would get deleted by checking out the new
-filtered branch.
+To see all tags, `git annex filter tag=*` !
 
-# operations while on filter branch
+Files not matching the filter can be included, by using 
+`git annex filter --unmatched=other`. That puts all such files into
+the subdirectory other.
+
+Sometimes you want to see files that do not match a tag, while still
+getting subdirectories for 
+
+Note that old filter branches can be deleted when switching to a new one.
+There is no need to retain them. Unless the user has committed non-annexed
+files to them, in which case, urk. The only reason to use specially named
+filtered branches is because it makes it self-documenting how the repository
+is currently filtered.
+
+## operations while on filtered branch
 
 * If files are removed and git commit called, git-annex should remove the
   relevant metadata from the files. **possibly** It's not clear that
@@ -69,6 +93,8 @@ filtered branch.
   branch (especially if it's derived metadata like the year).
   Also, this is not usable in direct mode because deleting the
   file.. actually deletes it.
+* If a file is moved into a new subdirectory while in a filter branch,
+  a tag is added with the subdir name. This allows on the fly tagging.
 * `git annex sync` should avoid pushing out the filter branch, but
   it should check if there are changes to the metadata pulled in, and update
   the branch to reflect them.
@@ -85,6 +111,11 @@ same tree of files filter would. The user can then commit that if desired.
 Or, they could run additional commands like `git annex fadd` to refine the
 tree of files in the subdir.
 
+Metadata can be used for configuring numcopies. One way would be a
+numcopies=n value attached to a file. But perhaps better would be to make
+the numcopies.log allow configuring numcopies based on which files have
+other metadata.
+
 Other programs could query git-annex for the metadata of files in the work
 tree, and do whatever they want with it.
 
@@ -97,11 +128,59 @@ want to see.
 * Could use filename metadata for the key, recorded by git-annex add (which
   may not correspond to filenames being used in regular git branches like
   master for the key).
-* Couod use the .map files to get a filename, but this is somewhat
+* Could use the .map files to get a filename, but this is somewhat
   arbitrary (.map can contain multiple filenames), and is only
   currently supported in direct mode.
 
+Note that any of these filenames can in theory conflict. May need to use
+`.variant-*` like sync does on conflict to allow 2 files with same name in
+same filtered branch.
+
 # efficient metadata lookup
 
 Looking up metadata for filtering so far requires traversing all keys in
 the git-annex branch. This is slow. A fast cache is needed.
+
+# direct mode issues
+
+Checking out a filter branch can result in any number of copies of a file
+appearing in different directories. No problem in indirect mode, but
+in direct mode these are real, expensive copies.
+
+But, it's worth supporting direct mode!
+
+So, possible approaches:
+
+* Before checking out a filter branch, calculate how much space will
+  be used by duplicates and refuse if not enough is free.
+* Only check out one file, and omit the copies. Keep track of which
+  files were omitted, and make sure that when committing on the branch,
+  that metadata is not removed. Has the downside that files can seem
+  to randomly move around in the tree as their metadata changes.
+* Disallow filter branch checkouts that have duplicate files.
+  Note that duplicate files can only occur when filtering on the content
+  of values, not tags. And values can be used in some simple cases w/o
+  duplicate files. This would cripple it some, but perhaps not too badly?
+
+# gotchas
+
+* Checking out a filter branch can remove the current subdir. May be worth
+  detecting when this happens and leaving behind an empty directory so the
+  user can navigate back up.
+
+* Git has a complex set of rules for what is legal in a ref name.
+  Filter branch names will need to filter out any illegal stuff.
+
+* Filesystems that are not case sensitive (including case preserving OSX)
+  will cause problems if filter branches try to use different cases for 
+  2 directories representing the value of some metadata. But, users
+  probably want at least case-preserving metadata values. 
+  
+  Solution might be to compare metadata case-insensitively, and
+  pick one representation consistently, so if, for example, an author
+  field uses mixed case, it will be used in the filter branch.
+
+  Alternatively, it could escape `A` to `_A` when such a filesystem
+  is detected and avoid collisions that way (double `_` to escape it).
+  This latter option is ugly, but so are non-posix filesystems.. and it
+  also solves any similar issues with case-colliding filenames.
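
The escaping scheme in the last gotcha (`A` becomes `_A`, and `_` is doubled) could be sketched roughly as follows. This is only an illustration and not git-annex code: the function names `escape_case` and `unescape_case` are hypothetical, and the sketch assumes that only ASCII uppercase letters need escaping. Non-ASCII case folding is not handled.

```python
import string


def escape_case(name: str) -> str:
    """Escape a metadata value for use as a directory name so that values
    differing only in ASCII case remain distinct on a case-insensitive
    filesystem. (Hypothetical helper, not part of git-annex.)"""
    out = []
    for c in name:
        if c == "_":
            # Double the escape character itself.
            out.append("__")
        elif c in string.ascii_uppercase:
            # Mark uppercase letters with a leading underscore.
            out.append("_" + c)
        else:
            out.append(c)
    return "".join(out)


def unescape_case(escaped: str) -> str:
    """Reverse escape_case. The letter after an underscore is uppercased,
    so this still works if the filesystem did not preserve its case."""
    out = []
    i = 0
    while i < len(escaped):
        c = escaped[i]
        if c == "_" and i + 1 < len(escaped):
            nxt = escaped[i + 1]
            out.append("_" if nxt == "_" else nxt.upper())
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out)
```

With this sketch, `escape_case("A")` and `escape_case("a")` give `_A` and `a`, which no longer collide when compared case-insensitively, and `unescape_case` recovers the original value.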