From b7cce8dd825b39ea56395c06858b50112459bf1f Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Tue, 11 Feb 2014 11:37:53 -0400
Subject: every idea that came to me in my sleep. there were rather a lot of them

---
 doc/design/metadata.mdwn | 121 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 100 insertions(+), 21 deletions(-)

(limited to 'doc/design/metadata.mdwn')

diff --git a/doc/design/metadata.mdwn b/doc/design/metadata.mdwn
index 8e409f7d6..be3627d5f 100644
--- a/doc/design/metadata.mdwn
+++ b/doc/design/metadata.mdwn
@@ -4,11 +4,12 @@
 
 Attach an arbitrary set of metadata to a key.
 
-Metadata can be tags, but it can also be fields with values (ie, date=xxx,
-conference=yyy).
-
 Store in git-annex branch, next to location log files.
 
+Metadata can be tags, but it can also be fields with values (ie, date=xxx,
+conference=yyy). Fields can have multiple values, for example
+multiple authors.
+
 Storage needs to support union merging, including removing tags,
 and changing values.
 
@@ -20,6 +21,7 @@ when adding it.
 Could also automatically attach permissions.
 
 A git hook could be run by git annex add to gather more metadata.
+For example, by examining MP3 metadata.
 
 Also auto adds metadata when adding files to filter branches. See below.
 
@@ -28,40 +30,62 @@ Also auto adds metadata when adding files to filter branches. See below.
 From the ctime, some additional metadata is derived, at least year=yyyy and
 probably also month, etc.
 
-Should be a general mechanism for this.
+This is probably not stored anywhere. It's computed on demand by a pure
+function from the other metadata.
+
+From the set of tags a file has, a "tag" field is derived, which has the
+value of each tag. See example below.
+
+Should be a general mechanism for this. (It probably generalizes to
+sql queries if we want to go that far.)
 
 # filtered branches
 
 `git annex filter year=2014 talk` should create a new branch
-filtered/talk/year=2014 containing only files tagged with that, and
+filtered/year=2014/talk containing only files tagged with that, and
 have git check it out. In this example, all files appear in top level
 directory of repo; no subdirs.
 
 `git annex fadd haskell` switches to branch
-filtered/haskell/talk/year=2014 with only the haskell talks.
+filtered/year=2014/talk/haskell with only the haskell talks.
 
 `git annex fadd year=2013 year=2012` switches to branch
-filtered/haskell/talk/year=2012,2013,2014. This has subdirectories 2012,
+filtered/year=2012,2013,2014/talk/haskell. This has subdirectories 2012,
 2013 and 2014 with the matching talks.
 
+Patterns can be used in both the values of fields, and in matching tags.
+So, `year=20*` could be used to match years, and `foo/*` matches any
+tag in the foo namespace. Or even `*` to match *all* tags.
+
 `git annex frm haskell` switches to
-filtered/talk/year=2012,2013,2014, which has all available talks in it.
+filtered/year=2012,2013,2014/talk, which has all available talks in it.
 
-`git annex filteradd conference=fosdem conference=icfp` switches to branch
-filtered/conference=fosdem,icfp/talk/year=2012,2013,2014. Now we need
-to either nest the subdirectories, or make fosdem-2014, icfp-2013, etc.
-May need an option to choose this. Note that user may prefer to have year
-first or conference first, so may need an option for that as well.
+`git annex fadd conference=fosdem conference=icfp` switches to branch
+filtered/year=2012,2013,2014/talk/conference=fosdem,icfp. Now there
+are nested subdirectories. They follow the format of the branch,
+so 2013/icfp, 2014/fosdem, etc.
 
-Note that old filter branches can be deleted when switching to a new one.
-There is no need to retain them. Unless the user has committed non
-git-annexed files to them, In which case, urk.
+`git annex filter tag=haskell,debian` uses the "tag" field that is
+automatically derived from the set of tags. So this yields a branch
+with hakell and debian subdirectories, containing the files tagged with
+either.
 
-These command should probably refuse to do anything if run from within a
-subdir of the work tree that would get deleted by checking out the new
-filtered branch.
+To see all tags, `git annex filter tag=*` !
 
-# operations while on filter branch
+Files not matching the filter can be included, by using
+`git annex filter --unmatched=other`. That puts all such files into
+the subdirectory other.
+
+Sometimes you want to see files that do not match a tag, while still
+getting subdirectories for
+
+Note that old filter branches can be deleted when switching to a new one.
+There is no need to retain them. Unless the user has committed non-annexed
+files to them, In which case, urk. The only reason to use specially named
+filtered branches is because it makes self-documenting how the repository
+is currently filtered.
+
+## operations while on filtered branch
 
 * If files are removed and git commit called, git-annex should remove the
   relevant metadata from the files. **possibly** It's not clear that
@@ -69,6 +93,8 @@ filtered branch.
   branch (especially if it's derived metadata like the year). Also, this is
   not usable in direct mode because deleting the file.. actually deletes
   it.
+* If a file is moved into a new subdirectory while in a filter branch,
+  a tag is added with the subdir name. This allows on the fly tagging.
 * `git annex sync` should avoid pushing out the filter branch, but it
   should check if there are changes to the metadata pulled in, and update
   the branch to reflect them.
@@ -85,6 +111,11 @@ same tree of files filter would. The user can then commit that if desired.
 Or, they could run additional commands like `git annex fadd` to refine the
 tree of files in the subdir.
 
+Metadata can be used for configuring numcopies. One way would be a
+numcopies=n value attached to a file. But perhaps better would be to make
+the numcopies.log allow configuring numcopies based on which files have
+other metadata.
+
 Other programs could query git-annex for the metadata of files in the work
 tree, and do whatever it wants with it.
 
@@ -97,11 +128,59 @@ want to see.
 * Could use filename metadata for the key, recorded by git-annex add
   (which may not correspond to filenames being used in regular git branches
   like master for the key).
-* Couod use the .map files to get a filename, but this is somewhat
+* Could use the .map files to get a filename, but this is somewhat
   arbitrary (.map can contain multiple filenames), and is only currently
   supported in direct mode.
 
+Note that any of these filenames can in theory conflict. May need to use
+`.variant-*` like sync does on conflict to allow 2 files with same name in
+same filtered branch.
+
 # efficient metadata lookup
 
 Looking up metadata for filtering so far requires traversing all keys in
 the git-annex branch. This is slow. A fast cache is needed.
+
+# direct mode issues
+
+Checking out a filter branch can result in any number of copies of a file
+appearing in different directories. No problem in indirect mode, but
+in direct mode these are real, expensive copies.
+
+But, it's worth supporting direct mode!
+
+So, possible approaches:
+
+* Before checking out a filter branch, calculate how much space will
+  be used by duplicates and refuse if not enough is free.
+* Only check out one file, and omit the copies. Keep track of which
+  files were omitted, and make sure that when committing on the branch,
+  that metadata is not removed. Has the downside that files can seem
+  to randomly move around in the tree as their metadata changes.
+* Disallow filter branch checkouts that have duplicate files.
+  Note that duplicate files can only occur when filtering on the content
+  of values, not tags. And values can be used in some simple cases w/o
+  duplicate files. This would cripple it some, but perhaps not too badly?
+
+# gotchas
+
+* Checking out a filter branch can remove the current subdir. May be worth
+  detecting when this happens and leaving behind an empty directory so the
+  user can navigate back up.
+
+* Git has a complex set of rules for what is legal in a ref name.
+  Filter branch names will need to filter out any illegal stuff.
+
+* Filesystems that are not case sensative (including case preserving OSX)
+  will cause problems if filter branches try to use different cases for
+  2 directories representing the value of some metadata. But, users
+  probably want at least case-preserving metadata values.
+
+  Solution might be to compare metadata case-insensitively, and
+  pick one representation consistently, so if, for example an author
+  field uses mixed case, it will be used in the filter branch.
+
+  Alternatively, it could escape `A` to `_A` when such a filesystem
+  is detected and avoid collisions that way (double `_` to escape it).
+  This latter option is ugly, but so are non-posix filesystems.. and it
+  also solves any similar issues with case-colliding filenames.
-- 
cgit v1.2.3
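
The `_` escaping idea from the gotchas section of the patch can be sketched in a few lines. This is only an illustration of the scheme as the patch words it (uppercase `A` becomes `_A`, a literal `_` is doubled), not code from git-annex. The function names `escape_case` and `unescape_case` are hypothetical, and the sketch assumes that "uppercase" means whatever Python's `str.isupper()` reports for a single character.

```python
def escape_case(name: str) -> str:
    """Escape a metadata value for a case-insensitive filesystem.

    Assumed scheme, following the design note: every uppercase character
    is prefixed with '_', and a literal '_' is doubled so it cannot be
    mistaken for an escape prefix.
    """
    out = []
    for ch in name:
        if ch == "_":
            out.append("__")
        elif ch.isupper():
            out.append("_" + ch)
        else:
            out.append(ch)
    return "".join(out)


def unescape_case(escaped: str) -> str:
    """Invert escape_case: a '_' means 'take the next character literally'."""
    out = []
    chars = iter(escaped)
    for ch in chars:
        if ch == "_":
            try:
                out.append(next(chars))
            except StopIteration:
                raise ValueError("dangling '_' at end of escaped name") from None
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for value in ["Alice", "alice", "snake_case"]:
        escaped = escape_case(value)
        # Round-tripping should give back the original value.
        print(value, "->", escaped, "->", unescape_case(escaped))
```

With this scheme, two values that differ only in case (such as `Alice` and `alice`) escape to names that still differ after case folding, which is the property the design note is after. Whether the extra underscores are acceptable in directory names is the trade-off the note itself calls ugly.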