[[!toc]]

# metadata

Attach an arbitrary set of metadata to a key. This consists of any number
of fields. Each field has an unordered set of values. The special field
"tag" has as its values any tags that are set for the key.

Store in git-annex branch, next to location log files.

Storage needs to support union merging, including removing an old value
of a field, and adding a new value of a field.
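
As a rough illustration of the data model described above, here is a minimal Python sketch. The helper names (`add_value`, `remove_value`, `tags`) are hypothetical and not part of git-annex; the sketch only assumes that metadata is a mapping from field name to an unordered set of values, with `tag` as an ordinary field.

```python
# Hypothetical in-memory model: field name -> unordered set of values.
# The special field "tag" simply holds the tags set for the key.

def add_value(meta, field, value):
    """Add one value to a field, creating the field if needed."""
    meta.setdefault(field, set()).add(value)

def remove_value(meta, field, value):
    """Remove one value from a field; drop the field once it is empty."""
    values = meta.get(field)
    if values is None:
        return
    values.discard(value)
    if not values:
        del meta[field]

def tags(meta):
    """Return a copy of the values of the special "tag" field."""
    return set(meta.get("tag", set()))
```

For example, after adding the tags `haskell` and `talk` and then removing `talk`, `tags(meta)` contains only `haskell`.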

## automatically added metadata

git annex add should automatically attach the current mtime of a file
when adding it.

Could also automatically attach permissions.

A git hook could be run by git annex add to gather more metadata.
For example, by examining MP3 metadata.

Also auto adds metadata when adding files to filter branches. See below.

## derived metadata

From the mtime, some additional
metadata is derived, at least year=yyyy and probably also month, etc.

This is probably not stored anywhere. It's computed on demand by a pure
function from the other metadata.

Should be a general mechanism for this. (It probably generalizes to
sql queries if we want to go that far.)
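
A sketch of what such a pure function might look like, in Python. It assumes (this is not specified above) that the file's timestamp is stored in a field called `mtime` as seconds since the epoch, and that dates are interpreted in UTC. Nothing is stored; the derived fields are recomputed each time.

```python
from datetime import datetime, timezone

def derived_metadata(meta):
    """Compute derived fields from stored metadata without storing anything.

    Assumes a hypothetical "mtime" field holding seconds since the epoch.
    """
    derived = {}
    for raw in meta.get("mtime", ()):
        when = datetime.fromtimestamp(float(raw), tz=timezone.utc)
        derived.setdefault("year", set()).add(f"{when.year:04d}")
        derived.setdefault("month", set()).add(f"{when.month:02d}")
    return derived
```

A more general mechanism could register several such functions and merge their results on demand.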

# filtered branches

`git annex view year=2014 talk` should create a new branch
view/year=2014/talk containing only files tagged with that, and
have git check it out. In this example, all files appear in top level
directory of repo; no subdirs.

`git annex vadd haskell` switches to branch
view/year=2014/talk/haskell with only the haskell talks.

`git annex vadd year=2013 year=2012` switches to branch
view/year=2012,2013,2014/talk/haskell. This has subdirectories 2012,
2013 and 2014 with the matching talks.

Patterns can be used in both the values of fields, and in matching tags.
So, `year=20*` could be used to match years, and `foo/*` matches any
tag in the foo namespace. Or even `*` to match *all* tags.
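
The pattern matching could be sketched with shell-style globs. This Python example uses the standard `fnmatch` module as a stand-in; the glob syntax git-annex would actually accept is not specified here, and `matching_values` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def matching_values(meta, field, pattern):
    """Return the values of a field that match a shell-style glob pattern."""
    return {v for v in meta.get(field, set()) if fnmatchcase(v, pattern)}

meta = {
    "year": {"1999", "2014"},
    "tag": {"foo/bar", "foo/baz", "haskell"},
}
years = matching_values(meta, "year", "20*")      # years starting with 20
foo_tags = matching_values(meta, "tag", "foo/*")  # tags in the foo namespace
all_tags = matching_values(meta, "tag", "*")      # every tag
```

Note that `fnmatch` also gives `?` and `[` special meaning, which may or may not be wanted.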

`git annex vrm haskell` switches to
view/year=2012,2013,2014/talk, which has all available talks in it.

`git annex vadd conference=fosdem conference=icfp` switches to branch
view/year=2012,2013,2014/talk/conference=fosdem,icfp. Now there
are nested subdirectories. They follow the format of the branch,
so 2013/icfp, 2014/fosdem, etc.

`git annex view tag=haskell,debian` yields a branch with haskell
and debian subdirectories.

To see all tags as subdirectories, use `git annex view tag=*`.

Files not matching the view can be included, by using 
`git annex view --unmatched=other`. That puts all such files into
the subdirectory other.
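
The directory layout described above can be sketched as follows. This is an interpretation, not a specification: it assumes a view is an ordered list of `(field, patterns)` pairs, that a component with a single literal value only filters, and that a component with several values or a glob adds one directory level. The function name `view_dirs` is hypothetical.

```python
from fnmatch import fnmatchcase
from itertools import product

def is_multi(patterns):
    """A component adds a directory level if it can match several values."""
    return len(patterns) > 1 or any(ch in p for p in patterns for ch in "*?[")

def view_dirs(view, meta, unmatched=None):
    """Return the directories (relative to the top) a file appears in.

    An empty string means the top-level directory. A file that does not
    match the view goes into the `unmatched` directory if one is given,
    and is otherwise left out.
    """
    levels = []
    for field, patterns in view:
        hits = sorted(
            v for v in meta.get(field, set())
            if any(fnmatchcase(v, p) for p in patterns)
        )
        if not hits:
            return [unmatched] if unmatched is not None else []
        if is_multi(patterns):
            levels.append(hits)
    return ["/".join(combo) for combo in product(*levels)]
```

With the view `year=2012,2013,2014 talk conference=fosdem,icfp`, a file with `year=2013`, tag `talk` and `conference=icfp` lands in `2013/icfp`. A file with several matching values for one field appears once per combination, which is where duplicate copies come from.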

Note that old filter branches can be deleted when switching to a new one.
There is no need to retain them, unless the user has committed non-annexed
files to them, in which case, urk. The only reason to use specially named
filtered branches is that the name documents how the repository
is currently filtered.

## operations while on filtered branch

* If files are removed and git commit called, git-annex should remove the
  relevant metadata from the files. **possibly** It's not clear that
  removing a file should nuke all the metadata used to filter it into the
  branch (especially if it's derived metadata like the year).
  Also, this is not usable in direct mode because deleting the
  file.. actually deletes it.
* If a file is moved into a new subdirectory while in a filter branch,
  a tag is added with the subdir name. This allows on the fly tagging.
* `git annex sync` should avoid pushing out the filter branch, but
  it should check if there are changes to the metadata pulled in, and update
  the branch to reflect them.
* If `git annex add` adds a file, it gets all the metadata of the filter
  branch it's added to. If it's in a relevant directory (like fosdem-2014),
  it gets that metadata automatically recorded as well.
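
One way the add and move cases above might work is to read metadata back out of a file's path within the view. The Python sketch below is only an assumption about how that mapping could go: single-valued view components apply directly, multi-valued components are read from the matching directory level, and any deeper directories become tags. The name `metadata_for_path` is hypothetical, and the `--unmatched` directory is not handled.

```python
def is_multi(patterns):
    """A component has its own directory level if it can match several values."""
    return len(patterns) > 1 or any(ch in p for p in patterns for ch in "*?[")

def metadata_for_path(view, path):
    """Derive the metadata implied by where a file sits in a view."""
    dirs = path.split("/")[:-1]  # directory components, outermost first
    meta = {}
    for field, patterns in view:
        if not is_multi(patterns):
            # Single literal value: every file in the view has it.
            meta.setdefault(field, set()).add(patterns[0])
        elif dirs:
            # Multi-valued component: the value is the directory name.
            meta.setdefault(field, set()).add(dirs.pop(0))
    for extra in dirs:
        # Directories below the view's own levels act as on-the-fly tags.
        meta.setdefault("tag", set()).add(extra)
    return meta
```

For the view `year=2012,2013,2014 talk conference=fosdem,icfp`, the path `2014/fosdem/keynote/intro.ogg` would yield `year=2014`, `conference=fosdem`, and the tags `talk` and `keynote`.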

# other uses for metadata

Uses are not limited to filter branches.

`git annex checkoutmeta year=2014 talk` in a subdir of master could create the
same tree of files that `git annex view` would. The user can then commit that if desired.
Or, they could run additional commands like `git annex vadd` to refine the
tree of files in the subdir.

Metadata can be used for configuring numcopies. One way would be a
numcopies=n value attached to a file. But perhaps better would be to make
the numcopies.log allow configuring numcopies based on which files have
other metadata.

Other programs could query git-annex for the metadata of files in the work
tree, and do whatever they want with it.

# filenames

The hard part of this is actually getting a useful filename to put in the
filter branch, since git-annex only has a key which the user will not
want to see.

* Could use filename metadata for the key, recorded by git-annex add (which
  may not correspond to filenames being used in regular git branches like
  master for the key).
* Could use the .map files to get a filename, but this is somewhat
  arbitrary (.map can contain multiple filenames), and is only
  currently supported in direct mode.
* Have a reference branch (eg master) and walk it to find filenames and
  keys. Fine as long as it can be done efficiently. Also allows including
  the subdirectory a file is in, potentially. cwebber points out that this
  is essentially a form of tracking branch. Which implies it will need to
  be updatable when the reference branch changes. Should be doable via
  diff-tree.

Note that any of these filenames can in theory conflict. May need to use
`.variant-*` like sync does on conflict to allow 2 files with same name in
same filtered branch.
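
Conflict handling could be sketched like this in Python. The exact `.variant-*` naming that sync uses is not reproduced here; the suffix below (first four hex digits of a SHA-1 of the key) is an assumption chosen for illustration, and `view_filenames` is a hypothetical helper.

```python
import hashlib
import os.path

def view_filenames(entries):
    """Map each key to a filename, renaming only when names collide.

    entries: iterable of (key, wanted_name) pairs for one directory.
    Colliding names get a ".variant-XXXX" part inserted before the
    extension, derived from the key so the result is stable.
    """
    by_name = {}
    for key, name in entries:
        keys = by_name.setdefault(name, [])
        if key not in keys:
            keys.append(key)
    result = {}
    for name, keys in by_name.items():
        if len(keys) == 1:
            result[keys[0]] = name
            continue
        base, ext = os.path.splitext(name)
        for key in keys:
            suffix = hashlib.sha1(key.encode()).hexdigest()[:4]
            result[key] = f"{base}.variant-{suffix}{ext}"
    return result
```

A four-digit suffix can itself collide in rare cases, so a real implementation would need to check for that.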

## union merge properties

While the storage could just list all the current values of a field on a
line with a timestamp, that's not good enough. Two disconnected
repositories can make changes to the values of a field (setting and
unsetting tags for example) and when this is union merged back together,
the changes need to be able to be replayed in order to determine which
values we end up with. 

To make that work, we log not only when a field is set to a value, 
but when a value is unset as well.

For example, here two different remotes added tags, and then later
a tag was removed:

	1287290776.765152s tag +foo +bar
	1287290991.152124s tag +baz
	1291237510.141453s tag -bar
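
Replaying such a log might look like the following Python sketch. It assumes the line format shown above (timestamp with a trailing `s`, field name, then `+value` or `-value` entries separated by whitespace) and does not handle values containing spaces. Because lines are sorted by timestamp before replay, the order in which a union merge concatenated them does not matter, and duplicated lines are harmless.

```python
def parse_line(line):
    """Split a log line into (timestamp, field, list of +/- changes)."""
    stamp, field, *changes = line.split()
    return float(stamp.rstrip("s")), field, changes

def current_values(log_lines):
    """Replay set (+) and unset (-) entries in timestamp order."""
    meta = {}
    entries = sorted(parse_line(l) for l in log_lines if l.strip())
    for _when, field, changes in entries:
        values = meta.setdefault(field, set())
        for change in changes:
            if change.startswith("+"):
                values.add(change[1:])
            elif change.startswith("-"):
                values.discard(change[1:])
    # Fields whose values were all unset are dropped.
    return {field: values for field, values in meta.items() if values}

# The example log, in the arbitrary order a union merge might produce.
merged = [
    "1291237510.141453s tag -bar",
    "1287290776.765152s tag +foo +bar",
    "1287290991.152124s tag +baz",
    "1287290776.765152s tag +foo +bar",
]
result = current_values(merged)
```

Here `result` holds the tags `foo` and `baz`; `bar` was set first and unset later, so it is gone.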

# efficient metadata lookup

Looking up metadata for filtering so far requires traversing all keys in
the git-annex branch. This is slow. A fast cache is needed.

# direct mode issues

Checking out a filter branch can result in any number of copies of a file
appearing in different directories. No problem in indirect mode, but
in direct mode these are real, expensive copies.

But, it's worth supporting direct mode!

So, possible approaches:

* Before checking out a filter branch, calculate how much space will
  be used by duplicates and refuse if not enough is free.
* Only check out one file, and omit the copies. Keep track of which
  files were omitted, and make sure that when committing on the branch,
  that metadata is not removed. Has the downside that files can seem
  to randomly move around in the tree as their metadata changes.
* Disallow filter branch checkouts that have duplicate files.
  This would cripple it some, but perhaps not too badly?
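
The first approach could be sketched as below. The shape of the checkout list (path, key, size in bytes) and the function names are assumptions for illustration; `shutil.disk_usage` is the standard Python call for querying free space.

```python
import shutil
from collections import Counter

def extra_space_needed(checkout):
    """Bytes used by duplicate copies in a direct mode checkout.

    checkout: iterable of (path, key, size_in_bytes), one entry per file
    that would appear in the view. The first copy of each key is already
    present, so only the additional copies count.
    """
    copies = Counter()
    sizes = {}
    for _path, key, size in checkout:
        copies[key] += 1
        sizes[key] = size
    return sum((n - 1) * sizes[key] for key, n in copies.items())

def can_check_out(checkout, repo_path=".", reserve=0):
    """Refuse the checkout when the duplicates would not fit on disk."""
    free = shutil.disk_usage(repo_path).free
    return extra_space_needed(checkout) + reserve <= free
```

The same duplicate count could also drive the third approach, by refusing whenever `extra_space_needed` is nonzero.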

# gotchas

* Checking out a filter branch can remove the current subdir. May be worth
  detecting when this happens and leaving behind an empty directory so the
  user can navigate back up.

* Git has a complex set of rules for what is legal in a ref name.
  Filter branch names will need to filter out any illegal stuff.

* Filesystems that are not case sensitive (including case-preserving OS X)
  will cause problems if filter branches try to use different cases for 
  2 directories representing the value of some metadata. But, users
  probably want at least case-preserving metadata values. 
  
  Solution might be to compare metadata case-insensitively, and
  pick one representation consistently, so if, for example, an author
  field uses mixed case, it will be used in the filter branch.

  Alternatively, it could escape `A` to `_A` when such a filesystem
  is detected and avoid collisions that way (double `_` to escape it).
  This latter option is ugly, but so are non-posix filesystems.. and it
  also solves any similar issues with case-colliding filenames.
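
Two of the gotchas above can be sketched in Python. The first function replaces characters and sequences that `git check-ref-format` rejects within one slash-separated component of a branch name; replacing them with `_` is an arbitrary choice, and it is lossy, so distinct views could map to the same name. The other two functions implement the `_A` escaping idea, with `__` standing for a literal underscore. All names are hypothetical.

```python
import re

# Characters git forbids in ref names (control chars, space, ~ ^ : ? * [ \),
# plus "/" since this handles a single component.
_FORBIDDEN = re.compile(r"[\x00-\x20\x7f~^:?*\[\\/]")

def sanitize_ref_component(text):
    """Make one component of a branch name acceptable to git (lossy)."""
    text = _FORBIDDEN.sub("_", text)
    text = text.replace("..", "_").replace("@{", "_")
    if text.startswith("."):
        text = "_" + text[1:]
    if text.endswith(".lock"):
        text = text[:-len(".lock")] + "_lock"
    if text.endswith("."):
        text = text[:-1] + "_"
    return text or "_"

def escape_case(name):
    """Escape uppercase letters as _X and underscores as __."""
    out = []
    for ch in name:
        if ch == "_":
            out.append("__")
        elif ch.isupper():
            out.append("_" + ch)
        else:
            out.append(ch)
    return "".join(out)

def unescape_case(name):
    """Invert escape_case: an underscore quotes the character after it."""
    out = []
    i = 0
    while i < len(name):
        if name[i] == "_" and i + 1 < len(name):
            out.append(name[i + 1])
            i += 2
        else:
            out.append(name[i])
            i += 1
    return "".join(out)
```

With this escaping, `Foo` and `foo` become `_Foo` and `foo`, which no longer collide on a case-insensitive filesystem, and `unescape_case` recovers the original value.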