git-annex should use smudge/clean filters.

----

Update: Currently, this does not look likely to work. In particular,
the clean filter needs to consume all stdin from git, which consists of the
entire content of the file. It cannot optimise by directly accessing
the file in the repository, because git may be cleaning a different
version of the file during a merge. 

So every `git status` would need to read the entire content of all
available files, and checksum them, which is too expensive.

> Update from GitTogether: Peff thinks a new interface could be added to
> git to handle this sort of case in an efficient way; it just needs someone
> to do the work. --[[Joey]]

>> Update 2015: git status only calls the clean filter for files
>> that the index says are modified, so this is no longer a problem.
>> --[[Joey]]

----

The clean filter is run when files are staged for commit. So a user could copy
any file into the annex, git add it, and git-annex's clean filter causes
the file's key to be staged, while its value is added to the annex.

The smudge filter is run when files are checked out. Since git-annex
repos have partial content, the smudge filter would not run `git annex get`
to fetch the file content. Instead, if the content is not currently
available, it would need to do something like return empty file content.
(Sadly, it cannot create a symlink, as git still wants to write the file
afterwards.)

So the nice current behavior, where unavailable files are clearly missing
because their symlinks dangle, would be lost when using smudge/clean filters.
(Contact git developers to get an interface to do this?)

Instead, we get the nice behavior of not having to remember to `git annex
add` files, and just being able to use `git add` or `git commit -a`,
and have it use git-annex when .gitattributes says to. Also, annexed
files can be directly modified without having to `git annex unlock`.

### configuration

In .gitattributes, the user would put something like "* filter=git-annex".
This way they could control which files are annexed vs added normally.

It would also be good to allow using this without having to specify
the files in .gitattributes. Just use "* filter=git-annex" there, and then
let git-annex decide which files to annex and which to pass through the
smudge and clean filters as-is. The smudge filter can just read a little of
its input to see if it's a pointer to an annexed file. The clean filter
could apply annex.largefiles to decide whether to annex a file's content or
not.
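
The two decisions described above can be sketched roughly as follows. This is only an illustration: `looks_like_pointer` and `wants_annex` are hypothetical helper names, the key pattern is an assumption about what a key looks like, and the size threshold is a crude stand-in for whatever annex.largefiles would really express.

```shell
#!/bin/sh
# Sketch only: both helpers and the 100 kB default are hypothetical.
LARGE_BYTES=${LARGE_BYTES:-102400}   # stand-in for an annex.largefiles rule

# Smudge side: succeed if the first line on stdin looks like an annex key,
# e.g. "SHA256-s1048576--<hash>". Only one line is read here.
looks_like_pointer() {
	first=
	IFS= read -r first || [ -n "$first" ] || return 1
	case "$first" in
		*[[:space:]]*) return 1 ;;        # assumed: keys contain no whitespace
		[A-Z]*-s[0-9]*--?*) return 0 ;;   # assumed key shape
		*) return 1 ;;
	esac
}

# Clean side: succeed if FILE is big enough to go into the annex.
wants_annex() {
	[ "$(( $(wc -c <"$1") ))" -gt "$LARGE_BYTES" ]
}
```

A real filter would still have to pass the rest of its input through (or consume it) after sniffing the first line.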

For files not configured this way in .gitattributes, git-annex could
continue to use its symlink method -- this would preserve backwards
compatibility, and even allow mixing the two methods in a repo as desired.
(But not switching an existing repo between indirect and direct modes;
the user decides which mode to use when adding files to the repo.)

### clean

The trick is doing it efficiently. Since git a2b665d, v1.7.4.1,
something like this works to provide a filename to the clean script:

	git config --global filter.huge.clean huge-clean %f

This could avoid it needing to read all the current file content from stdin
when doing e.g. a `git status` or `git commit`. Instead it is passed the
filename that git is operating on, in the working directory.
(Update: No, doesn't work; git may be cleaning a different file content
than is currently on disk, and git requires all stdin be consumed too.)

So, WORM could just look at that file and easily tell if it is one
it already knows (same mtime and size). If so, it can short-circuit and
do nothing, since the file content is already cached.

SHA1 has a harder job. Would not want to re-sha1 the file every time,
probably. So it'd need a local cache of file stat info, mapped to known
objects.
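
A minimal sketch of such a stat cache, under several assumptions: the cache is a tab-separated text file at a made-up location, the "stat info" is just size and mtime, and `stat_sig`, `cached_key` and `remember_key` are hypothetical names, not git-annex commands.

```shell
#!/bin/sh
# Sketch only: cache location and format are hypothetical.
# Each cache line is: path <TAB> "size mtime" <TAB> key
CACHE=${CACHE:-.git/annex/statcache}

# Print "size mtime" for a file (GNU stat first, then BSD stat).
stat_sig() {
	stat -c '%s %Y' "$1" 2>/dev/null || stat -f '%z %m' "$1" 2>/dev/null
}

# Print the cached key if the file's size and mtime are unchanged;
# fail if the file is unknown or has changed since it was recorded.
cached_key() {
	[ -f "$CACHE" ] || return 1
	sig=$(stat_sig "$1") || return 1
	awk -F '\t' -v p="$1" -v s="$sig" '
		$1 == p && $2 == s { key = $3 }
		END { if (key == "") exit 1; print key }' "$CACHE"
}

# Record the key for the file's current size and mtime.
remember_key() {
	sig=$(stat_sig "$1") || return 1
	printf '%s\t%s\t%s\n' "$1" "$sig" "$2" >>"$CACHE"
}
```

Only when `cached_key` fails would the file need to be checksummed again, followed by a `remember_key` with the new key.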

But: Even with %f, git actually passes the full file content to the clean
filter, and if the filter fails to consume it all, git will crash (this may
only happen if the file is larger than some chunk size; I tried with a 500 MB
file and saw a SIGPIPE). This means unnecessary work needs to be done,
and it slows down *everything*, from `git status` to `git commit`.
**showstopper** I have sent a patch to the git mailing list to address
this. <http://marc.info/?l=git&m=131465033512157&w=2> (Update: apparently
can't be fixed.)

> Update: I tried this again (2015) and it seems that git status and git
> add avoid re-sending the file content to the clean filter, as long as the
> file stat has not changed. I'm not sure when git started doing that,
> but it seems to avoid this problem.
> --[[Joey]]

### smudge

The smudge script can also be provided a filename with %f, but it
cannot directly write to the file or git gets unhappy.

> Still the case in 2015. Means an unnecessary read and pipe of the file
> even if the content is already locally available on disk. --[[Joey]]

### partial checkouts

.. Are very important, otherwise a repo can't scale past the size of the
smallest client's disk!

It would be nice if the smudge filter could hard link or symlink a work
tree file to the annex object.

But currently, the smudge filter can't modify the work tree file on its own
-- git always modifies the file after getting the output of the smudge
filter, and will stumble over any modifications that the smudge filter
makes. And, it's important that the smudge filter never fail as that will
leave the repo in a bad state.

Seems the best that can be done is for the smudge filter to copy from the
annex object when the object is present. When it's not present, the smudge
filter should provide a pointer to its content.

The clean filter should detect when it's operating on that pointer file.

I've a demo implementation of this technique in the scripts below.

### deduplication

.. Is nice; needing 2 copies of every annexed file is annoying.

Unfortunately, when using smudge/clean, `git merge` does not preserve a
smudged file in the work tree when renaming it. It instead deletes the old
file and asks the smudge filter to smudge the new filename.

So, copies need to be maintained in .git/annex/objects, though it's ok
to use hard links to the work tree files.

Even if hard links are used, smudge needs to output the content of an
annexed file, which will result in duplication when merging in renames of
files.

### design

Goal: Get rid of current direct mode, using smudge/clean filters instead to
cover the same use cases, more flexibly and robustly.

Use case 1:

A user wants to be able to edit files, and git-add, git commit,
without needing to worry about using git-annex to unlock files, add files,
etc.

Use case 2:

Using git-annex on a crippled filesystem that does not support symlinks.

Data:

* An annex pointer file has as its first line the git-annex key
  that it's standing in for. Subsequent lines of the file might
  be a message saying that the file's content is not currently available.
  An annex pointer file is checked into the git repository the same way
  that an annex symlink is checked in.
* file2key maps are maintained by git-annex, to keep track of
  what files are pointers at keys.

Configuration: 

* .gitattributes tells git which files to use git-annex's smudge/clean
  filters with. Typically, all files except for dotfiles:

	* filter=annex
	.* !filter

* annex.largefiles tells git-annex which files should in fact be put in 
  the annex. Other files are passed through the smudge/clean as-is and
  have their contents stored in git.

* annex.direct is repurposed to configure how the assistant adds files.
  When set to true, they're added unlocked.

git-annex clean:

* Run by `git add` (and diff and status, etc), and passed the
  filename, as well as fed the file content on stdin.

  Look at configuration to decide if this file's content belongs in the
  annex. If not, output the file content to stdout.

  Generate annex key from filename and content from stdin.

  Hard link the file into .git/annex/objects, if the object doesn't already exist.
  (On platforms not supporting hardlinks, copy the file to
  .git/annex/objects.)

  This is done to prevent losing the only copy of a file when eg
  doing a git checkout of a different branch, or merging a commit that
  renames or deletes a file. But, no attempt is made to 
  protect the object from being modified. If a user wants to
  protect object contents from modification, they should use
  `git annex add`, not `git add`, or they can `git annex lock` after adding.

  There could be a configuration knob to cause a copy to be made to
  .git/annex/objects -- useful for those crippled filesystems. It might
  also drop that copy once the object gets uploaded to another repo ...
  But that gets complicated quickly.

  Update file2key map.

  Output the pointer file content to stdout.
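
The core of the clean steps above might look something like this. It is a sketch under stated assumptions, not the real implementation: keys are assumed to be `SHA256-s<size>--<sha256>`, objects are kept in a flat `$OBJECTS` directory rather than git-annex's real hashed layout, `sha256sum` (GNU coreutils) is assumed to be installed, and `clean_file` is a hypothetical name. The annex.largefiles check, detection of pointer input, and the file2key update are left out.

```shell
#!/bin/sh
# Sketch only: key format, flat object layout and function name are assumptions.
OBJECTS=${OBJECTS:-.git/annex/objects}

# clean_file FILE: FILE is the %f argument; the content arrives on stdin.
clean_file() {
	tmp=$(mktemp) || return 1
	cat >"$tmp"                      # git requires all of stdin to be consumed
	size=$(( $(wc -c <"$tmp") ))
	hash=$(sha256sum <"$tmp" | cut -d ' ' -f 1)
	key="SHA256-s$size--$hash"
	mkdir -p "$OBJECTS"
	if [ ! -e "$OBJECTS/$key" ]; then
		# Hard link the work tree file, but only when it holds the same
		# bytes git handed us; otherwise keep the copy read from stdin.
		if cmp -s "$tmp" "$1" && ln "$1" "$OBJECTS/$key" 2>/dev/null; then
			:
		else
			cp "$tmp" "$OBJECTS/$key"
		fi
	fi
	rm -f "$tmp"
	printf '%s\n' "$key"             # the pointer content that git stores
}
```

The `cmp` reads the work tree file a second time, which is exactly the kind of cost a real implementation would want to avoid; it is there because git may be cleaning content that differs from what is on disk.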

git-annex smudge:

* Run by eg `git checkout` and passed the filename, as well as fed
  the pointer file content on stdin.

  Updates file2key map.

  When an object is present in the annex, outputs its content to stdout.
  Otherwise, outputs the file pointer content.
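
A matching sketch of the smudge side, with the same caveats: the flat `$OBJECTS` layout and the name `smudge_file` are assumptions made for illustration, and the file2key update is omitted.

```shell
#!/bin/sh
# Sketch only: flat object layout and function name are assumptions.
OBJECTS=${OBJECTS:-.git/annex/objects}

# smudge_file: pointer content arrives on stdin; file content goes to stdout.
smudge_file() {
	tmp=$(mktemp) || return 1
	cat >"$tmp"                      # consume all input
	key=$(head -n 1 "$tmp")          # first line of a pointer is the key
	case "$key" in
		*/*) key= ;;                 # never treat a path as a key
	esac
	if [ -n "$key" ] && [ -f "$OBJECTS/$key" ]; then
		cat "$OBJECTS/$key"          # content present: possibly expensive copy
	else
		cat "$tmp"                   # content missing: pass the pointer through
	fi
	rm -f "$tmp"
	return 0                         # the smudge filter should never fail
}
```

Input that is not a pointer at all falls into the second branch and is passed through unchanged.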

git annex direct/indirect:

  Previously these commands switched in and out of direct mode.
  Now they become no-ops.

git annex lock/unlock:

  Makes sense for these to change to switch files between using
  git-annex symlinks and pointers. So, this provides both a way to
  transition repositories to using pointers, and a cleaner unlock/lock
  for repos using symlinks.

  unlock will stage a pointer file, and will copy the content of the object
  out of .git/annex/objects to the work tree file. (Might want a --hardlink
  switch.)
  
  lock will replace the current work tree file with the symlink, and stage it.
  Note that multiple work tree files could point to the same object.
  So, if the link count is > 1, replace the annex object with a copy of
  itself to break such a hard link. Always finish by locking down the
  permissions of the annex object.
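
The hard link handling in lock could be sketched like this. `link_count` and `lock_object` are hypothetical names, and replacing the work tree file with the symlink and staging it are not shown.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Print the hard link count of a file (GNU stat first, then BSD stat).
link_count() {
	stat -c '%h' "$1" 2>/dev/null || stat -f '%l' "$1" 2>/dev/null
}

# lock_object OBJECT: make sure the annex object shares its inode with no
# other file, then remove write permission from it.
lock_object() {
	[ -f "$1" ] || return 1
	if [ "$(link_count "$1")" -gt 1 ]; then
		# Replace the object with a copy of itself; any work tree file
		# that was hard linked to it keeps the old inode.
		cp -p "$1" "$1.tmp" && mv -f "$1.tmp" "$1" || return 1
	fi
	chmod a-w "$1"
}
```

After this, editing a work tree file that used to be hard linked to the object can no longer change the object's content.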

All other git-annex commands that look at annex symlinks to get keys will
need to fall back to checking if a given work tree file is stored in git as
a pointer file. This can be done by checking the file2key map (or by looking
it up in the index).
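
A rough sketch of that fallback, using the index rather than a file2key map. `key_of` is a hypothetical helper; it assumes a locked file is a symlink whose target ends in the key, and that an unlocked file's staged blob is a pointer whose first line is the key. It does not verify that the staged blob really is a pointer, which a real implementation would have to do.

```shell
#!/bin/sh
# Sketch only: key_of is a hypothetical helper, not a git-annex command.

# key_of FILE: print the key for a work tree file given relative to the
# current directory.
key_of() {
	if [ -L "$1" ]; then
		# Locked file: the symlink target ends in the key.
		target=$(readlink "$1") || return 1
		case "$target" in
			*/annex/objects/*) basename "$target" ;;
			*) return 1 ;;
		esac
	else
		# Unlocked file: read the staged blob from the index and take its
		# first line (this loads the whole blob, fine for small pointers).
		pointer=$(git cat-file blob ":./$1" 2>/dev/null) || return 1
		printf '%s\n' "$pointer" | head -n 1
	fi
}
```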

Note that I have not verified if file2key maps can be maintained
consistently using the smudge/clean filters. Seems likely to work,
based on when I see smudge/clean filters being run. The file2key
optimisation may not be needed though; looking at the index
might be fast enough.

#### Upgrading

annex.version changes to 6

Upgrade should be handled automatically.

On upgrade, update .gitattributes with a stock configuration, unless
it already mentions "filter=annex".

Upgrading a direct mode repo needs to switch it out of bare mode, and
needs to run `git annex unlock` on all files (or reach the same result).
So will need to stage changes to all annexed files.

When a repo has some clones indirect and some direct, the upgraded repo
will have all files unlocked, necessarily in all clones.

----

### test files

huge-smudge:

<pre>
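#!/bin/sh
# The pointer written by huge-clean arrives on stdin; its first line
# names the stored content.
read -r f
file="$1"       # %f: the work tree path git is checking out
cat >/dev/null  # consume any remaining input
echo "smudging $file ($f)" >&2
if [ -e ~/"$f" ]; then
	cat ~/"$f" # possibly expensive copy here
else
	echo "$f not available"
fi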
</pre>

huge-clean:

<pre>
#!/bin/sh
file="$1"
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT
# git requires the filter to consume all of stdin.
# in real life, this should be done more efficiently, not trying to read
# the whole file content!
cat >"$tmp"
if grep -q ' not available$' "$tmp"; then
	# input is the placeholder from huge-smudge;
	# provide what we would if the content were avail!
	sed 's/ not available$//' "$tmp"
	exit 0
fi
echo "cleaning $file" >&2
# XXX store file content here
printf '%s\n' "$file"

.gitattributes:

<pre>
*.huge filter=huge
</pre>

in .git/config:

<pre>
[filter "huge"]
        clean = huge-clean %f
        smudge = huge-smudge %f
</pre>