doc/todo/smudge.mdwn


git-annex should use smudge/clean filters.

The trick is doing it efficiently. Since git a2b665d, 2011-01-05,
something like this works to provide a filename to the clean script:

	git config --global filter.huge.clean huge-clean %f

This avoids it needing to read all the current file content from stdin
when doing, e.g., a git status or git commit. Instead it is passed the
name of the file that git is operating on; I think that path refers to the
file in the working directory.

So, WORM could just look at that file and easily tell if it is one
it already knows (same mtime and size). If so, it can short-circuit and
do nothing, since the file content is already cached.
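
A minimal sketch of that WORM short-circuit, to make the idea concrete. Everything here is an assumption for illustration: the `worm_key` and `worm_cached` helpers, the `WORM-mtime-size` key format, and the `$store` directory are made up and are not git-annex's real layout. It also assumes GNU coreutils `stat`.

```sh
#!/bin/sh
# Hypothetical store directory holding one object per key.
store=${store:-$HOME/.worm-store}

# Derive a key from the file's mtime and size only; the content is not read.
# Assumes GNU stat (-c); BSD stat takes different flags.
worm_key() {
	stat -c 'WORM-%Y-%s' -- "$1"
}

# Succeed if an object for this file's key is already in the store,
# in which case the clean filter would have nothing more to do.
worm_cached() {
	[ -e "$store/$(worm_key "$1")" ]
}
```

A clean script invoked with %f could then call `worm_cached "$1"` first, and only fall back to copying the content when that fails.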

SHA1 has a harder job. It would probably not want to re-sha1 the file
every time, so it would need a cache of file stat info, mapped to known
objects.
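
One way such a stat cache might look, as a rough sketch rather than a design. The `cached_sha1` function, the `$cache` directory, and the one-file-per-path entry format are all hypothetical; GNU coreutils (`stat -c`, `sha1sum`) is assumed.

```sh
#!/bin/sh
# Hypothetical cache directory: one small file per path, holding
# "<mtime> <size> <sha1>".
cache=${cache:-$HOME/.huge-stat-cache}

# Print the SHA1 of a file, re-hashing only when its mtime or size changed.
cached_sha1() {
	file=$1
	statinfo=$(stat -c '%Y %s' -- "$file") || return 1
	# Name the cache entry after a hash of the path, so any filename is safe.
	entry=$cache/$(printf '%s' "$file" | sha1sum | cut -d' ' -f1)
	if [ -r "$entry" ]; then
		read -r mtime size sha1 <"$entry"
		if [ "$mtime $size" = "$statinfo" ]; then
			# Stat info matches: trust the cached checksum.
			echo "$sha1"
			return 0
		fi
	fi
	# Cache miss or stale entry: hash the content and remember the result.
	sha1=$(sha1sum <"$file" | cut -d' ' -f1)
	mkdir -p "$cache"
	echo "$statinfo $sha1" >"$entry"
	echo "$sha1"
}
```

Like WORM, this trusts mtime and size, so a file modified without changing either would keep its old checksum; that is the price of not re-hashing.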

### dealing with partial content availability

The smudge filter cannot be allowed to fail, since that leaves the tree and
index in a weird state. So if a file's content is requested by calling
the smudge filter and it is not available, the trick is to instead provide
dummy content, indicating it is not available (and perhaps saying to run
"git-annex get").

Then the clean filter has to detect that it's cleaning a file
with that dummy content, and make sure to provide the same identifier as
it would if the file content were there.

I've a demo implementation of this technique in the scripts below.

----

It may further be possible to use %f with the smudge filter
(the docs say it's supported). Instead of outputting the dummy content,
it could then create a dangling symlink, which would be more like
git-annex's current behavior, and would make it easy to tell what content
is not available with `ls`.

### test files

huge-smudge:

<pre>
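#!/bin/sh
# Read the identifier stored in git; output the real content if we have it.
read -r sha1
echo "smudging $sha1" >&2
if [ -e "$HOME/$sha1" ]; then
	cat "$HOME/$sha1"
else
	# Never fail: emit dummy content that huge-clean can recognise.
	echo "$sha1 not available"
fi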
</pre>

huge-clean:

<pre>
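#!/bin/sh
# Save the content from stdin so it can be inspected and hashed.
cat >temp
if grep -q 'not available' temp; then
	awk '{print $1}' temp # provide what we would if the content were avail!
	rm temp
	exit 0
fi
sha1=$(sha1sum temp | cut -d' ' -f1)
echo "cleaning $sha1" >&2
ls -l temp >&2
# Store the content under its checksum; git only sees the checksum.
mv temp "$HOME/$sha1"
echo "$sha1"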
</pre>

.gitattributes:

<pre>
*.huge filter=huge
</pre>

in .git/config:

<pre>
[filter "huge"]
        clean = huge-clean
        smudge = huge-smudge
</pre>