Most of git-annex is designed to be fast no matter how many other files are
in the annex. Things like add/get/drop/move/fsck have good locality;
they will only operate on as many files as you need them to.
(git commit can get a little slow with a great many files,
but that's out of scope -- and recent git-annex versions use queuing
to keep git add from piling up too much in the index.)

But currently two git-annex commands are quite slow when an annex grows to
contain a large number of files: unused and status.
(Both have --fast versions that don't do as much.)

Both are slow because they need two pieces of information that are not
quick to look up, and that require examining the whole repo, very seekily
(both scans are sketched after the list):

1. The keys present in the annex. Found by looking thru .git/annex/objects.
2. The keys referenced by files in git. Found by listing every file
   in git and looking at its symlink.

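Concretely, the two scans amount to something like the sketch below. This is
only a rough Python illustration of why both are seek-heavy, not git-annex's
actual (Haskell) code, and it assumes annexed files are symlinks pointing into
.git/annex/objects:

```python
import os
import subprocess

def keys_in_annex(repo="."):
    """Scan 1: walk .git/annex/objects; each content file is named after its key."""
    objects = os.path.join(repo, ".git", "annex", "objects")
    keys = set()
    for dirpath, dirnames, filenames in os.walk(objects):
        keys.update(filenames)
    return keys

def keys_referenced_in_git(repo="."):
    """Scan 2: list every file git knows about, and read the symlinks of
    annexed files to see which keys they reference."""
    out = subprocess.run(["git", "ls-files", "-z"], cwd=repo,
                         capture_output=True, check=True)
    keys = set()
    for path in out.stdout.decode().split("\0"):
        if not path:
            continue
        full = os.path.join(repo, path)
        if os.path.islink(full):
            target = os.readlink(full)
            if "annex/objects" in target:
                keys.add(os.path.basename(target))
    return keys

# unused is roughly "present but not referenced":
# print(keys_in_annex() - keys_referenced_in_git())
```
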
Of these, the first is less expensive (typically, an annex does not have every
key in it). It could be optimised fairly simply, by adding a database
of the keys present in the annex that is optimised for listing them all. The
database would be updated by the few functions that move content in and
out.

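As a sketch of the idea, that present-key database could be a single table
that is cheap to enumerate; the path and schema here are made up for
illustration, not anything git-annex actually uses:

```python
import sqlite3

def open_key_db(path=".git/annex/present.db"):
    """Open (or create) the hypothetical cache of locally present keys."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS present (key TEXT PRIMARY KEY)")
    return db

def content_added(db, key):
    """Hook for whatever moves content into the annex."""
    db.execute("INSERT OR IGNORE INTO present VALUES (?)", (key,))
    db.commit()

def content_dropped(db, key):
    """Hook for whatever moves content out of the annex."""
    db.execute("DELETE FROM present WHERE key = ?", (key,))
    db.commit()

def all_present_keys(db):
    """Listing the keys is now one sequential table scan rather than a
    seeky walk of .git/annex/objects."""
    return [row[0] for row in db.execute("SELECT key FROM present")]
```
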
The second is harder to optimise, because the user can delete, revert,
copy, add, etc. files in git at will, and git-annex does not have a good way
to watch that and maintain a database of which keys are referenced.
It could use a post-commit hook and examine the files changed by each commit,
etc. But then staged files would be left out. It might be sufficient to
make --fast trust the database... except that unused will suggest *deleting*
data if nothing references it. Or maybe a clean tree with nothing staged
could be required before running git-annex unused.

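To make those last two ideas concrete: a post-commit hook could fold the files
changed by each commit into the reference database, and unused could refuse to
trust that database unless the tree is clean. Another hedged sketch; the
helper names are invented for illustration, and only plain git status and
git diff-tree are assumed:

```python
import subprocess

def tree_is_clean(repo="."):
    """Don't trust the cached reference database unless nothing is staged
    or modified, since such files would be missing from it."""
    out = subprocess.run(["git", "status", "--porcelain"], cwd=repo,
                         capture_output=True, check=True)
    return out.stdout.strip() == b""

def files_changed_by_head(repo="."):
    """What a post-commit hook would look at: the files touched by the new
    commit, whose symlink targets could then be folded into the database."""
    out = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"],
        cwd=repo, capture_output=True, check=True)
    return [p for p in out.stdout.decode().splitlines() if p]
```
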
Anyway, this is a semi-longterm item for me. --[[Joey]]