author | Joey Hess <joey@kitenet.net> | 2011-04-07 14:45:10 -0400 |
---|---|---|
committer | Joey Hess <joey@kitenet.net> | 2011-04-07 14:45:10 -0400 |
commit | 4ea0b7c28850eb703562cd9dc84a02c49b5fda00 (patch) | |
tree | 955e2a91578328fafd235225bd7d5d8a82727987 /doc | |
parent | 6fa86946d5c7a35aa73ee342f8388398f0f0fbd5 (diff) |
add
Diffstat (limited to 'doc')
-rw-r--r-- | doc/todo/git-annex_unused_eats_memory.mdwn | 25 |
1 files changed, 25 insertions, 0 deletions
diff --git a/doc/todo/git-annex_unused_eats_memory.mdwn b/doc/todo/git-annex_unused_eats_memory.mdwn
new file mode 100644
index 000000000..6ce714004
--- /dev/null
+++ b/doc/todo/git-annex_unused_eats_memory.mdwn
@@ -0,0 +1,25 @@
+`git-annex unused` has to compare large sets of data
+(all keys with content present in the repository,
+with all keys used by files in the repository), and so
+uses more memory than git-annex typically needs; around
+60-80 MB when run in a repository with 80 thousand files.
+
+I would like to reduce this. One idea is to use a bloom filter.
+For example, construct a bloom filter of all keys used by files in
+the repository. Then for each key with content present, check if it's
+in the bloom filter. Since there can be false positives, this might
+miss finding some unused keys. The probability/size of filter
+could be tunable.
+
+Another way might be to scan the git log for files that got removed
+or changed what key they pointed to. Correlate with keys with content
+currently present in the repository (possibly using a bloom filter again),
+and that would yield a shortlist of keys that are probably not used.
+Then scan through all files in the repo to make sure that none point to keys
+on the shortlist.
+
+----
+
+`git annex unused --from remote` is much worse, using hundreds of MB of
+memory. It has not been profiled at all yet, and can probably be improved
+somewhat by fixing whatever memory leak it (probably) has.
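
The bloom filter idea in the todo item can be sketched roughly as follows. This is an illustrative Python sketch, not git-annex code (git-annex itself is written in Haskell). The names `BloomFilter` and `find_unused`, the key strings, and the choice of SHA-256 with double hashing are all assumptions made for the example. A bloom filter can give false positives but never false negatives, so a key that is reported as unused really is absent from the used set, while a small tunable fraction of unused keys may be missed.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size bloom filter: false positives possible, false negatives not."""

    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing formulas: m = -n*ln(p) / (ln 2)^2 and k = (m/n)*ln 2.
        n = max(1, expected_items)
        ln2 = math.log(2)
        self.num_bits = max(8, math.ceil(-n * math.log(false_positive_rate) / ln2 ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit halves of
        # one SHA-256 digest (an assumption of this sketch).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def find_unused(used_keys, present_keys, expected_used, false_positive_rate=0.001):
    """Return present keys that are definitely not in used_keys.

    Both inputs may be streamed iterables; only the filter is held in memory.
    """
    used = BloomFilter(expected_used, false_positive_rate)
    for key in used_keys:
        used.add(key)
    # A key missing from the filter is certainly unused. A key found in the
    # filter is either used or a false positive (an unused key we miss).
    return [key for key in present_keys if key not in used]


# Hypothetical key names, for illustration only.
used_keys = ["KEY-a", "KEY-b", "KEY-c"]
present_keys = ["KEY-a", "KEY-d", "KEY-e"]
unused = find_unused(iter(used_keys), iter(present_keys), expected_used=3)
```

With these sizing formulas, a filter for 80 thousand keys at a 0.1% false positive rate needs about 1.15 million bits, roughly 140 KB, which is the kind of trade-off the "probability/size of filter could be tunable" remark refers to.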