author Joey Hess <joey@kitenet.net> 2011-04-07 14:45:10 -0400
committer Joey Hess <joey@kitenet.net> 2011-04-07 14:45:10 -0400
commit 4ea0b7c28850eb703562cd9dc84a02c49b5fda00
tree 955e2a91578328fafd235225bd7d5d8a82727987 /doc
parent 6fa86946d5c7a35aa73ee342f8388398f0f0fbd5
add
Diffstat (limited to 'doc')
-rw-r--r-- doc/todo/git-annex_unused_eats_memory.mdwn | 25
1 files changed, 25 insertions, 0 deletions
diff --git a/doc/todo/git-annex_unused_eats_memory.mdwn b/doc/todo/git-annex_unused_eats_memory.mdwn
new file mode 100644
index 000000000..6ce714004
--- /dev/null
+++ b/doc/todo/git-annex_unused_eats_memory.mdwn
@@ -0,0 +1,25 @@
+`git-annex unused` has to compare large sets of data
+(all keys with content present in the repository,
+with all keys used by files in the repository), and so
+uses more memory than git-annex typically needs; around
+60-80 MB when run in a repository with 80 thousand files.
+
+I would like to reduce this. One idea is to use a bloom filter.
+For example, construct a bloom filter of all keys used by files in
+the repository. Then for each key with content present, check if it's
+in the bloom filter. Since there can be false positives, this might
+miss finding some unused keys. The probability/size of filter
+could be tunable.
+
+Another way might be to scan the git log for files that got removed
+or changed what key they pointed to. Correlate with keys with content
+currently present in the repository (possibly using a bloom filter again),
+and that would yield a shortlist of keys that are probably not used.
+Then scan through all files in the repo to make sure that none point to keys
+on the shortlist.
+
+----
+
+`git annex unused --from remote` is much worse, using hundreds of MB of
+memory. It has not been profiled at all yet, and can probably be improved
+somewhat by fixing whatever memory leak it (probably) has.
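
The bloom filter idea from the first section of the added file could look roughly like the sketch below. It is written in Python for brevity and is not git-annex code: the `BloomFilter` class, the `find_unused` function, and the `expected_used` and `false_positive_rate` parameters are all hypothetical names chosen for illustration. The filter is built from the keys used by files, and each key with content present is then tested against it. A false positive makes an unused key look used, so some unused keys can be missed, but a key that is in use is never reported as unused.

```python
import hashlib
import math


class BloomFilter:
    """Minimal bloom filter sized from an expected item count and a target false positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.001):
        n = max(1, expected_items)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        self.num_bits = max(8, math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from one SHA-256 digest using double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, so it is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def find_unused(used_keys, present_keys, expected_used, false_positive_rate=0.001):
    """Return present keys that are definitely not used by any file.

    Both inputs may be iterators, so neither full set has to be held in memory.
    Unused keys that collide with the filter (false positives) are skipped.
    """
    bloom = BloomFilter(expected_used, false_positive_rate)
    for key in used_keys:
        bloom.add(key)
    return [key for key in present_keys if key not in bloom]
```

With these sizing formulas, 80 thousand keys at a 0.1% false positive rate need about 1.15 million bits, or roughly 140 KB, which is where the tunable trade-off between filter size and missed keys comes from.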