author Joey Hess <joey@kitenet.net> 2011-12-23 11:34:10 -0400
committer Joey Hess <joey@kitenet.net> 2011-12-23 11:34:10 -0400
commit 8a2105c90a5bcdc9ee44885e3b57e94046c9f83e (patch)
tree 7f9ea70eabbff44c5cb90f66da29249e34e2ea17 /doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates
parent f015ef5fde9184b6756ee74c2be1bb39ae5f54ca (diff)
parent abba5d3e827b5d31766b95b2e2003aa821f289fc (diff)
Merge branch 'master' of ssh://git-annex.branchable.com
Diffstat (limited to 'doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates')
-rw-r--r-- doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment | 94
-rw-r--r-- doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_8_221ed2e53420278072a6d879c6f251d1._comment | 12
2 files changed, 106 insertions, 0 deletions
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment
new file mode 100644
index 000000000..a33700280
--- /dev/null
+++ b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment
@@ -0,0 +1,94 @@
+[[!comment format=mdwn
+ username="http://adamspiers.myopenid.com/"
+ nickname="Adam"
+ subject="comment 7"
+ date="2011-12-22T20:04:14Z"
+ content="""
+> My main concern with putting this in git-annex is that finding
+> duplicates necessarily involves storing a list of every key and file
+> in the repository
+
+Only if you want to search the *whole* repository for duplicates, and
+if you do, then you're necessarily going to have to chew up memory in
+some process anyway, so what difference does it make whether it's
+git-annex or (say) a Perl wrapper?
+
+> and git-annex is very carefully built to avoid things that require
+> non-constant memory use, so that it can scale to very big
+> repositories.
+
+That's a worthy goal, but if everything could be implemented with an
+O(1) memory footprint then we'd be living in a much more pleasant
+world :-) Even O(n) isn't that bad ...
+
+That aside, I like your `--format=\"%f %k\n\"` idea a lot. That opens
+up the \"black box\" of `.git/annex/objects` and makes nice things
+possible, as your pipeline already demonstrates. However, I'm not
+sure why you think `git annex find | sort | uniq` would be more
+efficient. Not only does the sort require the very thing you were
+trying to avoid (i.e. the whole list in memory), but it's also
+O(n log n), which is significantly slower than my O(n) Perl script
+linked above.
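+
+For illustration, here is a stripped-down sketch of that single-pass
+idea (not the linked script itself; it assumes one \"%f %k\" pair per
+line of input, with the key last, since keys never contain spaces):
+
+    #!/usr/bin/perl
+    # Group filenames by annex key in one O(n) pass over stdin.
+    use strict;
+    use warnings;
+
+    my %files_by_key;
+    while (my $line = <STDIN>) {
+        chomp $line;
+        # Filenames may contain spaces; keys never do, so peel
+        # the key off the end of the line.
+        my ($file, $key) = $line =~ /^(.*)\s(\S+)$/ or next;
+        push @{ $files_by_key{$key} }, $file;
+    }
+
+    # Any key reached via more than one filename is a duplicate.
+    for my $key (keys %files_by_key) {
+        my @dupes = @{ $files_by_key{$key} };
+        print join(' ', $key, @dupes), \"\n\" if @dupes > 1;
+    }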
+
+More considerations about this pipeline:
+
+* Doesn't it only include locally available files? Ideally it should
+ spot duplicates even when the backing blob is not available locally.
+* What's the point of `--include '*'`? Doesn't `git annex find`
+ with no arguments already include all files, modulo the requirement
+ above that they're locally available?
+* Anyone using this `git annex find | ...` approach is likely to
+ run up against its limitations sooner rather than later, because
+ they're already used to the plethora of options `find(1)` provides.
+ Rather than reinventing the wheel, is there some way `git annex find`
+ could harness the power of `find(1)`?
+
+Those considerations aside, a combined approach would be to implement
+
+ git annex find --format=...
+
+and then alter my Perl wrapper to `popen(2)` from that rather than using
+`File::Find`. But I doubt you would want to ship Perl wrappers in the
+distribution, so if you don't provide a Haskell equivalent then users
+who can't code are left high and dry.
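+
+Roughly, the wrapper's input loop would then shrink to something like
+this (again assuming the proposed `--format` option exists, with the
+same \"%f %k\" output as above):
+
+    # Read 'file key' pairs from git-annex instead of walking
+    # .git/annex/objects with File::Find.
+    my %files_by_key;
+    open(my $fh, '-|', 'git', 'annex', 'find', '--format=%f %k\n')
+        or die \"cannot run git annex find: $!\";
+    while (my $line = <$fh>) {
+        chomp $line;
+        my ($file, $key) = $line =~ /^(.*)\s(\S+)$/ or next;
+        push @{ $files_by_key{$key} }, $file;
+    }
+    close $fh;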
+"""]]
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_8_221ed2e53420278072a6d879c6f251d1._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_8_221ed2e53420278072a6d879c6f251d1._comment
new file mode 100644
index 000000000..5ac292afe
--- /dev/null
+++ b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_8_221ed2e53420278072a6d879c6f251d1._comment
@@ -0,0 +1,12 @@
+[[!comment format=mdwn
+ username="http://adamspiers.myopenid.com/"
+ nickname="Adam"
+ subject="How much memory would it actually use anyway?"
+ date="2011-12-22T20:15:22Z"
+ content="""
+Another thought - an SHA1 digest is 20 bytes. That means you can fit over 50 million keys into 1GB of RAM. Granted you also need memory to store the values (pathnames) which in many cases will be longer, and some users may also choose more expensive backends than SHA1 ... but even so, it seems to me that you are at risk of throwing the baby out with the bath water.
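+
+(Quick sanity check on that figure: 2^30 bytes / 20 bytes per digest
+comes to about 53.7 million digests, so \"over 50 million\" holds even
+before counting the per-entry overhead of the hash structure itself.)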
+"""]]