path: root/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment
Diffstat (limited to 'doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment')
-rw-r--r-- doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment 54
1 files changed, 0 insertions, 54 deletions
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment
deleted file mode 100644
index a33700280..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment
+++ /dev/null
@@ -1,54 +0,0 @@
-[[!comment format=mdwn
- username="http://adamspiers.myopenid.com/"
- nickname="Adam"
- subject="comment 7"
- date="2011-12-22T20:04:14Z"
- content="""
-> My main concern with putting this in git-annex is that finding
-> duplicates necessarily involves storing a list of every key and file
-> in the repository
-
-Only if you want to search the *whole* repository for duplicates, and if
-you do, then you're necessarily going to have to chew up memory in
-some process anyway, so what difference does it make whether it's git-annex or
-(say) a Perl wrapper?
-
-> and git-annex is very carefully built to avoid things that require
-> non-constant memory use, so that it can scale to very big
-> repositories.
-
-That's a worthy goal, but if everything could be implemented with an
-O(1) memory footprint then we'd be in a much more pleasant world :-)
-Even O(n) isn't that bad ...
-
-That aside, I like your `--format=\"%f %k\n\"` idea a lot. That opens
-up the \"black box\" of `.git/annex/objects` and makes nice things
-possible, as your pipeline already demonstrates. However, I'm not
-sure why you think `git annex find | sort | uniq` would be more
-efficient. Not only does the sort require the very thing you were
-trying to avoid (i.e. the whole list in memory), but it's also
-O(n log n), which is significantly slower than my O(n) Perl script
-linked above.
-
-More considerations about this pipeline:
-
-* Doesn't it only include locally available files? Ideally it should
- spot duplicates even when the backing blob is not available locally.
-* What's the point of `--include '*'` ? Doesn't `git annex find`
- with no arguments already include all files, modulo the requirement
- above that they're locally available?
-* Any user using this `git annex find | ...` approach is likely to
- run up against its limitations sooner rather than later, because
- they're already used to the plethora of options `find(1)` provides.
- Rather than reinventing the wheel, is there some way `git annex find`
- could harness the power of `find(1)` ?
-
-Those considerations aside, a combined approach would be to implement
-
- git annex find --format=...
-
-and then alter my Perl wrapper to `popen(3)` from that rather than using
-`File::Find`. But I doubt you would want to ship Perl wrappers in the
-distribution, so if you don't provide a Haskell equivalent then users
-who can't code are left high and dry.
-"""]]
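
For illustration, the single-pass, hash-based grouping that the comment contrasts with `sort | uniq` might look roughly like the sketch below. It is written in Python rather than Perl and is not the script linked in the thread. It assumes each input line has the form `<file> <key>`, as the `%f %k` format proposed above would produce, and that keys contain no spaces. The name `find_duplicates` is hypothetical.

```python
import sys
from collections import defaultdict


def find_duplicates(lines):
    """Group file names by key in one pass; return only keys seen more than once."""
    files_by_key = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: "<file> <key>". Splitting on the last space keeps
        # file names that contain spaces intact (keys are assumed space-free).
        file_name, _, key = line.rpartition(" ")
        files_by_key[key].append(file_name)
    return {key: files for key, files in files_by_key.items() if len(files) > 1}


if __name__ == "__main__":
    # Reads "<file> <key>" lines from standard input and prints each
    # duplicated key followed by the files that share it.
    for key, files in find_duplicates(sys.stdin).items():
        print(key)
        for file_name in files:
            print("  " + file_name)
```

Each line costs one dictionary insertion, so the running time is linear on average. Memory still grows with the number of files, which is the trade-off the comment argues is acceptable.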