Added a comment

author: http://joey.kitenet.net/ <joey@web> 2011-12-23 16:07:39 +0000
committer: admin <admin@branchable.com> 2011-12-23 16:07:39 +0000
commit: 538665f4779825427f7f42d6245e83032459951b (patch)
tree: 7721872b10b04a4352f1c0efed231bbf0e847f8a
parent: 8a2105c90a5bcdc9ee44885e3b57e94046c9f83e (diff)
1 files changed, 14 insertions, 0 deletions
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_9_aecfa896c97b9448f235bce18a40621d._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_9_aecfa896c97b9448f235bce18a40621d._comment
new file mode 100644
index 000000000..82c6921eb
--- /dev/null
+++ b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_9_aecfa896c97b9448f235bce18a40621d._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="http://joey.kitenet.net/"
+ nickname="joey"
+ subject="comment 9"
+ date="2011-12-23T16:07:39Z"
+ content="""
+Adam, to answer a lot of points briefly:
+
+* --include='*' makes find list files whether their contents are present or not
+* Your Perl script is not O(n). Inserting into Perl hash tables has an overhead of at minimum O(n log n). That is not counting the overhead of resizing hash tables, the grievous slowdown if the bucket size is overcome by data (it probably falls back to a linked list or something then), and the overhead of traversing the hash tables to get data out.
+* I think that git-annex's set of file matching options is coming along nicely, and new ones can easily be added, so see no need to pull in unix find(1).
+* Your memory size calculations ignore the overhead of a hash table or other data structure to store the data in, which will tend to be more than the actual size of the data it's storing. I estimate your 50 million number is off by at least one order of magnitude, and more likely two; in any case I don't want git-annex to use 1 GB of RAM.
+* Little known fact: sort(1) will use a temp file as a buffer if too much memory is needed to hold the data to sort. It's also written in the most efficient language possible and has been ruthlessly optimised for 30 years, so I would be very surprised if it was not the best choice.
+"""]]