author     Joey Hess <joey@kitenet.net>  2011-12-23 14:09:36 -0400
committer  Joey Hess <joey@kitenet.net>  2011-12-23 14:09:36 -0400
commit     499b9dcacdf9551db515b06b80d197d7a022c3b0 (patch)
tree       914dd2eaa641b6397ee19aa9f63565173f78222b
parent     1940e52793ff20cf1c48f78fef95985434ed7dee (diff)
parent     cbaf13e587e340debf22aa1de505f4ebb15f583c (diff)
Merge branch 'master' of ssh://git-annex.branchable.com
-rw-r--r--  doc/forum/git_pull_remote_git-annex/comment_6_3925d1aa56bce9380f712e238d63080f._comment  8
-rw-r--r--  doc/forum/git_pull_remote_git-annex/comment_7_24c45ee981b18bc78325c768242e635d._comment  8
-rw-r--r--  doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_10_d78d79fb2f3713aa69f45d2691cf8dfe._comment  68
-rw-r--r--  doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_11_4316d9d74312112dc4c823077af7febe._comment  8
-rw-r--r--  doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_12_ed6d07f16a11c6eee7e3d5005e8e6fa3._comment  8
5 files changed, 100 insertions, 0 deletions
diff --git a/doc/forum/git_pull_remote_git-annex/comment_6_3925d1aa56bce9380f712e238d63080f._comment b/doc/forum/git_pull_remote_git-annex/comment_6_3925d1aa56bce9380f712e238d63080f._comment
new file mode 100644
index 000000000..f4b5ebec2
--- /dev/null
+++ b/doc/forum/git_pull_remote_git-annex/comment_6_3925d1aa56bce9380f712e238d63080f._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="http://adamspiers.myopenid.com/"
+ nickname="Adam"
+ subject="comment 6"
+ date="2011-12-23T17:14:03Z"
+ content="""
+Extending `git annex sync` would be nice, although auto-commit does not suit every use case, so it would be better not to couple one to the other.
+"""]]
diff --git a/doc/forum/git_pull_remote_git-annex/comment_7_24c45ee981b18bc78325c768242e635d._comment b/doc/forum/git_pull_remote_git-annex/comment_7_24c45ee981b18bc78325c768242e635d._comment
new file mode 100644
index 000000000..dad2c0af2
--- /dev/null
+++ b/doc/forum/git_pull_remote_git-annex/comment_7_24c45ee981b18bc78325c768242e635d._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="http://adamspiers.myopenid.com/"
+ nickname="Adam"
+ subject="comment 7"
+ date="2011-12-23T17:24:58Z"
+ content="""
+P.S. I see you already [fixed the docs](http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=a0227e81f9c82afc12ac1bd1cecd63cc0894d751) - thanks! :)
+"""]]
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_10_d78d79fb2f3713aa69f45d2691cf8dfe._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_10_d78d79fb2f3713aa69f45d2691cf8dfe._comment
new file mode 100644
index 000000000..5dbb66cf6
--- /dev/null
+++ b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_10_d78d79fb2f3713aa69f45d2691cf8dfe._comment
@@ -0,0 +1,68 @@
+[[!comment format=mdwn
+ username="http://adamspiers.myopenid.com/"
+ nickname="Adam"
+ subject="comment 10"
+ date="2011-12-23T17:22:11Z"
+ content="""
+> Your perl script is not O(n). Inserting into perl hash tables has
+> overhead of minimum O(n log n).
+
+What's your source for this assertion? I would expect an amortized
+average of `O(1)` per insertion, i.e. `O(n)` for full population.
+
+> Not counting the overhead of resizing hash tables,
+> the grievous slowdown if the bucket size is overcome by data (it
+> probably falls back to a linked list or something then), and the
+> overhead of traversing the hash tables to get data out.
+
+None of which necessarily changes the algorithmic complexity. However,
+real benchmarks are far more useful here than complexity analysis, and
+[the dangers of premature optimization](http://c2.com/cgi/wiki?PrematureOptimization)
+should not be forgotten.
+
+> Your memory size calculations ignore the overhead of a hash table or
+> other data structure to store the data in, which will tend to be
+> more than the actual data size it's storing. I estimate your 50
+> million number is off by at least one order of magnitude, and more
+> likely two;
+
+Sure, I was aware of that, but my point still stands. Even 500k keys
+per 1GB of RAM does not sound expensive to me.
+
+> in any case I don't want git-annex to use 1 gb of ram.
+
+Why not? What's the maximum it should use? 512MB? 256MB?
+32MB? I don't see the sense in the author of a program
+dictating thresholds which are entirely dependent on the context
+in which the program is *run*, not the context in which it's *written*.
+That's why systems have files such as `/etc/security/limits.conf`.
+
+You said you want git-annex to scale to enormous repositories. If you
+impose an arbitrary memory restriction such as the above, that means
+avoiding implementing *any* kind of functionality which requires `O(n)`
+memory or worse. Isn't it reasonable to assume that many users use
+git-annex on repositories which are *not* enormous? Even when they do
+work with enormous repositories, just like with any other program,
+they would naturally expect certain operations to take longer or
+become impractical without sufficient RAM. That's why I say that this
+restriction amounts to throwing out the baby with the bathwater.
+It just means that those who need the functionality would have to
+reimplement it themselves, assuming they are able, which is likely
+to result in more wheel reinventions. I've already shared
+[my implementation](https://github.com/aspiers/git-config/blob/master/bin/git-annex-finddups),
+but how many people are likely to find it, let alone get it working?
+
+> Little known fact: sort(1) will use a temp file as a buffer if too
+> much memory is needed to hold the data to sort.
+
+Interesting. Presumably you are referring to some undocumented
+behaviour, rather than `--batch-size`, which only applies when merging
+multiple files, not when sorting STDIN alone.
+
+> It's also written in the most efficient language possible and has
+> been ruthlessly optimised for 30 years, so I would be very surprised
+> if it was not the best choice.
+
+It's the best choice for sorting. But sorting purely to detect
+duplicates is a dismally bad choice.
+"""]]
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_11_4316d9d74312112dc4c823077af7febe._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_11_4316d9d74312112dc4c823077af7febe._comment
new file mode 100644
index 000000000..286487eee
--- /dev/null
+++ b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_11_4316d9d74312112dc4c823077af7febe._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="http://joey.kitenet.net/"
+ nickname="joey"
+ subject="comment 11"
+ date="2011-12-23T17:52:21Z"
+ content="""
+I don't think that [[tips/finding_duplicate_files]] is hard to find, and the multiple different ways it shows to deal with duplicate files demonstrate the flexibility of the unix pipeline approach.
+"""]]
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_12_ed6d07f16a11c6eee7e3d5005e8e6fa3._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_12_ed6d07f16a11c6eee7e3d5005e8e6fa3._comment
new file mode 100644
index 000000000..909beed83
--- /dev/null
+++ b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_12_ed6d07f16a11c6eee7e3d5005e8e6fa3._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="http://joey.kitenet.net/"
+ nickname="joey"
+ subject="comment 12"
+ date="2011-12-23T18:02:24Z"
+ content="""
+BTW, `sort -S '90%'` consistently benchmarks 2x as fast as perl's hashes all the way up to 1 million files. Of course the pipeline approach allows you to swap in perl or whatever else is best for you at scale.
+"""]]