Diffstat (limited to 'doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates')
12 files changed, 0 insertions, 286 deletions
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_10_d78d79fb2f3713aa69f45d2691cf8dfe._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_10_d78d79fb2f3713aa69f45d2691cf8dfe._comment
deleted file mode 100644
index 5dbb66cf6..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_10_d78d79fb2f3713aa69f45d2691cf8dfe._comment
+++ /dev/null
@@ -1,68 +0,0 @@
-[[!comment format=mdwn
- username="http://adamspiers.myopenid.com/"
- nickname="Adam"
- subject="comment 10"
- date="2011-12-23T17:22:11Z"
- content="""
-> Your perl script is not O(n). Inserting into perl hash tables has
-> overhead of minimum O(n log n).
-What's your source for this assertion? I would expect an amortized
-average of `O(1)` per insertion, i.e. `O(n)` for full population.
-> Not counting the overhead of resizing hash tables,
-> the grievous slowdown if the bucket size is overcome by data (it
-> probably falls back to a linked list or something then), and the
-> overhead of traversing the hash tables to get data out.
-None of which necessarily change the algorithmic complexity. However
-real benchmarks are far more useful here than complexity analysis, and
-[the dangers of premature optimization](http://c2.com/cgi/wiki?PrematureOptimization)
-should not be forgotten.
-> Your memory size calculations ignore the overhead of a hash table or
-> other data structure to store the data in, which will tend to be
-> more than the actual data size it's storing. I estimate your 50
-> million number is off by at least one order of magnitude, and more
-> likely two;
-Sure, I was aware of that, but my point still stands. Even 500k keys
-per 1GB of RAM does not sound expensive to me.
-> in any case I don't want git-annex to use 1 gb of ram.
-Why not? What's the maximum it should use? 512MB? 256MB?
-32MB? I don't see the sense in the author of a program
-dictating thresholds which are entirely dependent on the context
-in which the program is *run*, not the context in which it's *written*.
-That's why systems have files such as `/etc/security/limits.conf`.
-You said you want git-annex to scale to enormous repositories. If you
-impose an arbitrary memory restriction such as the above, that means
-avoiding implementing *any* kind of functionality which requires `O(n)`
-memory or worse. Isn't it reasonable to assume that many users use
-git-annex on repositories which are *not* enormous? Even when they do
-work with enormous repositories, just like with any other program,
-they would naturally expect certain operations to take longer or
-become impractical without sufficient RAM. That's why I say that this
-restriction amounts to throwing out the baby with the bathwater.
-It just means that those who need the functionality would have to
-reimplement it themselves, assuming they are able, which is likely
-to result in more wheel reinventions. I've already shared
-[my implementation](https://github.com/aspiers/git-config/blob/master/bin/git-annex-finddups)
-but how many people are likely to find it, let alone get it working?
-> Little known fact: sort(1) will use a temp file as a buffer if too
-> much memory is needed to hold the data to sort.
-Interesting. Presumably you are referring to some undocumented
-behaviour, rather than `--batch-size` which only applies when merging
-multiple files, and not when only sorting STDIN.
-> It's also written in the most efficient language possible and has
-> been ruthlessly optimised for 30 years, so I would be very surprised
-> if it was not the best choice.
-It's the best choice for sorting. But sorting purely to detect
-duplicates is a dismally bad choice.
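The complexity argument in this comment can be made concrete. What follows is a minimal Python sketch, not code from the thread, of what a `sort | uniq` style pipeline does in principle: order the (file, key) pairs by key, then scan neighbouring entries for repeats. The function name and the sample keys are hypothetical. The sort step costs O(n log n) comparisons; note that this in-memory version, unlike `sort(1)`, cannot spill to a temp file.

```python
from itertools import groupby
from operator import itemgetter


def duplicates_by_sorting(pairs):
    """Return [(key, [files...])] for every key shared by 2+ files.

    `pairs` is an iterable of (filename, key) tuples.
    """
    # Sorting by key brings files with the same key next to each other.
    ordered = sorted(pairs, key=itemgetter(1))
    result = []
    # groupby only merges adjacent items, which is why the sort is needed.
    for key, group in groupby(ordered, key=itemgetter(1)):
        names = [filename for filename, _ in group]
        if len(names) > 1:
            result.append((key, names))
    return result


if __name__ == "__main__":
    # Hypothetical sample data; real keys would come from git-annex.
    sample = [("a.jpg", "k1"), ("b.jpg", "k2"), ("c.jpg", "k1")]
    for key, names in duplicates_by_sorting(sample):
        print(key, names)
```

Whether this or a hash table is faster in practice is exactly the kind of question the comment says should be settled by benchmarks rather than by analysis.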
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_11_4316d9d74312112dc4c823077af7febe._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_11_4316d9d74312112dc4c823077af7febe._comment
deleted file mode 100644
index 286487eee..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_11_4316d9d74312112dc4c823077af7febe._comment
+++ /dev/null
@@ -1,8 +0,0 @@
-[[!comment format=mdwn
- username="http://joey.kitenet.net/"
- nickname="joey"
- subject="comment 11"
- date="2011-12-23T17:52:21Z"
- content="""
-I don't think that [[tips/finding_duplicate_files]] is hard to find, and the multiple different ways it shows to deal with the duplicate files demonstrate the flexibility of the unix pipeline approach.
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_12_ed6d07f16a11c6eee7e3d5005e8e6fa3._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_12_ed6d07f16a11c6eee7e3d5005e8e6fa3._comment
deleted file mode 100644
index 909beed83..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_12_ed6d07f16a11c6eee7e3d5005e8e6fa3._comment
+++ /dev/null
@@ -1,8 +0,0 @@
-[[!comment format=mdwn
- username="http://joey.kitenet.net/"
- nickname="joey"
- subject="comment 12"
- date="2011-12-23T18:02:24Z"
- content="""
-BTW, sort -S '90%' benchmarks consistently 2x as fast as perl's hashes all the way up to 1 million files. Of course the pipeline approach allows you to swap in perl or whatever else is best for you at scale.
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_1_fd213310ee548d8726791d2b02237fde._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_1_fd213310ee548d8726791d2b02237fde._comment
deleted file mode 100644
index 094e4526e..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_1_fd213310ee548d8726791d2b02237fde._comment
+++ /dev/null
@@ -1,29 +0,0 @@
-[[!comment format=mdwn
- username="http://joey.kitenet.net/"
- nickname="joey"
- subject="comment 1"
- date="2011-01-27T18:29:44Z"
- content="""
-Hey Asheesh, I'm happy you're finding git-annex useful.
-So, there are two forms of duplication going on here. There's duplication of the content, and duplication of the filenames
-pointing at that content.
-Duplication of the filenames is probably not a concern, although it's what I thought you were talking about at first. It's probably info worth recording that backup-2010/some_dir/foo and backup-2009/other_dir/foo are two names you've used for the same content in the past. If you really wanted to remove backup-2009/foo, you could do it by writing a script that looks at the basenames of the symlink targets and removes files that point to the same content as other files.
-Using SHA1 ensures that the same key is used for identical files, so generally avoids duplication of content. But if you have 2 disks with an identical file on each, and make them both into annexes, then git-annex will happily retain both
-copies of the content, one per disk. It generally considers keeping copies of content a good thing. :)
-So, what if you want to remove the unnecessary copies? Well, there's a really simple way:
-    cd /media/usb-1
-    git remote add other-disk /media/usb-0
-    git annex add
-    git annex drop
-This asks git-annex to add everything to the annex, but then remove any file contents that it can safely remove. What can it safely remove? Well, anything that it can verify is on another repository such as \"other-disk\"! So, this will happily drop any duplicated file contents, while leaving all the rest alone.
-In practice, you might not want to have all your old backup disks mounted at the same time and configured as remotes. Look into configuring [[trust]] to avoid needing to do that. If usb-0 is already a trusted disk, all you need is a simple \"git annex drop\" on usb-1.
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_2_4394bde1c6fd44acae649baffe802775._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_2_4394bde1c6fd44acae649baffe802775._comment
deleted file mode 100644
index 04d58a459..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_2_4394bde1c6fd44acae649baffe802775._comment
+++ /dev/null
@@ -1,18 +0,0 @@
-[[!comment format=mdwn
- username="https://www.google.com/accounts/o8/id?id=AItOawkjvjLHW9Omza7x1VEzIFQ8Z5honhRB90I"
- nickname="Asheesh"
- subject="I actually *do* want to avoid duplication of filenames"
- date="2011-01-28T07:30:05Z"
- content="""
-I really do want just one filename per file, at least for some cases.
-For my photos, there's no benefit to having a few filenames point to the same file. As I'm putting them all into the git-annex, that is a good time to remove the pure duplicates so that I don't e.g. see them twice when browsing the directory as a gallery. Also, I am uploading my photos to the web, and I want to avoid uploading the same photo (by content) twice.
-I hope that makes things clearer!
-For now I'm just doing this:
-* paulproteus@renaissance:/mnt/backups-terabyte/paulproteus/sd-card-from-2011-01-06/sd-cards/DCIM/100CANON $ for file in *; do hash=$(sha1sum \"$file\" | cut -d' ' -f1); if ls /home/paulproteus/Photos/in-flickr/.git-annex | grep -q \"$hash\"; then echo already annexed ; else flickr_upload \"$file\" && mv \"$file\" \"/home/paulproteus/Photos/in-flickr/2011-01-28/from-some-nested-sd-card-bk\" && (cd /home/paulproteus/Photos/in-flickr/2011-01-28/from-some-nested-sd-card-bk && git annex add . && git commit -m ...) ; fi; done
-(Yeah, Flickr for my photos for now. I feel sad about betraying the principle of autonomo.us-ness.)
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_3_076cb22057583957d5179d8ba9004605._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_3_076cb22057583957d5179d8ba9004605._comment
deleted file mode 100644
index d11119bc3..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_3_076cb22057583957d5179d8ba9004605._comment
+++ /dev/null
@@ -1,18 +0,0 @@
-[[!comment format=mdwn
- username="https://www.google.com/accounts/o8/id?id=AItOawkjvjLHW9Omza7x1VEzIFQ8Z5honhRB90I"
- nickname="Asheesh"
- subject="Duplication of the filenames is what I am concerned about"
- date="2011-04-29T11:48:22Z"
- content="""
-For what it's worth, yes, I want to actually forget I ever had the same file in the filesystem with a duplicated name. I'm not just aiming to clean up the disk's space usage; I'm also aiming to clean things up so that navigating the filesystem is easier.
-I can write my own script to do that based on the symlinks' target (and I wrote something along those lines), but I still think it'd be nicer if git-annex supported this use case.
-<pre>git annex drop --by-contents</pre>
-could let me remove a file from git-annex if the contents are available through a different name. (Right now, \"git annex drop\" requires the name *and* contents match.)
--- Asheesh.
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_4_f120d1e83c1a447f2ecce302fc69cf74._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_4_f120d1e83c1a447f2ecce302fc69cf74._comment
deleted file mode 100644
index a218ee3d5..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_4_f120d1e83c1a447f2ecce302fc69cf74._comment
+++ /dev/null
@@ -1,35 +0,0 @@
-[[!comment format=mdwn
- username="http://adamspiers.myopenid.com/"
- nickname="Adam"
- subject="List the duplicate filenames, then let the user decide what to do"
- date="2011-12-22T12:31:29Z"
- content="""
-I have the same use case as Asheesh but I want to be able to see which filenames point to the same objects and then decide which of the duplicates to drop myself. I think
- git annex drop --by-contents
-would be the wrong approach because how does git-annex know which ones to drop? There's too much potential for error.
-Instead it would be great to have something like
- git annex finddups
-While it's easy enough to knock up a bit of shell or Perl to achieve this, that relies on knowledge of the annex symlink structure, so I think really it belongs inside git-annex.
-If this command gave output similar to the excellent `fastdup` utility:
- Scanning for files... 672 files in 10.439 seconds
- Comparing 2 sets of files...
- 2 files (70.71 MB/ea)
- /home/adam/media/flat/tour/flat-tour.3gp
- /home/adam/videos/tour.3gp
- Found 1 duplicate of 1 file (70.71 MB wasted)
- Scanned 672 files (1.96 GB) in 11.415 seconds
-then you could do stuff like
- git annex finddups | grep /home/adam/media/flat | xargs rm
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_5_5c30294b3c59fdebb1eef0ae5da4cd4f._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_5_5c30294b3c59fdebb1eef0ae5da4cd4f._comment
deleted file mode 100644
index e48a4a9b3..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_5_5c30294b3c59fdebb1eef0ae5da4cd4f._comment
+++ /dev/null
@@ -1,10 +0,0 @@
-[[!comment format=mdwn
- username="http://adamspiers.myopenid.com/"
- nickname="Adam"
- subject="Here's a Perl version"
- date="2011-12-22T15:43:51Z"
- content="""
-but it would be better in git-annex itself ...
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_6_f24541ada1c86d755acba7e9fa7cff24._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_6_f24541ada1c86d755acba7e9fa7cff24._comment
deleted file mode 100644
index 5d8ac8e61..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_6_f24541ada1c86d755acba7e9fa7cff24._comment
+++ /dev/null
@@ -1,16 +0,0 @@
-[[!comment format=mdwn
- username="http://joey.kitenet.net/"
- nickname="joey"
- subject="comment 6"
- date="2011-12-22T16:39:24Z"
- content="""
-My main concern with putting this in git-annex is that finding duplicates necessarily involves storing a list of every key and file in the repository, and git-annex is very carefully built to avoid things that require non-constant memory use, so that it can scale to very big repositories. (The only exception is the `unused` command, and reducing its memory usage is a continuing goal.)
-So I would rather come at this from a different angle.. like providing a way to output a list of files and their associated keys, which the user can then use in their own shell pipelines to find duplicate keys:
- git annex find --include '*' --format='${file} ${key}\n' | sort --key 2 | uniq --all-repeated --skip-fields=1
-Which is implemented now!
-(Making that pipeline properly handle filenames with spaces is left as an exercise for the reader..)
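One possible take on that exercise, as a hedged Python sketch rather than anything shipped with git-annex: read the `${file} ${key}` lines produced by the pipeline above and treat the last space-separated field as the key, so spaces in the filename are preserved. This assumes keys themselves never contain spaces and that filenames contain no newlines; the function name and sample keys are made up for illustration. Unlike the `sort` pipeline, it holds every key and filename in memory.

```python
import sys
from collections import defaultdict


def find_duplicates(lines):
    """Map each key to its filenames, keeping only keys with 2+ files."""
    files_by_key = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if " " not in line:
            continue  # skip blank or malformed lines
        # Assumption: the key is the final field and contains no spaces,
        # so everything before the last space is the filename.
        filename, key = line.rsplit(" ", 1)
        files_by_key[key].append(filename)
    return {k: names for k, names in files_by_key.items() if len(names) > 1}


if __name__ == "__main__":
    # Intended use: pipe the output of the `git annex find` command above
    # into this script on stdin.
    for key, names in find_duplicates(sys.stdin).items():
        print(key)
        for name in names:
            print("  " + name)
```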
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment
deleted file mode 100644
index a33700280..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_7_c39f1bb7c61a89b238c61bee1c049767._comment
+++ /dev/null
@@ -1,54 +0,0 @@
-[[!comment format=mdwn
- username="http://adamspiers.myopenid.com/"
- nickname="Adam"
- subject="comment 7"
- date="2011-12-22T20:04:14Z"
- content="""
-> My main concern with putting this in git-annex is that finding
-> duplicates necessarily involves storing a list of every key and file
-> in the repository
-Only if you want to search the *whole* repository for duplicates, and if
-you do, then you're necessarily going to have to chew up memory in
-some process anyway, so what difference whether it's git-annex or
-(say) a Perl wrapper?
-> and git-annex is very carefully built to avoid things that require
-> non-constant memory use, so that it can scale to very big
-> repositories.
-That's a worthy goal, but if everything could be implemented with an
-O(1) memory footprint then we'd be in a much more pleasant world :-)
-Even O(n) isn't that bad ...
-That aside, I like your `--format=\"%f %k\n\"` idea a lot. That opens
-up the \"black box\" of `.git/annex/objects` and makes nice things
-possible, as your pipeline already demonstrates. However, I'm not
-sure why you think `git annex find | sort | uniq` would be more
-efficient. Not only does the sort require the very thing you were
-trying to avoid (i.e. the whole list in memory), but it's also
-O(n log n) which is significantly slower than my O(n) Perl script
-linked above.
-More considerations about this pipeline:
-* Doesn't it only include locally available files? Ideally it should
- spot duplicates even when the backing blob is not available locally.
-* What's the point of `--include '*'` ? Doesn't `git annex find`
- with no arguments already include all files, modulo the requirement
- above that they're locally available?
-* Any user using this `git annex find | ...` approach is likely to
- run up against its limitations sooner rather than later, because
- they're already used to the plethora of options `find(1)` provides.
- Rather than reinventing the wheel, is there some way `git annex find`
- could harness the power of `find(1)` ?
-Those considerations aside, a combined approach would be to implement
- git annex find --format=...
-and then alter my Perl wrapper to `popen(2)` from that rather than using
-`File::Find`. But I doubt you would want to ship Perl wrappers in the
-distribution, so if you don't provide a Haskell equivalent then users
-who can't code are left high and dry.
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_8_221ed2e53420278072a6d879c6f251d1._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_8_221ed2e53420278072a6d879c6f251d1._comment
deleted file mode 100644
index 5ac292afe..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_8_221ed2e53420278072a6d879c6f251d1._comment
+++ /dev/null
@@ -1,8 +0,0 @@
-[[!comment format=mdwn
- username="http://adamspiers.myopenid.com/"
- nickname="Adam"
- subject="How much memory would it actually use anyway?"
- date="2011-12-22T20:15:22Z"
- content="""
-Another thought - an SHA1 digest is 20 bytes. That means you can fit over 50 million keys into 1GB of RAM. Granted you also need memory to store the values (pathnames) which in many cases will be longer, and some users may also choose more expensive backends than SHA1 ... but even so, it seems to me that you are at risk of throwing the baby out with the bath water.
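The estimate above is simple division, sketched here in Python for checking. The 20-byte figure is the raw SHA1 digest size from the comment; the 2000-byte figure is purely an assumed per-entry cost (digest, pathname, and data-structure overhead) chosen to show what a two-orders-of-magnitude penalty would look like, not a measurement.

```python
GIB = 1024 ** 3  # bytes in 1 GiB


def keys_per_gib(bytes_per_entry):
    """How many entries of the given size fit into 1 GiB."""
    return GIB // bytes_per_entry


if __name__ == "__main__":
    # Raw 20-byte SHA1 digests only: a little over 50 million.
    print(keys_per_gib(20))    # 53687091
    # Assumed 2000 bytes per entry including all overhead: roughly 500k.
    print(keys_per_gib(2000))  # 536870
```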
diff --git a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_9_aecfa896c97b9448f235bce18a40621d._comment b/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_9_aecfa896c97b9448f235bce18a40621d._comment
deleted file mode 100644
index 82c6921eb..000000000
--- a/doc/todo/wishlist:_Provide_a___34__git_annex__34___command_that_will_skip_duplicates/comment_9_aecfa896c97b9448f235bce18a40621d._comment
+++ /dev/null
@@ -1,14 +0,0 @@
-[[!comment format=mdwn
- username="http://joey.kitenet.net/"
- nickname="joey"
- subject="comment 9"
- date="2011-12-23T16:07:39Z"
- content="""
-Adam, to answer a lot of points briefly..
-* --include='*' makes find list files whether their contents are present or not
-* Your perl script is not O(n). Inserting into perl hash tables has overhead of minimum O(n log n). Not counting the overhead of resizing hash tables, the grievous slowdown if the bucket size is overcome by data (it probably falls back to a linked list or something then), and the overhead of traversing the hash tables to get data out.
-* I think that git-annex's set of file matching options is coming along nicely, and new ones can easily be added, so see no need to pull in unix find(1).
-* Your memory size calculations ignore the overhead of a hash table or other data structure to store the data in, which will tend to be more than the actual data size it's storing. I estimate your 50 million number is off by at least one order of magnitude, and more likely two; in any case I don't want git-annex to use 1 gb of ram.
-* Little known fact: sort(1) will use a temp file as a buffer if too much memory is needed to hold the data to sort. It's also written in the most efficient language possible and has been ruthlessly optimised for 30 years, so I would be very surprised if it was not the best choice.