author Joey Hess <joey@kitenet.net> 2011-12-22 21:23:11 -0400
committer Joey Hess <joey@kitenet.net> 2011-12-23 01:08:19 -0400
commit7227dd8f21f24c2ccadd38e1a3dec7b888a23e92 (patch)
tree23327aedadd128f65cfb09c63abf5de616335fda /doc/tips/finding_duplicate_files.mdwn
parent13a0c292b3bc72917cb8ce89e96f805602e81904 (diff)
add escape_var hack
Makes it easy to find files with duplicate contents, anyway.. :)
Diffstat (limited to 'doc/tips/finding_duplicate_files.mdwn')
-rw-r--r--  doc/tips/finding_duplicate_files.mdwn  21
1 file changed, 21 insertions, 0 deletions
diff --git a/doc/tips/finding_duplicate_files.mdwn b/doc/tips/finding_duplicate_files.mdwn
new file mode 100644
index 000000000..94fc85400
--- /dev/null
+++ b/doc/tips/finding_duplicate_files.mdwn
@@ -0,0 +1,21 @@
+Maybe you had a lot of files scattered around on different drives, and you
+added them all into a single git-annex repository. Some of the files are
+surely duplicates of others.
+
+While git-annex stores the file contents efficiently, it would still
+help in cleaning up this mess if you could find, and perhaps remove,
+the duplicate files.
+
+Here's a command line that will show duplicate sets of files grouped together:
+
+ git annex find --include '*' --format='${file} ${escaped_key}\n' | \
+ sort -k2 | uniq --all-repeated=separate -f1 | \
+ sed 's/ [^ ]*$//'
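+
+To see how the pipeline behaves, here is a minimal illustration run on
+hypothetical output from the `git annex find` step (the file names and
+keys below are made up for the example; no repository is needed):
+
+    printf '%s\n' \
+        'photos/a.jpg SHA256-s100--aaa' \
+        'backup/a.jpg SHA256-s100--aaa' \
+        'docs/b.txt SHA256-s200--bbb' | \
+    sort -k2 | uniq --all-repeated=separate -f1 | \
+    sed 's/ [^ ]*$//'
+
+`sort -k2` puts lines with the same key next to each other,
+`uniq --all-repeated=separate -f1` keeps only the groups whose key
+(everything after the first field) repeats, and the `sed` strips the
+key off again, leaving just the duplicate file names. Note that the
+field counting relies on file names not containing spaces.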
+
+Here's a command line that will remove one of each duplicate set of files:
+
+ git annex find --include '*' --format='${file} ${escaped_key}\n' | \
+ sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
+ xargs -d '\n' git rm
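+
+Run on the same hypothetical input as above (invented names and keys,
+for illustration only), the selection part of this pipeline picks
+exactly one file name per duplicate group:
+
+    printf '%s\n' \
+        'photos/a.jpg SHA256-s100--aaa' \
+        'backup/a.jpg SHA256-s100--aaa' \
+        'docs/b.txt SHA256-s200--bbb' | \
+    sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//'
+
+`uniq --repeated` prints the first line of each group that occurs more
+than once, so only one member of each duplicate set is handed to
+`git rm`, and the remaining copies stay in the repository.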
+
+--[[Joey]]