aboutsummaryrefslogtreecommitdiff
path: root/doc/tips/finding_duplicate_files
diff options
context:
space:
mode:
authorGravatar sameerds <sameerds@web>2013-12-31 10:24:17 +0000
committerGravatar admin <admin@branchable.com>2013-12-31 10:24:17 +0000
commit6ea88f82215edfc34dc3243a00553165bc14bd3b (patch)
tree2d3a82180da61ad65210de5065c109480d8b4c1b /doc/tips/finding_duplicate_files
parentaa75ad1560bba1462947bdb2fa58a297d0c368f0 (diff)
Added a comment: a shell script that handles spaces in file names
Diffstat (limited to 'doc/tips/finding_duplicate_files')
-rw-r--r--doc/tips/finding_duplicate_files/comment_11_5efc6b6ee1dfec88512183e9679ca616._comment24
1 files changed, 24 insertions, 0 deletions
diff --git a/doc/tips/finding_duplicate_files/comment_11_5efc6b6ee1dfec88512183e9679ca616._comment b/doc/tips/finding_duplicate_files/comment_11_5efc6b6ee1dfec88512183e9679ca616._comment
new file mode 100644
index 000000000..03ab1b3c7
--- /dev/null
+++ b/doc/tips/finding_duplicate_files/comment_11_5efc6b6ee1dfec88512183e9679ca616._comment
@@ -0,0 +1,24 @@
+[[!comment format=mdwn
+ username="sameerds"
+ ip="106.51.197.116"
+ subject="a shell script that handles spaces in file names"
+ date="2013-12-31T10:24:06Z"
+ content="""
+I used the following shell pipeline to remove duplicate files in one go:
+
+ (1) git annex find --format='${key}:${file}\n' \
+ (2) | cut -d '-' -f 4- \
+ (3) | sort \
+ (4) | uniq --all-repeated=separate -w 40 \
+ (5) | awk -vRS= -vFS='\n' '{for (i = 2; i <= NF; i++) print $i}' \
+ (6) | cut -d ':' -f 2- \
+ (7) | xargs -d '\n' git rm
+
+1. Generate a list of keys and file names separated by a colon (':').
+2. Cut out the initial part of the key so that the hash is at the beginning of the line. The `-f 4-` ensures that dashes in the filename do not result in truncation.
+3. Sort the entire list.
+4. Uniquify and print duplicates in groups separated by blank lines. Use the first 40 characters, which matches the length of a SHA1 hash. Other hashes will require a different length.
+5. Use awk to print all but the first line in each group. The empty `-vRS` sets blank line as the record separator, and the `-vFS` sets newline as the field separator. The for-loop prints each field except the first.
+6. Cut out the key and keep only the file name by relying on the colon introduced in the first step.
+7. Use xargs to separate file names by newline, which takes care of spaces in the file names. Send this list of arguments to `git rm`.
+"""]]