author Joey Hess <joeyh@joeyh.name> 2018-03-05 11:25:01 -0400
committer Joey Hess <joeyh@joeyh.name> 2018-03-05 11:25:01 -0400
commit da3f2ee6994daafe58b890c3fb87ccf5ef61f3f2
tree 661efe702c741449882fd21e1840dae1b1548253
parent df575f0db7c945a26735d0944b05c7e989cdfcda
Improve SHA*E extension extraction code
Do not treat parts of the filename that contain punctuation or other non-alphanumeric characters as extensions. Before, such characters were filtered out.

Note that in 38bd7ca3cce455c20edcee656c706939087c6a69 "foo.ba__________r" was munged to ".bar" and so incorrectly treated as an extension. That was fixed by changing the filter order, but not allowing punctuation seems a better fix.

This assumes that extensions containing punctuation are rare. "_" seems the most likely character; I used it in ikiwiki "._comment" files. But I can't recall seeing it anywhere else. It certainly seems that no commonly used extensions contain punctuation. If git-annex doesn't treat "._comment" as an extension, it's not likely to break software that expects to see that extension like some software expects to see .epub or .mp3.

This commit was sponsored by Jack Hill on Patreon.
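
The difference between filtering characters out of each filename component and rejecting whole components, as described above, can be sketched roughly as follows. This is a standalone illustration and not the code in Backend/Hash.hs. `validInExtension` and `splitOn` are assumed stand-ins for git-annex helpers that the diff does not show, `extensionsOf` is a simplified replacement for `takeExtensions` that only handles bare filenames, and `oldSelect`, `newSelect`, `candidates` and `render` are hypothetical names.

```haskell
import Data.Char (isAlphaNum)
import Data.List (intercalate)

-- Assumption: stand-in for git-annex's validInExtension, which this diff
-- does not show. Here a character is valid if it is alphanumeric or '.'.
validInExtension :: Char -> Bool
validInExtension c = isAlphaNum c || c == '.'

-- Assumption: stand-in for the splitc helper. Splits on every occurrence
-- of the separator and keeps empty components.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
    (a, [])       -> [a]
    (a, _ : rest) -> a : splitOn c rest

-- Simplified takeExtensions: everything from the first '.' onwards.
-- Assumes a bare filename with no directory part.
extensionsOf :: FilePath -> String
extensionsOf = dropWhile (/= '.')

-- Components after the first '.', last one first, cut off at the first
-- component longer than 4 characters ("jpeg" still fits).
candidates :: FilePath -> [String]
candidates = takeWhile shortEnough . reverse . splitOn '.' . extensionsOf
  where
    shortEnough e = length e <= 4

-- Put the kept components back in filename order and join them with dots.
render :: [String] -> String
render parts = case filter (not . null) (reverse parts) of
    [] -> ""
    es -> intercalate "." ("" : es)

-- Behaviour before this commit: strip invalid characters from each component.
oldSelect :: FilePath -> String
oldSelect = render . take 2 . map (filter validInExtension) . candidates

-- Behaviour after this commit: drop any component containing an invalid character.
newSelect :: FilePath -> String
newSelect = render . take 2 . filter (all validInExtension) . candidates
```

Loaded into GHCi, `oldSelect "song (feat. xy).mp3"` evaluates to `".xy.mp3"` while `newSelect` gives `".mp3"` for the same name; both return `".tar.gz"` for `"archive.tar.gz"`.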
-rw-r--r-- Backend/Hash.hs 2
-rw-r--r-- CHANGELOG 3
-rw-r--r-- doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn 2
-rw-r--r-- doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/comment_5_7f5a6ba6ed7b6f720874f8ded6edaa3c._comment 28
4 files changed, 34 insertions, 1 deletions
diff --git a/Backend/Hash.hs b/Backend/Hash.hs
index da0f7df9b..1d5436823 100644
--- a/Backend/Hash.hs
+++ b/Backend/Hash.hs
@@ -94,7 +94,7 @@ selectExtension f
| otherwise = intercalate "." ("":es)
where
es = filter (not . null) $ reverse $
- take 2 $ map (filter validInExtension) $
+ take 2 $ filter (all validInExtension) $
takeWhile shortenough $
reverse $ splitc '.' $ takeExtensions f
shortenough e = length e <= 4 -- long enough for "jpeg"
diff --git a/CHANGELOG b/CHANGELOG
index e78ff93be..38a947116 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -3,6 +3,9 @@ git-annex (6.20180228) UNRELEASED; urgency=medium
* Support exporttree=yes for rsync special remotes.
* Dial back optimisation when building on arm, which prevents
ghc and llc from running out of memory when optimising some files.
+ * Improve SHA*E extension extraction code to not treat parts of the
+ filename that contain punctuation or other non-alphanumeric characters
+ as extensions. Before, such characters were filtered out.
-- Joey Hess <id@joeyh.name> Wed, 28 Feb 2018 11:53:03 -0400
diff --git a/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn b/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn
index 84ca70bea..0534925ea 100644
--- a/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn
+++ b/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn
@@ -3,6 +3,8 @@ Files with special unicode characters(in this case japanese) for some reason hav
This is an issue because it causes errors when using glacier-cli when uploading copies to Glacier vault.
+[[!meta title="kanji in key extension cause glacier-cli upload error"]]
+
### What steps will reproduce the problem?
Here's how it looks for me:
diff --git a/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/comment_5_7f5a6ba6ed7b6f720874f8ded6edaa3c._comment b/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/comment_5_7f5a6ba6ed7b6f720874f8ded6edaa3c._comment
new file mode 100644
index 000000000..1d8e1cabe
--- /dev/null
+++ b/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/comment_5_7f5a6ba6ed7b6f720874f8ded6edaa3c._comment
@@ -0,0 +1,28 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 5"""
+ date="2018-03-05T14:47:20Z"
+ content="""
+The easy workaround to bugs like this migrate the file to the
+SHA256 backend rather than SHA256E.
+
+It may be obvious to us that a file ending in "(feat. xy).mp3"
+has an extension of ".mp3" and not of ". xy).mp3", but this is not very
+obvious to git-annex, which would like to treat a file ending in ".tar.gz"
+as having that compound extension.
+
+The only rule I can think of that would help git-annex understand this is
+to not allow punctuation (other than "." in file extensions). Which it
+actually already filters out of extensions, which is why the extension it
+comes up with is ".xy.mp3". But it could notice the space and closing paren
+in the filename and assume those are not part of an extension. It might
+bite some file with an extension like .foo_", I can't recall seeing many
+such extensions. Ok, made this change.
+
+It remains a bug in the glacier special remote if unicode characters
+prevent uploading to it. We can't limit file
+extensions to ascii, it's perfectly reasonable to use your native language
+characters in a file extension. Leaving bug open since my change does
+nothing about whatever upload bug glacier-cli has. Is the python program
+failing?
+"""]]