author | Joey Hess <joeyh@joeyh.name> | 2018-03-05 11:25:01 -0400 |
---|---|---|
committer | Joey Hess <joeyh@joeyh.name> | 2018-03-05 11:25:01 -0400 |
commit | da3f2ee6994daafe58b890c3fb87ccf5ef61f3f2 | |
tree | 661efe702c741449882fd21e1840dae1b1548253 | |
parent | df575f0db7c945a26735d0944b05c7e989cdfcda | |
Improve SHA*E extension extraction code

Do not treat parts of the filename that contain punctuation or other
non-alphanumeric characters as extensions. Before, such characters were
filtered out.

Note that in 38bd7ca3cce455c20edcee656c706939087c6a69 "foo.ba__________r"
was munged to ".bar" and so incorrectly treated as an extension. That was
fixed by changing the filter order, but not allowing punctuation seems a
better fix.

This assumes that extensions containing punctuation are rare. "_" seems the
most likely character; I used it in ikiwiki "._comment" files. But I can't
recall seeing it anywhere else. It certainly seems that no commonly used
extensions contain punctuation. If git-annex doesn't treat "._comment"
as an extension, it's not likely to break software that expects to see that
extension, the way some software expects to see .epub or .mp3.

This commit was sponsored by Jack Hill on Patreon.
4 files changed, 34 insertions, 1 deletions
diff --git a/Backend/Hash.hs b/Backend/Hash.hs
index da0f7df9b..1d5436823 100644
--- a/Backend/Hash.hs
+++ b/Backend/Hash.hs
@@ -94,7 +94,7 @@ selectExtension f
 	| otherwise = intercalate "." ("":es)
   where
 	es = filter (not . null) $ reverse $
-		take 2 $ map (filter validInExtension) $
+		take 2 $ filter (all validInExtension) $
 		takeWhile shortenough $
 		reverse $ splitc '.' $ takeExtensions f
 	shortenough e = length e <= 4 -- long enough for "jpeg"
@@ -3,6 +3,9 @@ git-annex (6.20180228) UNRELEASED; urgency=medium
   * Support exporttree=yes for rsync special remotes.
   * Dial back optimisation when building on arm, which prevents
     ghc and llc from running out of memory when optimising some files.
+  * Improve SHA*E extension extraction code to not treat parts of the
+    filename that contain punctuation or other non-alphanumeric characters
+    as extensions. Before, such characters were filtered out.
 
  -- Joey Hess <id@joeyh.name>  Wed, 28 Feb 2018 11:53:03 -0400
 
diff --git a/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn b/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn
index 84ca70bea..0534925ea 100644
--- a/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn
+++ b/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum.mdwn
@@ -3,6 +3,8 @@ Files with special unicode characters(in this case japanese) for some reason hav
 This is an issue because it causes errors when using glacier-cli when uploading copies to Glacier vault.
 
+[[!meta title="kanji in key extension cause glacier-cli upload error"]]
+
 ### What steps will reproduce the problem?
 
 Here's how it looks for me:
 
diff --git a/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/comment_5_7f5a6ba6ed7b6f720874f8ded6edaa3c._comment b/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/comment_5_7f5a6ba6ed7b6f720874f8ded6edaa3c._comment
new file mode 100644
index 000000000..1d8e1cabe
--- /dev/null
+++ b/doc/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/comment_5_7f5a6ba6ed7b6f720874f8ded6edaa3c._comment
@@ -0,0 +1,28 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 5"""
+ date="2018-03-05T14:47:20Z"
+ content="""
+The easy workaround to bugs like this migrate the file to the
+SHA256 backend rather than SHA256E.
+
+It may be obvious to us that a file ending in "(feat. xy).mp3"
+has an extension of ".mp3" and not of ". xy).mp3", but this is not very
+obvious to git-annex, which would like to treat a file ending in ".tar.gz"
+as having that compound extension.
+
+The only rule I can think of that would help git-annex understand this is
+to not allow punctuation (other than "." in file extensions). Which it
+actually already filters out of extensions, which is why the extension it
+comes up with is ".xy.mp3". But it could notice the space and closing paren
+in the filename and assume those are not part of an extension. It might
+bite some file with an extension like .foo_", I can't recall seeing many
+such extensions. Ok, made this change.
+
+It remains a bug in the glacier special remote if unicode characters
+prevent uploading to it. We can't limit file
+extensions to ascii, it's perfectly reasonable to use your native language
+characters in a file extension. Leaving bug open since my change does
+nothing about whatever upload bug glacier-cli has. Is the python program
+failing?
+"""]]
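
The behaviour change in the Backend/Hash.hs hunk can be tried outside git-annex with a minimal sketch like the one below. It is not the git-annex source: `validInExtension` and `splitc` are stand-ins written for this example (the diff does not show their definitions), the `null es` guard is assumed because the hunk only shows the `otherwise` branch, and `selectWith`, `selectOld` and `selectNew` are hypothetical names for the shared pipeline and its before/after variants.

```haskell
import Data.Char (isAlphaNum)
import Data.List (intercalate)
import System.FilePath (takeExtensions)

-- Assumption: stand-in for git-annex's validInExtension, which the diff
-- does not show. Here only alphanumeric characters are valid.
validInExtension :: Char -> Bool
validInExtension = isAlphaNum

-- Hypothetical stand-in for git-annex's splitc helper: split a string
-- on a separator character, keeping empty pieces.
splitc :: Char -> String -> [String]
splitc c s = case break (== c) s of
    (piece, []) -> [piece]
    (piece, _ : rest) -> piece : splitc c rest

-- The pipeline from the hunk, with the one changed step passed in as
-- `clean`. The `null es` guard is an assumption.
selectWith :: ([String] -> [String]) -> FilePath -> String
selectWith clean f
    | null es = ""
    | otherwise = intercalate "." ("" : es)
  where
    es = filter (not . null) $ reverse $
        take 2 $ clean $
        takeWhile shortenough $
        reverse $ splitc '.' $ takeExtensions f
    shortenough e = length e <= 4 -- long enough for "jpeg"

-- Before the commit: invalid characters are stripped out of each part.
selectOld :: FilePath -> String
selectOld = selectWith (map (filter validInExtension))

-- After the commit: a part containing any invalid character is dropped.
selectNew :: FilePath -> String
selectNew = selectWith (filter (all validInExtension))

main :: IO ()
main = mapM_ report ["foo (feat. xy).mp3", "foo.tar.gz"]
  where
    report f = putStrLn $
        show f ++ ": old=" ++ show (selectOld f) ++ " new=" ++ show (selectNew f)
```

Under these assumptions, `selectOld "foo (feat. xy).mp3"` evaluates to `".xy.mp3"` and `selectNew "foo (feat. xy).mp3"` to `".mp3"`, in line with the comment in the diff, while both variants return `".tar.gz"` for `"foo.tar.gz"`.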