aboutsummaryrefslogtreecommitdiff
path: root/doc/bugs/Linux_standalone__39__s_metadata_--batch_can__39__t_parse_UTF-8/comment_1_1765400777911cc61eb591b76c84ae89._comment
diff options
context:
space:
mode:
Diffstat (limited to 'doc/bugs/Linux_standalone__39__s_metadata_--batch_can__39__t_parse_UTF-8/comment_1_1765400777911cc61eb591b76c84ae89._comment')
-rw-r--r--doc/bugs/Linux_standalone__39__s_metadata_--batch_can__39__t_parse_UTF-8/comment_1_1765400777911cc61eb591b76c84ae89._comment45
1 files changed, 45 insertions, 0 deletions
diff --git a/doc/bugs/Linux_standalone__39__s_metadata_--batch_can__39__t_parse_UTF-8/comment_1_1765400777911cc61eb591b76c84ae89._comment b/doc/bugs/Linux_standalone__39__s_metadata_--batch_can__39__t_parse_UTF-8/comment_1_1765400777911cc61eb591b76c84ae89._comment
new file mode 100644
index 000000000..4a15b1987
--- /dev/null
+++ b/doc/bugs/Linux_standalone__39__s_metadata_--batch_can__39__t_parse_UTF-8/comment_1_1765400777911cc61eb591b76c84ae89._comment
@@ -0,0 +1,45 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2016-12-19T20:37:56Z"
+ content="""
+runshell was recently changed to bypass using the system locales, it
+includes its own locale data and attempts to generate a locale definition
+file for the locale. The code that did that was failing to notice that
+en_GB.UTF-8 was a UTF-8 locale (en_GB.utf8 would work though), which
+explains why the locale is not set inside runshell
+(git-annex.linux/git-annex is a script that uses runshell). I've corrected
+that problem, and verified it fixes the problem you reported.
+
+----
+
+However.. The same thing happens when using LANG=C with git-annex
+installed by any method and --json --batch. So the deeper problem is that
+it's forcing the batch input to be decoded as utf8 via the current locale.
+This happens in Command/MetaData.hs parseJSONInput which uses
+`BU.fromString`.
+
+I tried swapping in `encodeBS` for `BU.fromString`. That prevented the
+decoding error, but made git-annex complain that the file was not annexed,
+due to a Mojibake problem:
+
+With `encodeBS`, the input `{"file":"ü.txt"}` is encoded as
+`"{\"file\":\"\195\188.txt\"}"`. Aeson parses that input to this:
+
+ JSONActionItem {itemCommand = Nothing, itemKey = Nothing, itemFile = Just "\252.txt", itemAdded = Nothing}
+
+Note that the first two bytes have been
+parsed by Aeson as unicode (since JSON is unicode encoded),
+yielding character 252 (ü).
+
+In a unicode locale, this works ok, because the encoding layer is able to
+convert that unicode character back to two bytes 195 188
+and finds the file on disk. But in a non-unicode locale, it doesn't know
+what to do with the unicode character, and in fact it gets discarded
+and so it looks for a file named ".txt".
+
+So, to make --batch --json input work in non-unicode locales, it would
+need, after parsing the json, to re-encode filenames (and perhaps other
+data), from utf8 to the filesystem encoding. I have not yet worked out how
+to do that.
+"""]]