doc/bugs/Linux_standalone__39__s_metadata_--batch_can__39__t_parse_UTF-8/comment_1_1765400777911cc61eb591b76c84ae89._comment


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2016-12-19T20:37:56Z"
 content="""
runshell was recently changed to bypass using the system locales, it
includes its own locale data and attempts to generate a locale definition
file for the locale. The code that did that was failing to notice that
en_GB.UTF-8 was a UTF-8 locale (en_GB.utf8 would work though), which
explains why the locale is not set inside runshell
(git-annex.linux/git-annex is a script that uses runshell). I've corrected
that problem, and verified it fixes the problem you reported.

----

However.. The same thing happens when using LANG=C with git-annex
installed by any method and --json --batch. So the deeper problem is that
it's forcing the batch input to be decoded as utf8 via the current locale.
This happens in Command/MetaData.hs parseJSONInput which uses
`BU.fromString`. 

I tried swapping in `encodeBS` for `BU.fromString`. That prevented the
decoding error, but made git-annex complain that the file was not annexed,
due to a Mojibake problem: 

With `encodeBS`, the input `{"file":"ü.txt"}` is encoded as
`"{\"file\":\"\195\188.txt\"}"`. Aeson parses that input to this:

	JSONActionItem {itemCommand = Nothing, itemKey = Nothing, itemFile = Just "\252.txt", itemAdded = Nothing}

Note that the first two bytes have been
parsed by Aeson as unicode (since JSON is unicode encoded),
yielding character 252 (ü).

In a unicode locale, this works ok, because the encoding layer is able to
convert that unicode character back to two bytes 195 188 
and finds the file on disk. But in a non-unicode locale, it doesn't know
what to do with the unicode character, and in fact it gets discarded
and so it looks for a file named ".txt".

So, to make --batch --json input work in non-unicode locales, it would
need, after parsing the json, to re-encode filenames (and perhaps other
data), from utf8 to the filesystem encoding. I have not yet worked out how
to do that.
"""]]