diff options
Diffstat (limited to 'doc/bugs/problems_with_utf8_names.mdwn')
-rw-r--r-- | doc/bugs/problems_with_utf8_names.mdwn | 81 |
1 files changed, 0 insertions, 81 deletions
diff --git a/doc/bugs/problems_with_utf8_names.mdwn b/doc/bugs/problems_with_utf8_names.mdwn deleted file mode 100644 index aeeb16be6..000000000 --- a/doc/bugs/problems_with_utf8_names.mdwn +++ /dev/null @@ -1,81 +0,0 @@ -This bug is reopened to track some new UTF-8 filename issues caused by GHC -7.4. In this version of GHC, git-annex's hack to support filenames in any -encoding no longer works. Even unicode filenames fail to work when -git-annex is built with 7.4. --[[Joey]] - -This bug is now fixed in current master. Once again, git-annex will work -for all filename encodings, and all system encodings. It will -only build with the new GHC. [[done]] --[[Joey]] - ----- - -Old, now fixed bug report follows: - -There are problems with displaying filenames in UTF8 encoding, as shown here: - - $ echo $LANG - en_GB.UTF-8 - $ git init - $ git annex init test - [...] - $ touch "Umlaut Ü.txt" - $ git annex add Uml* - add Umlaut Ã.txt ok - (Recording state in git...) - $ find -name U\* | hexdump -C - 00000000 2e 2f 55 6d 6c 61 75 74 20 c3 9c 2e 74 78 74 0a |./Umlaut ...txt.| - 00000010 - $ git annex find | hexdump -C - 00000000 55 6d 6c 61 75 74 20 c3 83 c2 9c 2e 74 78 74 0a |Umlaut .....txt.| - 00000010 - $ - -It looks like the common latin1-to-UTF8 encoding. Functionality other than otuput seems not to be affected. - -> Yes, I believe that git-annex is reading filename data from git -> as a stream of char8s, and not decoding unicode in it into logical -> characters. -> Haskell then I guess, tries to unicode encode it when it's output to -> the console. -> This only seems to matter WRT its output to the console; the data -> does not get mangled internally and so it accesses the right files -> under the hood. -> -> I am too new to haskell to really have a handle on how to handle -> unicode and other encodings issues with it. In general, there are three -> valid approaches: --[[Joey]] -> -> 1. Convert all input data to unicode and be unicode clean end-to-end -> internally. Problimatic here since filenames may not necessarily be -> encoded in utf-8 (an archive could have historical filenames using -> varying encodings), and you don't want which files are accessed to -> depend on locale settings. -> > I tried to do this by making parts of GitRepo call -> > Codec.Binary.UTF8.String.decodeString when reading filenames from -> > git. This seemed to break attempts to operate on the files, -> > weirdly encoded strings were seen in syscalls in strace. -> 1. Keep input and internal data un-decoded, but decode it when -> outputting a filename (assuming the filename is encoded using the -> user's configured encoding), and allow haskell's output encoding to then -> encode it according to the user's locale configuration. -> > This is now implemented. I'm not very happy that I have to watch -> > out for any place that a filename is output and call `filePathToString` -> > on it, but there are really not too many such places in git-annex. -> > -> > Note that this only affects filenames apparently. -> > (Names of files in the annex, and also some places where names -> > of keys are displayed.) Utf-8 in the uuid.map file etc seems -> > to be handled cleanly. -> 1. Avoid encodings entirely. Mostly what I'm doing now; probably -> could find a way to disable encoding of console output. Then the raw -> filename would be displayed, which should work ok. git-annex does -> not really need to pull apart filenames; they are almost entirely -> opaque blobs. I guess that the `--exclude` option is the exception -> to that, but it is currently not unicode safe anyway. (Update: tried -> `--exclude` again, seems it is unicode clean..) -> One other possible -> issue would be that this could cause problems if git-annex were -> translated. -> > On second thought, I switched to this. Any decoding of a filename -> > is going to make someone unhappy; the previous approach broke -> > non-utf8 filenames. |