There are problems with displaying filenames in UTF8 encoding, as shown here: $ echo $LANG en_GB.UTF-8 $ git init $ git annex init test [...] $ touch "Umlaut Ü.txt" $ git annex add Uml* add Umlaut Ã.txt ok (Recording state in git...) $ find -name U\* | hexdump -C 00000000 2e 2f 55 6d 6c 61 75 74 20 c3 9c 2e 74 78 74 0a |./Umlaut ...txt.| 00000010 $ git annex find | hexdump -C 00000000 55 6d 6c 61 75 74 20 c3 83 c2 9c 2e 74 78 74 0a |Umlaut .....txt.| 00000010 $ It looks like the common latin1-to-UTF8 encoding. Functionality other than otuput seems not to be affected. > Yes, I believe that git-annex is reading filename data from git > as a stream of char8s, and not decoding unicode in it into logical > characters. > Haskell then I guess, tries to unicode encode it when it's output to > the console. > This only seems to matter WRT its output to the console; the data > does not get mangled internally and so it accesses the right files > under the hood. > > I am too new to haskell to really have a handle on how to handle > unicode and other encodings issues with it. In general, there are three > valid approaches: --[[Joey]] > > 1. Convert all input data to unicode and be unicode clean end-to-end > internally. Problimatic here since filenames may not necessarily be > encoded in utf-8 (an archive could have historical filenames using > varying encodings), and you don't want which files are accessed to > depend on locale settings. > > I tried to do this by making parts of GitRepo call > > Codec.Binary.UTF8.String.decodeString when reading filenames from > > git. This seemed to break attempts to operate on the files, > > weirdly encoded strings were seen in syscalls in strace. > 1. Keep input and internal data un-decoded, but decode it when > outputting a filename (assuming the filename is encoded using the > user's configured encoding), and allow haskell's output encoding to then > encode it according to the user's locale configuration. > > This is now [[implemented|done]]. I'm not very happy that I have to watch > > out for any place that a filename is output and call `filePathToString` > > on it, but there are really not too many such places in git-annex. > > > > Note that this only affects filenames apparently. > > (Names of files in the annex, and also some places where names > > of keys are displayed.) Utf-8 in the uuid.map file etc seems > > to be handled cleanly. > 1. Avoid encodings entirely. Mostly what I'm doing now; probably > could find a way to disable encoding of console output. Then the raw > filename would be displayed, which should work ok. git-annex does > not really need to pull apart filenames; they are almost entirely > opaque blobs. I guess that the `--exclude` option is the exception > to that, but it is currently not unicode safe anyway. (Update: tried > `--exclude` again, seems it is unicode clean..) > One other possible > issue would be that this could cause problems if git-annex were > translated. ---- Simpler test case:
import Codec.Binary.UTF8.String
import System.Environment

main = do
        args <- getArgs
        let file = decodeString $ head args
        putStrLn $ "file is: " ++ file
        putStr =<< readFile file
If I pass this a filename like 'ü', it will fail, and notice the bad encoding of the filename in the error message:
$ echo hi > ü; runghc foo.hs ü
file is: ü
foo.hs: �: openFile: does not exist (No such file or directory)
On the other hand, if I remove the decodeString, it prints the filename wrong, while accessing it right:
$ runghc foo.hs ü
file is: üa
hi
The only way that seems to consistently work is to delay decoding the filename to places where it's output. But then it's easy to miss some.