author | Joey Hess <joey@kitenet.net> | 2011-02-10 14:21:44 -0400 |
---|---|---|
committer | Joey Hess <joey@kitenet.net> | 2011-02-10 14:21:44 -0400 |
commit | fe55b4644e67bba60b35e07abcdd312b65c9d6f3 (patch) | |
tree | 4631f428f86f72d614f9b5388772b6ec58a3fb8d /doc | |
parent | e7a3475704f5366e89aebe78cefbeb58ff5ab181 (diff) |
Fix display of unicode filenames.
Internally, the filenames are stored as un-decoded unicode.
I tried decoding them, but then haskell tries to access the wrong files.
Hmm.
So, I've unhappily chosen option "B", which is to decode filenames before
they are displayed.
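
The approach described in the commit message above (keep filenames undecoded internally, decode them only for display) can be sketched roughly as follows. This is a minimal illustration and not git-annex's actual `showFile`: the name `displayName` is hypothetical, the decoder is hand-written so the example needs only `base`, and it assumes, as the bug report implies, that each `Char` of a filename carries one raw byte. Malformed sequences are passed through unchanged, and overlong encodings are not rejected.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Hypothetical display-only decoder (a stand-in for the idea behind
-- `showFile`, not its real implementation). Each input Char is treated as
-- one raw byte; valid UTF-8 sequences are combined into one code point.
displayName :: String -> String
displayName [] = []
displayName (c:cs)
  | b < 0x80              = c : displayName cs       -- plain ASCII
  | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F)     -- 2-byte sequence
  | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F)     -- 3-byte sequence
  | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07)     -- 4-byte sequence
  | otherwise             = c : displayName cs       -- not a lead byte
  where
    b = ord c
    -- Consume n continuation bytes; fall back to the raw byte on failure.
    multi n lead =
      let (conts, rest) = splitAt n cs
          cp = foldl step lead conts
      in if length conts == n && all isCont conts && cp <= 0x10FFFF
           then chr cp : displayName rest
           else c : displayName cs
    step acc x = (acc `shiftL` 6) .|. (ord x .&. 0x3F)
    isCont x = let o = ord x in o >= 0x80 && o < 0xC0

main :: IO ()
main = do
  -- The two UTF-8 bytes of U+00DC (Ü), followed by 'a'.
  let raw = "\195\156a"
  print (map ord raw)                -- prints [195,156,97]
  print (map ord (displayName raw))  -- prints [220,97]
```

The raw string is what would be handed to file operations; only the result of `displayName` would be printed. The example prints code points rather than the characters themselves so that its output does not depend on the locale.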
Diffstat (limited to 'doc')
-rw-r--r-- | doc/bugs/problems_with_utf8_names.mdwn | 22 |
-rw-r--r-- | doc/bugs/unhappy_without_UTF8_locale.mdwn | 33 |
2 files changed, 45 insertions, 10 deletions
diff --git a/doc/bugs/problems_with_utf8_names.mdwn b/doc/bugs/problems_with_utf8_names.mdwn
index 30f3495f4..257f8dff2 100644
--- a/doc/bugs/problems_with_utf8_names.mdwn
+++ b/doc/bugs/problems_with_utf8_names.mdwn
@@ -37,10 +37,22 @@ It looks like the common latin1-to-UTF8 encoding. Functionality other than otupu
 > encoded in utf-8 (an archive could have historical filenames using
 > varying encodings), and you don't want which files are accessed to
 > depend on locale settings.
+> > I tried to do this by making parts of GitRepo call
+> > Codec.Binary.UTF8.String.decodeString when reading filenames from
+> > git. This seemed to break attempts to operate on the files,
+> > weirdly encoded strings were seen in syscalls in strace.
 > 1. Keep input and internal data un-decoded, but decode it when
 > outputting a filename (assuming the filename is encoded using the
 > user's configured encoding), and allow haskell's output encoding to then
 > encode it according to the user's locale configuration.
+> > This is now [[implemented|done]]. I'm not very happy that I have to watch
+> > out for any place that a filename is output and call `showFile`
+> > on it, but there are really not too many such places in git-annex.
+> >
+> > Note that this only affects filenames apparently.
+> > (Names of files in the annex, and also some places where names
+> > of keys are displayed.) Utf-8 in the uuid.map file etc seems
+> > to be handled cleanly.
 > 1. Avoid encodings entirely. Mostly what I'm doing now; probably
 > could find a way to disable encoding of console output. Then the raw
 > filename would be displayed, which should work ok. git-annex does
@@ -50,13 +62,3 @@ It looks like the common latin1-to-UTF8 encoding. Functionality other than otupu
 > One other possible
 > issue would be that this could cause problems if git-annex were
 > translated.
->
-> BTW, for more fun, try unsetting LANG, and then you can see
-> stuff like this:
-
-    joey@gnu:~/tmp/aa>git annex add ./Üa
-    add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
-    argument (Invalid or incomplete multibyte or wide character)
-
-> (Add -q to work around this; once it doesn't need to print the filename,
-> it can act on it ok!)
diff --git a/doc/bugs/unhappy_without_UTF8_locale.mdwn b/doc/bugs/unhappy_without_UTF8_locale.mdwn
new file mode 100644
index 000000000..6f1df4fab
--- /dev/null
+++ b/doc/bugs/unhappy_without_UTF8_locale.mdwn
@@ -0,0 +1,33 @@
+Try unsetting LANG and passing git-annex unicode filenames.
+
+    joey@gnu:~/tmp/aa>git annex add ./Üa
+    add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
+    argument (Invalid or incomplete multibyte or wide character)
+
+The same problem can be seen with a simple haskell program:
+
+    import System.Environment
+    import Codec.Binary.UTF8.String
+    main = do
+        args <- getArgs
+        putStrLn $ decodeString $ args !! 0
+
+    joey@gnu:~/src/git-annex>LANG= runghc ~/foo.hs Ü
+    foo.hs: <stdout>: hPutChar: invalid argument (Invalid or incomplete multibyte or wide character)
+
+(The call to `decodeString` is necessary to make the input
+unicode string be displayed properly in a utf8 locale, but
+does not contribute to this problem.)
+
+I guess that haskell is setting the IO encoding to latin1, which
+is [documented](http://haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#v:latin1)
+to error out on characters > 255.
+
+So this program doesn't have the problem -- but may output garbage
+on non-utf-8 capable terminals:
+
+    import System.IO
+    main = do
+        hSetEncoding stdout utf8
+        args <- getArgs
+        putStrLn $ decodeString $ args !! 0
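
The workaround at the end of the new bug report, forcing the encoding of `stdout` to UTF-8, can be tried in isolation with a program that needs only `base`. This is a sketch, not the program from the report: a string literal stands in for the decoded command-line argument (an assumption made so the example does not depend on how a given GHC version decodes `getArgs`), and no `utf8-string` import is needed.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

main :: IO ()
main = do
  -- Force UTF-8 on stdout so the output encoding does not follow LANG.
  hSetEncoding stdout utf8
  -- "\220" is U+00DC (Ü); it stands in for an already-decoded filename.
  putStrLn "\220a"
```

Run with `LANG=` unset, this should write the UTF-8 bytes for `Üa` rather than failing in `hPutChar`. As the report notes, a terminal that is not UTF-8 capable may then display garbage.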