Diffstat (limited to 'doc/bugs')
-rw-r--r-- doc/bugs/problems_with_utf8_names.mdwn    | 22
-rw-r--r-- doc/bugs/unhappy_without_UTF8_locale.mdwn | 33
2 files changed, 45 insertions, 10 deletions
diff --git a/doc/bugs/problems_with_utf8_names.mdwn b/doc/bugs/problems_with_utf8_names.mdwn
index 30f3495f4..257f8dff2 100644
--- a/doc/bugs/problems_with_utf8_names.mdwn
+++ b/doc/bugs/problems_with_utf8_names.mdwn
@@ -37,10 +37,22 @@ It looks like the common latin1-to-UTF8 encoding. Functionality other than otupu
 > encoded in utf-8 (an archive could have historical filenames using
 > varying encodings), and you don't want which files are accessed to
 > depend on locale settings.
+> > I tried to do this by making parts of GitRepo call
+> > Codec.Binary.UTF8.String.decodeString when reading filenames from
+> > git. This seemed to break attempts to operate on the files,
+> > weirdly encoded strings were seen in syscalls in strace.
 > 1. Keep input and internal data un-decoded, but decode it when
 > outputting a filename (assuming the filename is encoded using the
 > user's configured encoding), and allow haskell's output encoding to then
 > encode it according to the user's locale configuration.
+> > This is now [[implemented|done]]. I'm not very happy that I have to watch
+> > out for any place that a filename is output and call `showFile`
+> > on it, but there are really not too many such places in git-annex.
+> >
+> > Note that this only affects filenames apparently.
+> > (Names of files in the annex, and also some places where names
+> > of keys are displayed.) Utf-8 in the uuid.map file etc seems
+> > to be handled cleanly.
 > 1. Avoid encodings entirely. Mostly what I'm doing now; probably
 > could find a way to disable encoding of console output. Then the raw
 > filename would be displayed, which should work ok. git-annex does
@@ -50,13 +62,3 @@ It looks like the common latin1-to-UTF8 encoding. Functionality other than otupu
 > One other possible
 > issue would be that this could cause problems if git-annex were
 > translated.
->
-> BTW, for more fun, try unsetting LANG, and then you can see
-> stuff like this:
-
-    joey@gnu:~/tmp/aa>git annex add ./Üa
-    add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
-    argument (Invalid or incomplete multibyte or wide character)
-
-> (Add -q to work around this; once it doesn't need to print the filename,
-> it can act on it ok!)
diff --git a/doc/bugs/unhappy_without_UTF8_locale.mdwn b/doc/bugs/unhappy_without_UTF8_locale.mdwn
new file mode 100644
index 000000000..6f1df4fab
--- /dev/null
+++ b/doc/bugs/unhappy_without_UTF8_locale.mdwn
@@ -0,0 +1,33 @@
+Try unsetting LANG and passing git-annex unicode filenames.
+
+    joey@gnu:~/tmp/aa>git annex add ./Üa
+    add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
+    argument (Invalid or incomplete multibyte or wide character)
+
+The same problem can be seen with a simple haskell program:
+
+    import System.Environment
+    import Codec.Binary.UTF8.String
+    main = do
+        args <- getArgs
+        putStrLn $ decodeString $ args !! 0
+
+    joey@gnu:~/src/git-annex>LANG= runghc ~/foo.hs Ü
+    foo.hs: <stdout>: hPutChar: invalid argument (Invalid or incomplete multibyte or wide character)
+
+(The call to `decodeString` is necessary to make the input
+unicode string be displayed properly in a utf8 locale, but
+does not contribute to this problem.)
+
+I guess that haskell is setting the IO encoding to latin1, which
+is [documented](http://haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#v:latin1)
+to error out on characters > 255.
+
+So this program doesn't have the problem -- but may output garbage
+on non-utf-8 capable terminals:
+
+    import System.IO
+    main = do
+        hSetEncoding stdout utf8
+        args <- getArgs
+        putStrLn $ decodeString $ args !! 0