From 6c58a58393a1d6d2257cb9e72e534387561943d7 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Tue, 28 Dec 2010 15:28:23 -0400
Subject: details..

---
 doc/bugs/problems_with_utf8_names.mdwn | 42 ++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

(limited to 'doc/bugs/problems_with_utf8_names.mdwn')

diff --git a/doc/bugs/problems_with_utf8_names.mdwn b/doc/bugs/problems_with_utf8_names.mdwn
index 521f92405..30f3495f4 100644
--- a/doc/bugs/problems_with_utf8_names.mdwn
+++ b/doc/bugs/problems_with_utf8_names.mdwn
@@ -18,3 +18,45 @@ There are problems with displaying filenames in UTF8 encoding, as shown here:
     $
 
 It looks like the common latin1-to-UTF8 encoding. Functionality other than otuput seems not to be affected.
+
+> Yes, I believe that git-annex is reading filename data from git
+> as a stream of char8s, and not decoding unicode in it into logical
+> characters.
+> Haskell then I guess, tries to unicode encode it when it's output to
+> the console.
+> This only seems to matter WRT its output to the console; the data
+> does not get mangled internally and so it accesses the right files
+> under the hood.
+>
+> I am too new to haskell to really have a handle on how to handle
+> unicode and other encodings issues with it. In general, there are three
+> valid approaches: --[[Joey]]
+>
+> 1. Convert all input data to unicode and be unicode clean end-to-end
+> internally. Problematic here since filenames may not necessarily be
+> encoded in utf-8 (an archive could have historical filenames using
+> varying encodings), and you don't want which files are accessed to
+> depend on locale settings.
+> 1. Keep input and internal data un-decoded, but decode it when
+> outputting a filename (assuming the filename is encoded using the
+> user's configured encoding), and allow haskell's output encoding to then
+> encode it according to the user's locale configuration.
+> 1. Avoid encodings entirely. Mostly what I'm doing now; probably
+> could find a way to disable encoding of console output. Then the raw
+> filename would be displayed, which should work ok. git-annex does
+> not really need to pull apart filenames; they are almost entirely
+> opaque blobs. I guess that the `--exclude` option is the exception
+> to that, but it is currently not unicode safe anyway.
+> One other possible
+> issue would be that this could cause problems if git-annex were
+> translated.
+>
+> BTW, for more fun, try unsetting LANG, and then you can see
+> stuff like this:
+
+    joey@gnu:~/tmp/aa>git annex add ./Üa
+    add add add add git-annex: : commitAndReleaseBuffer: invalid
+    argument (Invalid or incomplete multibyte or wide character)
+
+> (Add -q to work around this; once it doesn't need to print the filename,
+> it can act on it ok!)
-- 
cgit v1.2.3
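
The double encoding described in the reply, and approaches 2 and 3 from its list, can be sketched in a few lines. This is an illustration only: it uses Python rather than Haskell so the byte values are easy to check, and the helper names (`mojibake`, `display_name`, `raw_output`) are hypothetical and do not come from git-annex. It assumes the filename arrives from git as UTF-8 bytes, which, as the reply notes, is not guaranteed for every repository.

```python
import sys

# Hypothetical helpers for illustration; not part of git-annex.


def mojibake(raw: bytes) -> bytes:
    """Reproduce the reported symptom: treat every byte as one character
    (the "char8" / Latin-1 reading), then encode those characters as UTF-8
    on output."""
    return raw.decode("latin-1").encode("utf-8")


def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Approach 2: keep the name as bytes internally and decode it only for
    display. Undecodable bytes become U+FFFD instead of raising an error."""
    return raw.decode(encoding, errors="replace")


def raw_output(raw: bytes) -> bytes:
    """Approach 3: do no decoding at all and hand the bytes to the terminal
    unchanged."""
    return raw


# UTF-8 bytes for the filename "Üa" from the transcript (assumed encoding).
name = b"\xc3\x9c" + b"a"

if __name__ == "__main__":
    # Writing to the binary buffer bypasses the console text encoding.
    sys.stdout.buffer.write(raw_output(name) + b"\n")
```

Here `mojibake(name)` yields the five bytes `C3 83 C2 9C 61`, which is the "latin1-to-UTF8" pattern from the bug report, while `raw_output(name)` leaves the original three bytes untouched. Approach 2 only displays correctly when the assumed encoding matches the one the filename was written in; a Latin-1 name decoded as UTF-8 shows a replacement character instead.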