author Joey Hess <joey@kitenet.net> 2010-12-28 15:28:23 -0400
committer Joey Hess <joey@kitenet.net> 2010-12-28 15:28:23 -0400
commit 6c58a58393a1d6d2257cb9e72e534387561943d7
tree 8a98edbdfa10675ba7b7850670ebf2323f466b80
parent 022e0c7751db23805169c5429851903a3d482cb2
-rw-r--r-- doc/bugs/problems_with_utf8_names.mdwn 42
1 files changed, 42 insertions, 0 deletions
diff --git a/doc/bugs/problems_with_utf8_names.mdwn b/doc/bugs/problems_with_utf8_names.mdwn
index 521f92405..30f3495f4 100644
--- a/doc/bugs/problems_with_utf8_names.mdwn
+++ b/doc/bugs/problems_with_utf8_names.mdwn
@@ -18,3 +18,45 @@ There are problems with displaying filenames in UTF8 encoding, as shown here:
$
It looks like the common latin1-to-UTF8 encoding. Functionality other than output seems not to be affected.
+
+> Yes, I believe that git-annex is reading filename data from git
+> as a stream of char8s, and not decoding unicode in it into logical
+> characters.
+> Haskell then, I guess, tries to unicode-encode it when it's output to
+> the console.
+> This only seems to matter WRT its output to the console; the data
+> does not get mangled internally and so it accesses the right files
+> under the hood.
+>
+> I am too new to haskell to really have a handle on how to deal with
+> unicode and other encoding issues in it. In general, there are three
+> valid approaches: --[[Joey]]
+>
+> 1. Convert all input data to unicode and be unicode clean end-to-end
+> internally. Problematic here since filenames may not necessarily be
+> encoded in utf-8 (an archive could have historical filenames using
+> varying encodings), and you don't want which files are accessed to
+> depend on locale settings.
+> 1. Keep input and internal data un-decoded, but decode it when
+> outputting a filename (assuming the filename is encoded using the
+> user's configured encoding), and allow haskell's output encoding to then
+> encode it according to the user's locale configuration.
+> 1. Avoid encodings entirely. Mostly what I'm doing now; probably
+> could find a way to disable encoding of console output. Then the raw
+> filename would be displayed, which should work ok. git-annex does
+> not really need to pull apart filenames; they are almost entirely
+> opaque blobs. I guess that the `--exclude` option is the exception
+> to that, but it is currently not unicode safe anyway.
+> One other possible
+> issue would be that this could cause problems if git-annex were
+> translated.
+>
+> BTW, for more fun, try unsetting LANG, and then you can see
+> stuff like this:
+
+ joey@gnu:~/tmp/aa>git annex add ./Üa
+ add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
+ argument (Invalid or incomplete multibyte or wide character)
+
+> (Add -q to work around this; once it doesn't need to print the filename,
+> it can act on it ok!)