summaryrefslogtreecommitdiff
path: root/doc/bugs/problems_with_utf8_names.mdwn
diff options
context:
space:
mode:
Diffstat (limited to 'doc/bugs/problems_with_utf8_names.mdwn')
-rw-r--r--doc/bugs/problems_with_utf8_names.mdwn81
1 files changed, 0 insertions, 81 deletions
diff --git a/doc/bugs/problems_with_utf8_names.mdwn b/doc/bugs/problems_with_utf8_names.mdwn
deleted file mode 100644
index aeeb16be6..000000000
--- a/doc/bugs/problems_with_utf8_names.mdwn
+++ /dev/null
@@ -1,81 +0,0 @@
-This bug is reopened to track some new UTF-8 filename issues caused by GHC
-7.4. In this version of GHC, git-annex's hack to support filenames in any
-encoding no longer works. Even unicode filenames fail to work when
-git-annex is built with 7.4. --[[Joey]]
-
-This bug is now fixed in current master. Once again, git-annex will work
-for all filename encodings, and all system encodings. It will
-only build with the new GHC. [[done]] --[[Joey]]
-
-----
-
-Old, now fixed bug report follows:
-
-There are problems with displaying filenames in UTF8 encoding, as shown here:
-
- $ echo $LANG
- en_GB.UTF-8
- $ git init
- $ git annex init test
- [...]
- $ touch "Umlaut Ü.txt"
- $ git annex add Uml*
- add Umlaut Ã.txt ok
- (Recording state in git...)
- $ find -name U\* | hexdump -C
- 00000000 2e 2f 55 6d 6c 61 75 74 20 c3 9c 2e 74 78 74 0a |./Umlaut ...txt.|
- 00000010
- $ git annex find | hexdump -C
- 00000000 55 6d 6c 61 75 74 20 c3 83 c2 9c 2e 74 78 74 0a |Umlaut .....txt.|
- 00000010
- $
-
-It looks like the common latin1-to-UTF8 encoding. Functionality other than otuput seems not to be affected.
-
-> Yes, I believe that git-annex is reading filename data from git
-> as a stream of char8s, and not decoding unicode in it into logical
-> characters.
-> Haskell then I guess, tries to unicode encode it when it's output to
-> the console.
-> This only seems to matter WRT its output to the console; the data
-> does not get mangled internally and so it accesses the right files
-> under the hood.
->
-> I am too new to haskell to really have a handle on how to handle
-> unicode and other encodings issues with it. In general, there are three
-> valid approaches: --[[Joey]]
->
-> 1. Convert all input data to unicode and be unicode clean end-to-end
-> internally. Problimatic here since filenames may not necessarily be
-> encoded in utf-8 (an archive could have historical filenames using
-> varying encodings), and you don't want which files are accessed to
-> depend on locale settings.
-> > I tried to do this by making parts of GitRepo call
-> > Codec.Binary.UTF8.String.decodeString when reading filenames from
-> > git. This seemed to break attempts to operate on the files,
-> > weirdly encoded strings were seen in syscalls in strace.
-> 1. Keep input and internal data un-decoded, but decode it when
-> outputting a filename (assuming the filename is encoded using the
-> user's configured encoding), and allow haskell's output encoding to then
-> encode it according to the user's locale configuration.
-> > This is now implemented. I'm not very happy that I have to watch
-> > out for any place that a filename is output and call `filePathToString`
-> > on it, but there are really not too many such places in git-annex.
-> >
-> > Note that this only affects filenames apparently.
-> > (Names of files in the annex, and also some places where names
-> > of keys are displayed.) Utf-8 in the uuid.map file etc seems
-> > to be handled cleanly.
-> 1. Avoid encodings entirely. Mostly what I'm doing now; probably
-> could find a way to disable encoding of console output. Then the raw
-> filename would be displayed, which should work ok. git-annex does
-> not really need to pull apart filenames; they are almost entirely
-> opaque blobs. I guess that the `--exclude` option is the exception
-> to that, but it is currently not unicode safe anyway. (Update: tried
-> `--exclude` again, seems it is unicode clean..)
-> One other possible
-> issue would be that this could cause problems if git-annex were
-> translated.
-> > On second thought, I switched to this. Any decoding of a filename
-> > is going to make someone unhappy; the previous approach broke
-> > non-utf8 filenames.