path: root/doc/bugs/problems_with_utf8_names.mdwn
author: Joey Hess <joey@kitenet.net> 2012-02-02 15:41:22 -0400
committer: Joey Hess <joey@kitenet.net> 2012-02-02 15:41:22 -0400
commit: 828df56453d9b0a1483d5c85e6ca739b158883d3
tree: 35fd016a078791c18b100aad8f7789df2bf9d143 /doc/bugs/problems_with_utf8_names.mdwn
parent: fc8a1d213b3683253923529918c84c91a75448fa
update; newghc-edges branch
Diffstat (limited to 'doc/bugs/problems_with_utf8_names.mdwn')
-rw-r--r-- doc/bugs/problems_with_utf8_names.mdwn 73
1 files changed, 49 insertions, 24 deletions
diff --git a/doc/bugs/problems_with_utf8_names.mdwn b/doc/bugs/problems_with_utf8_names.mdwn
index c9ca1e3b0..ac110a6ae 100644
--- a/doc/bugs/problems_with_utf8_names.mdwn
+++ b/doc/bugs/problems_with_utf8_names.mdwn
@@ -1,35 +1,60 @@
This bug is reopened to track some new UTF-8 filename issues caused by GHC
7.4. In this version of GHC, git-annex's hack to support filenames in any
encoding no longer works. Even unicode filenames fail to work when
-git-annex is built with 7.4. --[[Joey]]
-
-> What's going on exactly? The new ghc, when presented with
-> a String of raw bytes like "fo\194\161", and asked to do
-> something like `getSymbolicLinkStatus`, encodes it
-> as unicode, yielding "fo\303\202\302\241". Which is
-> not the same as the original filename.
-
-The new ghc requires a new data type, `RawFilePath` be used if you
-don't want to impose utf-8 filenames on your users. I have a `newghc` branch
-in git where I am trying to convert it to use `RawFilePath`. However, since
-there is no way to cast a `FilePath` to a `RawFilePath` or back (because
-the encoding of `RawFilePath` is not specified), this means changing
-essentially all of git-annex. Even the filenames used for keys in
-`.git/annex/objects` need to use the new data type. --[[Joey]]
-
-> Actually it may not be that bad. A `RawFilePath` contains only bytes,
-> so it can be cast to a string, containing encoded characters. That
-> string can then be 1) output in binary mode or 2) manipulated
-> in ways that do not add characters larger than 255, and cast back to
-> a `RawFilePath`. While not type-safe, such casts should at least
-> help during bootstrapping, and might allow for a quick fix that only
-> changes to `RawFilePath` at the edges.
+git-annex is built with 7.4. --[[Joey]]
**As a stopgap workaround**, I have made a branch `unicode-only`. This
makes git-annex work with unicode filenames with ghc 7.4, but *only*
unicode filenames. If you have filenames with some other encoding, you're
out in the cold, and it will probably just crash with an error about wrong
-encoding. --[[Joey]]
+encoding.
+
+## analysis
+
+What's going on exactly? The new ghc, when presented with
+a String of raw bytes like "fo\194\161", and asked to do
+something like `getSymbolicLinkStatus`, encodes it
+as unicode, yielding "fo\303\202\302\241". Which is
+not the same as the original filename, assuming it was "fo¡".
+
+The new ghc requires that a new data type, `RawFilePath`, be used if you
+don't want to impose utf-8 filenames on your users.
+
+The available `RawFilePath` support is quite low-level, so all the nice
+readFile and writeFile code, etc has to be reimplemented. So do any utility
+libraries that do things with FilePaths, if you need them to use
+RawFilePaths. Until the haskell ecosystem adapts to `RawFilePath`
+(if it does), using it broadly, as git-annex needs to, will be difficult.
+
+## newghc branch
+
+I have a `newghc` branch in git where I am trying to convert it to use
+`RawFilePath`. However, since there is no way to cast a `FilePath` to a
+`RawFilePath` or back (because the encoding of `RawFilePath` is not
+specified), this means changing essentially all of git-annex. Even the
+filenames used for keys in `.git/annex/objects` need to use the new data
+type. I didn't get very far on this branch.
+
+## newghc-edges branch
+
+I have a `newghc-edges` branch in git, trying a different approach.
+
+A `RawFilePath` contains only bytes, so it can actually be cast to a string,
+containing encoded characters. That string can then be 1) output in binary
+mode or 2) manipulated in ways that do not add characters larger than 255,
+and cast back to a `RawFilePath`. While not type-safe, such casts should at
+least help during bootstrapping, and might allow for a quick fix that only
+changes to `RawFilePath` at the edges.
+
+The branch contains an almost complete, although probably also buggy
+conversion using this method. It is missing wrappers for a
+few things like `readFile` and `writeFile` but otherwise seems to
+basically work.
+
+Is this a suitable approach for merging into `master`? It's nasty,
+being not type safe, having to reimplement/copy+modify random bits of
+libraries, etc. The nastiness is contained, though, in a single file,
+of only a few hundred lines of code. --[[Joey]]
----
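
The double encoding described under "analysis" and the cast used by the `newghc-edges` approach can be illustrated with a small standalone sketch. This is not code from either branch. `utf8Encode`, `castToString` and `castToRaw` are hypothetical helper names, and a plain `ByteString` stands in for `RawFilePath` (which the unix package defines as a synonym for `ByteString`). The hand-written encoder only approximates what GHC 7.4 does to a `String` filename, assuming a UTF-8 locale.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode each Char of a String as UTF-8, standing in
-- for what GHC 7.4 does to a FilePath before a system call (assumption:
-- the locale encoding is UTF-8).
utf8Encode :: String -> [Word8]
utf8Encode = concatMap enc
  where
    enc c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi 6, cont 0]
      | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        n = ord c
        -- leading bits of the code point, placed in the first byte
        hi s = fromIntegral (n `shiftR` s)
        -- continuation byte: 10xxxxxx carrying six bits of the code point
        cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- The on-disk name "fo¡" as raw UTF-8 bytes (decimal 102 111 194 161).
rawName :: B.ByteString
rawName = B.pack [102, 111, 194, 161]

-- Hypothetical casts: one Char per byte, no decoding. The round trip is
-- only safe while the String contains no Char above 255.
castToString :: B.ByteString -> String
castToString = B8.unpack

castToRaw :: String -> B.ByteString
castToRaw = B8.pack

main :: IO ()
main = do
  let s = castToString rawName           -- same as the literal "fo\194\161"
  print (B.unpack rawName)               -- [102,111,194,161]
  print (utf8Encode s)                   -- [102,111,195,130,194,161]
  print (castToRaw s == rawName)         -- True
```

The second list is the mangled name from the analysis, written there in octal as `\303\202\302\241`. The last line shows why the cast is usable at the edges: as long as nothing in between introduces a character above 255, packing the string again yields the original bytes.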