Fix storing of filenames of v6 unlocked files when the filename is not representable in the current locale.

This is a mostly backwards-compatible change. I broke backwards compatibility in the case where a filename starts with a double-quote. That seems likely to be very rare, and v6 unlocked files are a new feature anyway, and fsck needs to fix missing associated file mappings anyway. So, I decided that is good enough.

The encoding used is to just show the String when it contains a problem character. While that adds some overhead to addAssociatedFile and removeAssociatedFile, those are not called very often. This approach has minimal decode overhead, because most filenames won't be encoded that way, and it only has to look for the leading double-quote to skip the expensive read. So, getAssociatedFiles remains fast.

I did consider using ByteString instead, but getting a FilePath converted with all chars intact, even surrogates, is difficult, and it looks like instance PersistField ByteString uses Text, which I don't trust for badly encoded data. It would probably be slower too, and it would make the database less easy to inspect manually.
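
The encoding described above can be sketched roughly as follows. This is only an illustration, not the code from this commit: the names isProblemChar, encodeName and decodeName are hypothetical, treating only surrogates as problem characters is an assumption, and falling back to the raw string when read fails is also an assumption.

```haskell
import Data.Char (GeneralCategory (Surrogate), generalCategory)
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Hypothetical predicate: characters assumed not to survive the SQL layer.
-- Only surrogates are checked here; the real set may be wider.
isProblemChar :: Char -> Bool
isProblemChar c = generalCategory c == Surrogate

-- Store the name as-is unless it contains a problem character or already
-- starts with a double-quote; in those cases store the output of show,
-- which is pure ASCII and always starts with a double-quote.
encodeName :: FilePath -> String
encodeName s
  | startsWithQuote s || any isProblemChar s = show s
  | otherwise = s
  where
    startsWithQuote ('"' : _) = True
    startsWithQuote _ = False

-- Only values with a leading double-quote pay for the expensive read;
-- everything else is returned unchanged.
-- Assumption: if read fails, the raw stored value is returned.
decodeName :: String -> FilePath
decodeName s@('"' : _) = fromMaybe s (readMaybe s)
decodeName s = s
```

Under these assumptions, a name such as "test_\56515\56502" is stored as an escaped ASCII literal, while an ordinary name is stored unchanged. A name written by an older version that happens to start with a double-quote is the case where backwards compatibility is lost.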
author: Joey Hess <joeyh@joeyh.name> 2016-02-14 16:37:25 -0400
committer: Joey Hess <joeyh@joeyh.name> 2016-02-14 16:37:25 -0400
commit: f9dfeaf801da2e4d5879b3de5895dc3cef68a329 (patch)
tree: 56c23f16087ccd475e66a66d2b98abbec4e04e07 /doc
parent: 3c72634bb790eac407291b579afb4f501b3a4a11 (diff)
1 files changed, 31 insertions, 0 deletions
diff --git a/doc/bugs/__39__git_annex_get__39___fails_for_unlocked_files_with_special_characters___40__e.g._umlauts__41___when_using_precompiled_version_6.20160126-g2336107_/comment_1_8d6bdb32884cb80e444c7739c743c9de._comment b/doc/bugs/__39__git_annex_get__39___fails_for_unlocked_files_with_special_characters___40__e.g._umlauts__41___when_using_precompiled_version_6.20160126-g2336107_/comment_1_8d6bdb32884cb80e444c7739c743c9de._comment
new file mode 100644
index 000000000..067182f18
--- /dev/null
+++ b/doc/bugs/__39__git_annex_get__39___fails_for_unlocked_files_with_special_characters___40__e.g._umlauts__41___when_using_precompiled_version_6.20160126-g2336107_/comment_1_8d6bdb32884cb80e444c7739c743c9de._comment
@@ -0,0 +1,31 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2016-02-14T19:19:46Z"
+ content="""
+Reproduced using LANG=C.
+
+This is a problem with the filename stored in the keys db. In the first
+repo, it has:
+
+	VALUES(1,'SHA256E-s8--d1d0c59000f7c0d71485b051c9ca3f25f7afa84f0be5fea98fe1e12f3f898f44','test_öüä');
+
+However, in the clone:
+
+	VALUES(1,'SHA256E-s8--d1d0c59000f7c0d71485b051c9ca3f25f7afa84f0be5fea98fe1e12f3f898f44','test_������');
+
+So, it's lost the correct filename there. Since it doesn't
+find the file with the messed up name, it doesn't replace the file content.
+
+The problem is not with decoding git's C-style character encoding; that
+happens ok yielding `"test_\56515\56502\56515\56508\56515\56484"`. 
+But, that does not seem to get stored in the database correctly.
+
+Seems that these unicode surrogates are not handled by the sqlite layer.
+The surrogates are being used because LANG=C does not support
+unicode. This could also happen when in a (working) utf-8 locale, when
+the filename is not utf-8 encoded.
+
+So, need to escape strings containing such surrogates before passing to
+SQL. In a backwards-compatible way. Done.
+"""]]