Merge branch 'master' of ssh://git-annex.branchable.com

author: Joey Hess <joey@kitenet.net> 2013-11-07 14:11:17 -0400
committer: Joey Hess <joey@kitenet.net> 2013-11-07 14:11:17 -0400
commit: 37d9a6a9f477ee66f810cd5f4d3320734fca0c11 (patch)
tree: 2ff54405c123717200528494cdb95c56d5bb762e
parent: 698594b8bf336e7d5115608e5e4ba8bd23587aa8 (diff)
parent: 45eb1c18804575d8cb4ac7c07d6734a2f9ebdcbd (diff)
9 files changed, 156 insertions, 0 deletions
diff --git a/doc/bugs/Mac_OS_10.9_GPG_error_adding_S3_repo/comment_1_d95accb43bd18cc9acbbf1d4069f86b3._comment b/doc/bugs/Mac_OS_10.9_GPG_error_adding_S3_repo/comment_1_d95accb43bd18cc9acbbf1d4069f86b3._comment
new file mode 100644
index 000000000..0f1e34e31
--- /dev/null
+++ b/doc/bugs/Mac_OS_10.9_GPG_error_adding_S3_repo/comment_1_d95accb43bd18cc9acbbf1d4069f86b3._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="https://www.google.com/accounts/o8/id?id=AItOawmZgZuUhZlHpd_AbbcixY0QQiutb2I7GWY"
+ nickname="Jimmy"
+ subject="S3 works without encryption"
+ date="2013-11-06T21:09:26Z"
+ content="""
+Not surprisingly, S3 repos work without encryption.
+"""]]
diff --git a/doc/devblog/day_50__grab_bag/comment_1_01846f6494fe843889391fd09fd127a0._comment b/doc/devblog/day_50__grab_bag/comment_1_01846f6494fe843889391fd09fd127a0._comment
new file mode 100644
index 000000000..1c1717180
--- /dev/null
+++ b/doc/devblog/day_50__grab_bag/comment_1_01846f6494fe843889391fd09fd127a0._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="https://me.yahoo.com/a/2grhJvAC049fJnvALDXek.6MRZMTlg--#eec89"
+ nickname="John"
+ subject="OS X builds"
+ date="2013-11-07T05:12:13Z"
+ content="""
+Joey, were you not interested in my offer of an OS X build server that I've posted elsewhere on this list, and also in e-mail to you?
+"""]]
diff --git a/doc/devblog/day_50__grab_bag/comment_2_12736014aa2c1af81e4b83072505e7d5._comment b/doc/devblog/day_50__grab_bag/comment_2_12736014aa2c1af81e4b83072505e7d5._comment
new file mode 100644
index 000000000..70d7b7cd1
--- /dev/null
+++ b/doc/devblog/day_50__grab_bag/comment_2_12736014aa2c1af81e4b83072505e7d5._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ ip="209.250.56.47"
+ subject="comment 2"
+ date="2013-11-07T16:01:00Z"
+ content="""
+John, must have missed that; can't see to find it anywhere..
+"""]]
diff --git a/doc/forum/Effectively_replicating_backup_files.mdwn b/doc/forum/Effectively_replicating_backup_files.mdwn
new file mode 100644
index 000000000..85d8f4b52
--- /dev/null
+++ b/doc/forum/Effectively_replicating_backup_files.mdwn
@@ -0,0 +1,25 @@
+I currently use duply/duplicity to back up my networked computers to my home server. I have two external HDDs and, every week or so, I bring one of these home and copy the backup files to the hard drive (I leave them on the server to easily restore files and because I have a large hard drive in that). I use some hand-written scripts to keep a copy these backup files in the cloud (Ubuntu One) until they have been copied to both hard drives, ensuring that there are always two copies of the files somewhere offsite. Out of paranoia, I also have some "standalone backups" that are just huge encrypted archives of important folders (say, my entire Photos directory) as at a certain date - in case for some reason duplicity ever stops working or I need to roll something back to a version years earlier. I am less worried about these standalone backups and manually keep one copy of each somewhere.
+
+It sounds like Git-Annex could automate things quite nicely (and give me some neat extras, like knowing where they were). This is how I understand I should do it, but please let me know if it is the right approach or if you have any suggestions:
+
+1. Create a folder on my server called "annex" and make a Git-Annex "large backup" repository in there.
+2. Create a folder within that called "archive" and put a "backup" folder within that. I understand that having the backups within an archive folder will mean that they aren't automatically copied to my desktop machines etc. 
+3. Within that "backup" folder, create two folders, one called "duplicity" and one called "standalone". Put the backups in the respective folders.
+4. Set up gcrypt Git-Annex repositories on my two external HDDs as "small backups". This seems to just start copying files across. That surprised me, as the files are in the archive folder and I thought the default was numcopies=1. Is there some autosync option that I need to turn off? Ideally, I would like it to encrypt/decrypt primarily with my server GPG key (which I'm not worried about copying around my computers), but also encrypt to my personal GPG key (where I'd only put my public key on the server, but I know I will not lose the secret key for that). Am I right that to do that I would need to set the repos up manually with:
+
+    git init --bare /mnt/externalHDD1
+
+    git annex initremote externalHDD1 type=gcrypt gitrepo=/mnt/externalHDD1 keyid=$serverkey keyid=$personalkey
+
+    git annex sync externalHDD1
+
+    Or should the gitrepo be the location of my main Git-Annex repository? How do I make it sync up with my other repos?
+
+5. I understand that I would then need to set numcopies=3 in a .gitattributes file in the "archive/backup/duplicity" directory and, say, a numcopies=2 in the "archive/backup/standalone".
+6. I could then add a cloud repository as a "transfer" repository and Git-Annex should only keep files on that that are not already in the right number of places (similar to what my scripts are doing now).
+7. I have recently upgraded my hard drive, so I have my old 1TB internal hard drive that I will be putting in a cupboard somewhere. I was thinking that I could make this an archive drive for things like one copy of my duplicity/standalone backups. I wouldn't want it to be the only copy of anything. If I just set it as an archive drive, would this work?
+8. Are there more clever ways of doing this? I consider my external HDDs and the cloud repo as "offsite" repositories and ideally there would always be one copy of my backups offsite (in addition to at least three overall). There would also ideally be one of each of my files "live" (in most cases my server) that could instantly push files into a cloud repo and then to wherever I am. Is there any ability to put repositories in groups and write rules like that?
+
+Any thoughts greatly appreciated!
+
+Aaron
diff --git a/doc/forum/Effectively_replicating_backup_files/comment_1_b1ab0da82db076c5244b0dcc95282ddd._comment b/doc/forum/Effectively_replicating_backup_files/comment_1_b1ab0da82db076c5244b0dcc95282ddd._comment
new file mode 100644
index 000000000..54f7f782a
--- /dev/null
+++ b/doc/forum/Effectively_replicating_backup_files/comment_1_b1ab0da82db076c5244b0dcc95282ddd._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ ip="209.250.56.47"
+ subject="comment 1"
+ date="2013-11-07T16:42:14Z"
+ content="""
+This all sounds reasonable to me.
+
+gitrepo= is the location of the repository you are setting up with gcrypt, not your existing git-annex repository.
+
+numcopies configures the lower bound of the number of copies, not the upper bound. There's no \"small backup\" type, you must mean either small archive, or incremental backup. Either type of repository is going to want to have files that have not previously been archived, or backed up.
+
+You might eventually want to write your own [[preferred_content]] expressions to handle offsite repositories. I'd recommend starting simple and building up.
+"""]]
diff --git a/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__.mdwn b/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__.mdwn
new file mode 100644
index 000000000..a9db915da
--- /dev/null
+++ b/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__.mdwn
@@ -0,0 +1,45 @@
+Hi,
+
+Some time ago I asked [[here|git_annex_copy_--fast_--to_blah_much_slower_than_--from_blah]]
+about possible improvements in git `copy --fast --to`, since it was painfully slow
+on moderately large repos.
+
+Now I found a way to make it much faster for my particular use case, by
+accessing some annex internals. And I realized that maybe commands like `git
+annex find --in=repo` do not batch queries to the location log. This is based on
+the following timings, on a new repo (only a few commits) and about 30k files.
+
+    > time git annex find --in=skynet > /dev/null
+
+    real	0m55.838s
+    user	0m30.000s
+    sys	0m1.583s
+
+
+    > time git ls-tree -r git-annex | cut -d ' ' -f 3 | cut -f 1 | git cat-file --batch > /dev/null
+
+    real	0m0.334s
+    user	0m0.517s
+    sys	0m0.030s
+
+Those numbers are on linux (with an already warm file cache) and an ext4 filesystem on a SSD.
+
+The second command above is feeding a list of objects to a single `git cat-file`
+process that cats them all to stdout, preceeding every file dump by the object
+being cat-ed. It is a trivial matter to parse this output and use it for
+whatever annex needs.
+
+Above I wrote a `git ls-tree` on the git-annex branch for simplicity, but we could
+just as well do a `ls-tree | ... | git cat-file` on HEAD to get the keys for the
+annexed files matching some path and then feed those keys to a cat-file on
+the git-annex branch. And this still would be an order of magnitude faster than
+what currently annex seems to do.
+
+I'm assuming the bottleneck is in that annex does not batch the `cat-file`, as the rest of logic needed for a find will be fast. Is that right? 
+
+Now, if the queries to the location log for `copy --to` and `find` could be batched this way, the
+performance of several useful things to do, like checking how many annexed files
+we are missing, would be bastly improved. Hell, I could even put that number on
+the command line prompt!
+
+I'm not yet very fluent in Haskell, but I'm willing to help if this is something that makes sense and can be done.
diff --git a/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__/comment_1_01cbfc513c790faef3a3ede5315d3589._comment b/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__/comment_1_01cbfc513c790faef3a3ede5315d3589._comment
new file mode 100644
index 000000000..6b4a6b2a0
--- /dev/null
+++ b/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__/comment_1_01cbfc513c790faef3a3ede5315d3589._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ ip="209.250.56.47"
+ subject="comment 1"
+ date="2013-11-06T22:21:33Z"
+ content="""
+git-annex uses cat-file --batch, yes. You can verify this with --debug. Or you can read Annex/CatFile.hs and Git/CatFile.hs
+
+git-annex has to ensure that the git-annex branch is up-to-date and that any info synced into the repository is merged into it. This can require several calls to git log. Your command does not do that. git-annex find also runs `git ls-files --cached`, which has to examine the state of the index and of files on disk, in order to only show files that are in the working tree. Your command also omits that.
+"""]]
diff --git a/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__/comment_2_fe28dfb360caa12d5d5bc186def3eb45._comment b/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__/comment_2_fe28dfb360caa12d5d5bc186def3eb45._comment
new file mode 100644
index 000000000..4ba4c8264
--- /dev/null
+++ b/doc/forum/_Does_git_annex_find___40____38___friends__41___batch_queries_to_the_location_log__63__/comment_2_fe28dfb360caa12d5d5bc186def3eb45._comment
@@ -0,0 +1,35 @@
+[[!comment format=mdwn
+ username="https://www.google.com/accounts/o8/id?id=AItOawkkyBDsfOB7JZvPZ4a8F3rwv0wk6Nb9n48"
+ nickname="Abdó"
+ subject="comment 2"
+ date="2013-11-06T23:14:03Z"
+ content="""
+Ok, then I don't understand where annex spends its time. git annex takes 55
+seconds! vs less than a second for a batched query on all the keys in the
+location log. Checking that branches are in sync, or traversing the working dir
+shouldn't amount the extra 54 seconds! At least not on a recently synced repo
+with up to date index and clean working dir.
+
+> git-annex has to ensure that the git-annex branch is up-to-date and that any info synced into the repository is merged into it. This can require several calls to git log
+
+Ok, I understand that. Checking that should be typically fast though, isn't it? On a repo that has just been synced, it doesn't need to go very far on the log.
+
+> git-annex find also runs git ls-files --cached, which has to examine the state of the index and of files on disk, in order to only show files that are in the working tree
+
+I understand that too. For my particular use case, I know I do the `git copy` when the
+repo is in sync and the working dir has no uncommited changes. So I use HEAD to retrieve the keys for
+the files in the working tree. I do something like that:
+
+    time git ls-tree -r HEAD | grep -e '^120000' | cut -d ' ' -f 3 | cut -f 1 | git cat-file --batch > /dev/null
+
+    real	0m0.178s
+    user	0m0.277s
+    sys 	0m0.037s
+
+That plus some fast parsing of the output gets the list of keys for the files in HEAD in less than a second. Where do the 54 extra seconds hide, then?
+
+Mm... how does annex retrieve the keys for files in the working tree? Does it follow
+the actual symlinks on the filesystem? I can believe that following 30k symlinks may be slow (although not 55 second slow).
+
+Sorry for being so insistent on this... It is just that I do think the same can be done much faster, and such an improvement in performance would be very interesting, not only for me.
+"""]]
diff --git a/doc/todo/wishlist:_detection_of_merge_conflicts.mdwn b/doc/todo/wishlist:_detection_of_merge_conflicts.mdwn
new file mode 100644
index 000000000..1824a91ee
--- /dev/null
+++ b/doc/todo/wishlist:_detection_of_merge_conflicts.mdwn
@@ -0,0 +1,3 @@
+A conflict during sync or merge is something that requires user intervention, or at least notification. For that reason it would be nice if git annex returned a nonzero exit status when such a conflict happened during a sync or a merge. This is what git does after a conflicting pull, and would make it easier to spot a conflict in automated syncs without having to parse annex output or the logs.
+
+Also, it would be nice if your new `git annex status` were able to inform about remaining conflicts in the repo, for instance by reporting files with variant-XXX suffix.
author	Joey Hess <joey@kitenet.net>	2013-11-07 14:11:17 -0400
committer	Joey Hess <joey@kitenet.net>	2013-11-07 14:11:17 -0400
commit	37d9a6a9f477ee66f810cd5f4d3320734fca0c11 (patch)
tree	2ff54405c123717200528494cdb95c56d5bb762e
parent	698594b8bf336e7d5115608e5e4ba8bd23587aa8 (diff)
parent	45eb1c18804575d8cb4ac7c07d6734a2f9ebdcbd (diff)