author Joey Hess <joey@kitenet.net> 2012-07-25 23:18:39 -0400
committer Joey Hess <joey@kitenet.net> 2012-07-25 23:18:39 -0400
commit abe5a73d3f50edc679cd990c0e8e27c36b775d29
tree 84c14cf012e9dfb9f061428ccfd1f752c7e07c37
parent 1ffef3ad75e51b7f66c4ffdd0935a0495042e5ae
parent 3a02c7b635fc1017c05874b8a6f54a91a587651d
Merge branch 'master' into assistant
-rw-r--r-- doc/design/assistant/blog/day_43__simple_scanner.mdwn 37
-rw-r--r-- doc/design/assistant/syncing.mdwn 100
-rw-r--r-- doc/forum/Fixing_up_corrupt_annexes.mdwn 10
-rw-r--r-- doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment 7
-rw-r--r-- doc/tips/what_to_do_when_you_lose_a_repository.mdwn 2
5 files changed, 109 insertions, 47 deletions
diff --git a/doc/design/assistant/blog/day_43__simple_scanner.mdwn b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
new file mode 100644
index 000000000..11ee3cca4
--- /dev/null
+++ b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
@@ -0,0 +1,37 @@
+Milestone: I can run `git annex assistant`, plug in a USB drive, and it
+automatically transfers files to get the USB drive and current repo back in
+sync.
+
+I decided to implement the naive scan, to find files needing to be
+transferred. So it walks through `git ls-files` and checks each file
+in turn. I've deferred less expensive, more sophisticated approaches to later.
+
+I did some work on the TransferQueue, which now keeps track of the length
+of the queue, and can block attempts to add Transfers to it if it gets too
+long. This was a nice use of STM, which let me implement that without using
+any locking.
+
+[[!format haskell """
+atomically $ do
+    sz <- readTVar (queuesize q)
+    if sz <= wantsz
+        then enqueue schedule q t (stubInfo f remote)
+        else retry -- blocks until queuesize changes
+"""]]
+
+Anyway, the point was that, as the scan finds Transfers to do,
+it doesn't build up a really long TransferQueue, but instead is blocked
+from running further until some of the files get transferred. The resulting
+interleaving of the scan thread with transfer threads means that transfers
+start fairly quickly upon a USB drive being plugged in, and kind of hides
+the inefficiencies of the scanner, which will most of the time be
+swamped out by the IO bound large data transfers.
+
+---
+
+At this point, the assistant should do a good job of keeping repositories
+in sync, as long as they're all interconnected, or on removable media
+like USB drives. There's lots more work to be done to handle use cases
+where repositories are not well-connected, but since the assistant's
+[[syncing]] now covers at least a couple of use cases, I'm ready to move
+on to the next phase. [[Webapp]], here we come!
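
The blocking behaviour the post describes can be tried outside git-annex with a small standalone program. This is only a sketch under stated assumptions: `BoundedQueue`, `newBoundedQueue`, `enqueueBlocking` and `dequeue` are hypothetical names invented for illustration, not the assistant's actual TransferQueue API, and the limit here is a hard cap (`<`) rather than the `sz <= wantsz` test in the post.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM)

-- Hypothetical stand-in for the assistant's TransferQueue:
-- a FIFO queue plus a counter tracking its current length.
data BoundedQueue a = BoundedQueue
    { queueItems :: TQueue a
    , queueSize  :: TVar Int
    , queueLimit :: Int
    }

newBoundedQueue :: Int -> IO (BoundedQueue a)
newBoundedQueue limit =
    BoundedQueue <$> newTQueueIO <*> newTVarIO 0 <*> pure limit

-- Adds an item, or blocks via 'retry' while the queue is full.
-- STM re-runs the transaction once queueSize changes.
enqueueBlocking :: BoundedQueue a -> a -> STM ()
enqueueBlocking q x = do
    sz <- readTVar (queueSize q)
    if sz < queueLimit q
        then do
            writeTQueue (queueItems q) x
            writeTVar (queueSize q) (sz + 1)
        else retry

-- Removes the oldest item (blocking while empty) and shrinks the counter,
-- which wakes any producer waiting in 'enqueueBlocking'.
dequeue :: BoundedQueue a -> STM a
dequeue q = do
    x <- readTQueue (queueItems q)
    modifyTVar' (queueSize q) (subtract 1)
    return x

main :: IO ()
main = do
    q <- newBoundedQueue 2
    -- The "scanner" thread can never get more than two items ahead
    -- of the "transfer" loop below.
    _ <- forkIO $ forM_ [1 .. 5 :: Int] $ atomically . enqueueBlocking q
    xs <- replicateM 5 (atomically (dequeue q))
    print xs -- prints [1,2,3,4,5]
```

No locks appear anywhere: the producer simply calls `retry` when the queue is full, and the runtime wakes it when a consumer changes `queueSize`.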
diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index cc23f786f..4d7d70022 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -5,53 +5,14 @@ all the other git clones, at both the git level and the key/value level.
* At startup, and possibly periodically, or when the network connection
changes, or some heuristic suggests that a remote was disconnected from
- us for a while, queue remotes for processing by the TransferScanner,
- to queue Transfers of files it or we're missing.
-* After git sync, identify content that we don't have that is now available
+ us for a while, queue remotes for processing by the TransferScanner.
+* Ensure that when a remote receives content, and updates its location log,
+ it syncs that update back out. Prerequisite for:
+* After git sync, identify new content that we don't have that is now available
on remotes, and transfer. (Needed when we have a uni-directional connection
- to a remote, so it won't be uploading content to us.)
- But first, need to ensure that when a remote
- receives content, and updates its location log, it syncs that update
- out.
-
-## TransferScanner
-
-The TransferScanner thread needs to find keys that need to be Uploaded
-to a remote, or Downloaded from it.
-
-How to find the keys to transfer? I'd like to avoid potentially
-expensive traversals of the whole git working copy if I can.
-
-One way would be to do a git diff between the (unmerged) git-annex branches
-of the git repo, and its remote. Parse that for lines that add a key to
-either, and queue transfers. That should work fairly efficiently when the
-remote is a git repository. Indeed, git-annex already does such a diff
-when it's doing a union merge of data into the git-annex branch. It
-might even be possible to have the union merge and scan use the same
-git diff data.
-
-But that approach has several problems:
-
-1. The list of keys it would generate wouldn't have associated git
- filenames, so the UI couldn't show the user what files were being
- transferred.
-2. Worse, without filenames, any later features to exclude
- files/directories from being transferred wouldn't work.
-3. Looking at a git diff of the git-annex branches would find keys
- that were added to either side while the two repos were disconnected.
- But if the two repos' keys were not fully in sync before they
- disconnected (which is quite possible; transfers could be incomplete),
- the diff would not show those older out of sync keys.
-
-The remote could also be a special remote. In this case, I have to either
-traverse the git working copy, or perhaps traverse the whole git-annex
-branch (which would have the same problems with filenames not being
-available).
-
-If a traversal is done, should check all remotes, not just
-one. Probably worth handling the case where a remote is connected
-while in the middle of such a scan, so part of the scan needs to be
-redone to check it.
+ to a remote, so it won't be uploading content to us.) Note: Does not
+ need to use the TransferScanner, if we get and check a list of the changed
+ files.
## longer-term TODO
@@ -75,6 +36,12 @@ redone to check it.
* speed up git syncing by using the cached ssh connection for it too
(will need to use `GIT_SSH`, which needs to point to a command to run,
not a shell command line)
+* Map the network of git repos, and use that map to calculate
+ optimal transfers to keep the data in sync. Currently a naive flood fill
+ is done instead.
+* Find a more efficient way for the TransferScanner to find the transfers
+ that need to be done to sync with a remote. Currently it walks the git
+ working copy and checks each file.
## misc todo
@@ -105,6 +72,47 @@ reachable remote. This is worth doing first, since it's the simplest way to
get the basic functionality of the assistant to work. And we'll need this
anyway.
+## TransferScanner
+
+The TransferScanner thread needs to find keys that need to be Uploaded
+to a remote, or Downloaded from it.
+
+How to find the keys to transfer? I'd like to avoid potentially
+expensive traversals of the whole git working copy if I can.
+(Currently, the TransferScanner does do the naive and possibly expensive
+scan of the git working copy.)
+
+One way would be to do a git diff between the (unmerged) git-annex branches
+of the git repo, and its remote. Parse that for lines that add a key to
+either, and queue transfers. That should work fairly efficiently when the
+remote is a git repository. Indeed, git-annex already does such a diff
+when it's doing a union merge of data into the git-annex branch. It
+might even be possible to have the union merge and scan use the same
+git diff data.
+
+But that approach has several problems:
+
+1. The list of keys it would generate wouldn't have associated git
+ filenames, so the UI couldn't show the user what files were being
+ transferred.
+2. Worse, without filenames, any later features to exclude
+ files/directories from being transferred wouldn't work.
+3. Looking at a git diff of the git-annex branches would find keys
+ that were added to either side while the two repos were disconnected.
+ But if the two repos' keys were not fully in sync before they
+ disconnected (which is quite possible; transfers could be incomplete),
+ the diff would not show those older out of sync keys.
+
+The remote could also be a special remote. In this case, I have to either
+traverse the git working copy, or perhaps traverse the whole git-annex
+branch (which would have the same problems with filenames not being
+available).
+
+If a traversal is done, should check all remotes, not just
+one. Probably worth handling the case where a remote is connected
+while in the middle of such a scan, so part of the scan needs to be
+redone to check it.
+
## done
1. Can use `git annex sync`, which already handles bidirectional syncing.
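
The naive scan described above (walk the working copy and check each file) can be modelled as a pure function. This is a rough sketch, not the TransferScanner's real code: the `Key`, `Direction` and `Transfer` types and the `naiveScan` function are hypothetical, and plain sets of keys stand in for the location log lookups git-annex would actually perform.

```haskell
import qualified Data.Set as Set

-- Hypothetical simplifications of git-annex's real types.
type Key = String

data Direction = Upload | Download
    deriving (Eq, Show)

data Transfer = Transfer
    { transferDirection :: Direction
    , transferFile      :: FilePath
    , transferKey       :: Key
    } deriving (Eq, Show)

-- Walk every (file, key) pair from the working copy and compare which
-- side has the content. Assumed inputs: the set of keys present locally,
-- the set present on the remote, and the list of annexed files.
naiveScan :: Set.Set Key -> Set.Set Key -> [(FilePath, Key)] -> [Transfer]
naiveScan here there = concatMap check
  where
    check (f, k) = case (k `Set.member` here, k `Set.member` there) of
        (True, False) -> [Transfer Upload f k]   -- remote is missing it
        (False, True) -> [Transfer Download f k] -- we are missing it
        _             -> []                      -- in sync, or nobody has it

main :: IO ()
main = mapM_ print (naiveScan here there files)
  where
    here  = Set.fromList ["k1", "k2"]
    there = Set.fromList ["k2", "k3"]
    files = [("a.txt", "k1"), ("b.txt", "k2"), ("c.txt", "k3"), ("d.txt", "k4")]
```

Because the scan starts from filenames, each queued transfer carries its file, which avoids the first two problems listed for the git diff approach. The cost is that every file is visited, even when nothing needs transferring.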
diff --git a/doc/forum/Fixing_up_corrupt_annexes.mdwn b/doc/forum/Fixing_up_corrupt_annexes.mdwn
new file mode 100644
index 000000000..be6beeca8
--- /dev/null
+++ b/doc/forum/Fixing_up_corrupt_annexes.mdwn
@@ -0,0 +1,10 @@
+I was wondering how does one recover from...
+
+<pre>
+(Recording state in git...)
+error: invalid object 100644 8f154c946adc039af5240cc650a0a95c840e6fa6 for '041/5a4/SHA256-s6148--7ddcf853e4b16e77ab8c3c855c46867e6ed61c7089c334edf98bbdd3fb3a89ba.log'
+fatal: git-write-tree: error building trees
+git-annex: failed to read sha from git write-tree
+</pre>
+
+The above was caught when I ran "git annex fsck --fast" to check my stash of files.
diff --git a/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment b/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment
new file mode 100644
index 000000000..335cbb51d
--- /dev/null
+++ b/doc/forum/Fixing_up_corrupt_annexes/comment_1_cea21f96bcfb56aaab7ea03c1c804d2d._comment
@@ -0,0 +1,7 @@
+[[!comment format=mdwn
+ username="http://joeyh.name/"
+ subject="comment 1"
+ date="2012-07-24T22:00:35Z"
+ content="""
+This is a corrupt git repository. See [[tips/what_to_do_when_a_repository_is_corrupted]]
+"""]]
diff --git a/doc/tips/what_to_do_when_you_lose_a_repository.mdwn b/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
index 3be13b8ab..363eeea4e 100644
--- a/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
+++ b/doc/tips/what_to_do_when_you_lose_a_repository.mdwn
@@ -16,4 +16,4 @@ are present.
If you later found the drive, you could let git-annex know it's found
like so:
- git annex semitrusted usbdrive
+ git annex semitrust usbdrive