From bd2b388fd8c668ed6fd031d0ed8a7edf3c7b67ee Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:07:41 -0400
Subject: update

---
 doc/design/assistant/syncing.mdwn | 100 ++++++++++++++++++++------------------
 1 file changed, 54 insertions(+), 46 deletions(-)

diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index f04f20218..3aeb76afc 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -5,53 +5,14 @@ all the other git clones, at both the git level and the key/value level.
 * At startup, and possibly periodically, or when the network connection
   changes, or some heuristic suggests that a remote was disconnected from
-  us for a while, queue remotes for processing by the TransferScanner,
-  to queue Transfers of files it or we're missing.
-* After git sync, identify content that we don't have that is now available
+  us for a while, queue remotes for processing by the TransferScanner.
+* Ensure that when a remote receives content, and updates its location log,
+  it syncs that update back out. Prerequisite for:
+* After git sync, identify new content that we don't have that is now available
   on remotes, and transfer. (Needed when we have a uni-directional connection
-  to a remote, so it won't be uploading content to us.)
-  But first, need to ensure that when a remote
-  receives content, and updates its location log, it syncs that update
-  out.
-
-## TransferScanner
-
-The TransferScanner thread needs to find keys that need to be Uploaded
-to a remote, or Downloaded from it.
-
-How to find the keys to transfer? I'd like to avoid potentially
-expensive traversals of the whole git working copy if I can.
-
-One way would be to do a git diff between the (unmerged) git-annex branches
-of the git repo, and its remote. Parse that for lines that add a key to
-either, and queue transfers. That should work fairly efficiently when the
-remote is a git repository. Indeed, git-annex already does such a diff
-when it's doing a union merge of data into the git-annex branch. It
-might even be possible to have the union merge and scan use the same
-git diff data.
-
-But that approach has several problems:
-
-1. The list of keys it would generate wouldn't have associated git
-   filenames, so the UI couldn't show the user what files were being
-   transferred.
-2. Worse, without filenames, any later features to exclude
-   files/directories from being transferred wouldn't work.
-3. Looking at a git diff of the git-annex branches would find keys
-   that were added to either side while the two repos were disconnected.
-   But if the two repos' keys were not fully in sync before they
-   disconnected (which is quite possible; transfers could be incomplete),
-   the diff would not show those older out of sync keys.
-
-The remote could also be a special remote. In this case, I have to either
-traverse the git working copy, or perhaps traverse the whole git-annex
-branch (which would have the same problems with filenames not being
-available).
-
-If a traversal is done, it should check all remotes, not just
-one. Probably worth handling the case where a remote is connected
-while in the middle of such a scan, so part of the scan needs to be
-redone to check it.
+  to a remote, so it won't be uploading content to us.) Note: Does not
+  need to use the TransferScanner, if we get and check a list of the changed
+  files.
 
 ## longer-term TODO
 
@@ -75,6 +36,12 @@ redone to check it.
 * speed up git syncing by using the cached ssh connection for it too
   (will need to use `GIT_SSH`, which needs to point to a command to run,
   not a shell command line)
+* Map the network of git repos, and use that map to calculate
+  optimal transfers to keep the data in sync. Currently a naive flood fill
+  is done instead.
+* Find a more efficient way for the TransferScanner to find the transfers
+  that need to be done to sync with a remote. Currently it walks the git
+  working copy and checks each file.
 
 ## data syncing
 
@@ -99,6 +66,47 @@ reachable remote.
 This is worth doing first, since it's the simplest way to get the basic
 functionality of the assistant to work. And we'll need this anyway.
 
+## TransferScanner
+
+The TransferScanner thread needs to find keys that need to be Uploaded
+to a remote, or Downloaded from it.
+
+How to find the keys to transfer? I'd like to avoid potentially
+expensive traversals of the whole git working copy if I can.
+(Currently, the TransferScanner does do the naive and possibly expensive
+scan of the git working copy.)
+
+One way would be to do a git diff between the (unmerged) git-annex branches
+of the git repo, and its remote. Parse that for lines that add a key to
+either, and queue transfers. That should work fairly efficiently when the
+remote is a git repository. Indeed, git-annex already does such a diff
+when it's doing a union merge of data into the git-annex branch. It
+might even be possible to have the union merge and scan use the same
+git diff data.
+
+But that approach has several problems:
+
+1. The list of keys it would generate wouldn't have associated git
+   filenames, so the UI couldn't show the user what files were being
+   transferred.
+2. Worse, without filenames, any later features to exclude
+   files/directories from being transferred wouldn't work.
+3. Looking at a git diff of the git-annex branches would find keys
+   that were added to either side while the two repos were disconnected.
+   But if the two repos' keys were not fully in sync before they
+   disconnected (which is quite possible; transfers could be incomplete),
+   the diff would not show those older out of sync keys.
+
+The remote could also be a special remote. In this case, I have to either
+traverse the git working copy, or perhaps traverse the whole git-annex
+branch (which would have the same problems with filenames not being
+available).
+
+If a traversal is done, it should check all remotes, not just
+one. Probably worth handling the case where a remote is connected
+while in the middle of such a scan, so part of the scan needs to be
+redone to check it.
+
 ## done
 
 1. Can use `git annex sync`, which already handles bidirectional syncing.
-- 
cgit v1.2.3


From 2e085c6383f096a58d1e9b52ae457f9491850c7f Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:31:26 -0400
Subject: blog for the day

---
 .../assistant/blog/day_43__simple_scanner.mdwn | 37 ++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 doc/design/assistant/blog/day_43__simple_scanner.mdwn

diff --git a/doc/design/assistant/blog/day_43__simple_scanner.mdwn b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
new file mode 100644
index 000000000..11ee3cca4
--- /dev/null
+++ b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
@@ -0,0 +1,37 @@
+Milestone: I can run `git annex assistant`, plug in a USB drive, and it
+automatically transfers files to get the USB drive and current repo back in
+sync.
+
+I decided to implement the naive scan, to find files needing to be
+transferred. So it walks through `git ls-files` and checks each file
+in turn. I've deferred less expensive, more sophisticated approaches to later.
+
+I did some work on the TransferQueue, which now keeps track of the length
+of the queue, and can block attempts to add Transfers to it if it gets too
+long. This was a nice use of STM, which let me implement that without using
+any locking.
+
+[[!format haskell """
+atomically $ do
+    sz <- readTVar (queuesize q)
+    if sz <= wantsz
+        then enqueue schedule q t (stubInfo f remote)
+        else retry -- blocks until queuesize changes
+"""]]
+
+Anyway, the point was that, as the scan finds Transfers to do,
+it doesn't build up a really long TransferQueue, but instead is blocked
+from running further until some of the files get transferred. The resulting
+interleaving of the scan thread with transfer threads means that transfers
+start fairly quickly upon a USB drive being plugged in, and kind of hides
+the inefficiencies of the scanner, which will most of the time be
+swamped out by the IO-bound large data transfers.
+
+---
+
+At this point, the assistant should do a good job of keeping repositories
+in sync, as long as they're all interconnected, or on removable media
+like USB drives. There's lots more work to be done to handle use cases
+where repositories are not well-connected, but since the assistant's
+[[syncing]] now covers at least a couple of use cases, I'm ready to move
+on to the next phase. [[Webapp]], here we come!
-- 
cgit v1.2.3
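
Two small sketches may help make the notes above concrete. First, the day 43
post quotes only the queue-throttling fragment of the real code, so here is a
minimal, self-contained illustration of the pattern it describes: a scanner
thread walks `git ls-files` and feeds a bounded queue, blocking via STM
`retry` whenever the queue is full, so that slow transfer workers naturally
throttle the scan. The queue bound, the list-backed queue, and the stand-in
transfer action are invented for the example; this is not git-annex's actual
TransferQueue code.

[[!format haskell """
-- Toy model of the scanner/TransferQueue interleaving described above.
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM
import Control.Monad (forM_, forever, replicateM_)
import System.Process (readProcess)

maxQueued :: Int
maxQueued = 10  -- block the scanner once this many items are waiting

-- Add an item, but only while the queue is below maxQueued;
-- retry blocks the caller until a worker shrinks the queue.
enqueueBlocking :: TVar [FilePath] -> FilePath -> STM ()
enqueueBlocking q f = do
    fs <- readTVar q
    if length fs < maxQueued
        then writeTVar q (fs ++ [f])
        else retry

-- Remove one item, blocking while the queue is empty.
dequeueBlocking :: TVar [FilePath] -> STM FilePath
dequeueBlocking q = do
    fs <- readTVar q
    case fs of
        [] -> retry
        (f:rest) -> writeTVar q rest >> return f

main :: IO ()
main = do
    q <- newTVarIO []
    -- A couple of "transfer" workers; the delay stands in for real IO.
    replicateM_ 2 $ forkIO $ forever $ do
        f <- atomically (dequeueBlocking q)
        putStrLn ("transferring " ++ f)
        threadDelay 100000
    -- The naive scan: every file in the working tree, in turn.
    files <- lines <$> readProcess "git" ["ls-files"] ""
    forM_ files $ \f -> atomically (enqueueBlocking q f)
    threadDelay 2000000 -- let the workers drain before this toy exits
"""]]

Second, the TransferScanner notes in the first patch suggest diffing the
local and remote git-annex branches instead of walking the working copy. A
rough sketch of that idea follows; it assumes each key's location log is
stored in the git-annex branch as a `<key>.log` file and that the remote's
branch has been fetched as `refs/remotes/<remote>/git-annex`, and, as the
notes point out, it yields keys with no associated file names.

[[!format haskell """
-- Sketch: keys whose location logs differ between the local and remote
-- git-annex branches are candidates for upload or download.
import Data.List (isSuffixOf)
import System.FilePath (dropExtension, takeFileName)
import System.Process (readProcess)

changedKeys :: String -> IO [String]
changedKeys remote = do
    out <- readProcess "git"
        ["diff", "--name-only", "git-annex",
         "refs/remotes/" ++ remote ++ "/git-annex"]
        ""
    return
        [ dropExtension (takeFileName p)  -- strip ".log" to get the key
        | p <- lines out
        , ".log" `isSuffixOf` p
        ]

main :: IO ()
main = changedKeys "origin" >>= mapM_ putStrLn
"""]]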