From bd2b388fd8c668ed6fd031d0ed8a7edf3c7b67ee Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:07:41 -0400
Subject: update

---
 doc/design/assistant/syncing.mdwn | 100 ++++++++++++++++++++------------------
 1 file changed, 54 insertions(+), 46 deletions(-)

diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index f04f20218..3aeb76afc 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -5,53 +5,14 @@ all the other git clones, at both the git level and the key/value level.
 * At startup, and possibly periodically, or when the network connection
   changes, or some heuristic suggests that a remote was disconnected from
-  us for a while, queue remotes for processing by the TransferScanner,
-  to queue Transfers of files it or we're missing.
-* After git sync, identify content that we don't have that is now available
+  us for a while, queue remotes for processing by the TransferScanner.
+* Ensure that when a remote receives content, and updates its location log,
+  it syncs that update back out. Prerequisite for:
+* After git sync, identify new content that we don't have that is now available
   on remotes, and transfer. (Needed when we have a uni-directional connection
-  to a remote, so it won't be uploading content to us.)
-  But first, need to ensure that when a remote
-  receives content, and updates its location log, it syncs that update
-  out.
-
-## TransferScanner
-
-The TransferScanner thread needs to find keys that need to be Uploaded
-to a remote, or Downloaded from it.
-
-How to find the keys to transfer? I'd like to avoid potentially
-expensive traversals of the whole git working copy if I can.
-
-One way would be to do a git diff between the (unmerged) git-annex branches
-of the git repo, and its remote. Parse that for lines that add a key to
-either, and queue transfers. That should work fairly efficiently when the
-remote is a git repository. Indeed, git-annex already does such a diff
-when it's doing a union merge of data into the git-annex branch. It
-might even be possible to have the union merge and scan use the same
-git diff data.
-
-But that approach has several problems:
-
-1. The list of keys it would generate wouldn't have associated git
-   filenames, so the UI couldn't show the user what files were being
-   transferred.
-2. Worse, without filenames, any later features to exclude
-   files/directories from being transferred wouldn't work.
-3. Looking at a git diff of the git-annex branches would find keys
-   that were added to either side while the two repos were disconnected.
-   But if the two repos' keys were not fully in sync before they
-   disconnected (which is quite possible; transfers could be incomplete),
-   the diff would not show those older out of sync keys.
-
-The remote could also be a special remote. In this case, I have to either
-traverse the git working copy, or perhaps traverse the whole git-annex
-branch (which would have the same problems with filenames not being
-available).
-
-If a traversal is done, it should check all remotes, not just
-one. Probably worth handling the case where a remote is connected
-while in the middle of such a scan, so part of the scan needs to be
-redone to check it.
+  to a remote, so it won't be uploading content to us.) Note: Does not
+  need to use the TransferScanner, if we get and check a list of the changed
+  files.
 
 ## longer-term TODO
 
@@ -75,6 +36,12 @@ redone to check it.
 * speed up git syncing by using the cached ssh connection for it too
   (will need to use `GIT_SSH`, which needs to point to a command to run,
   not a shell command line)
+* Map the network of git repos, and use that map to calculate
+  optimal transfers to keep the data in sync. Currently a naive flood fill
+  is done instead.
+* Find a more efficient way for the TransferScanner to find the transfers
+  that need to be done to sync with a remote. Currently it walks the git
+  working copy and checks each file.
 
 ## data syncing
 
@@ -99,6 +66,47 @@ reachable remote.
 This is worth doing first, since it's the simplest way to get the basic
 functionality of the assistant to work. And we'll need this anyway.
 
+## TransferScanner
+
+The TransferScanner thread needs to find keys that need to be Uploaded
+to a remote, or Downloaded from it.
+
+How to find the keys to transfer? I'd like to avoid potentially
+expensive traversals of the whole git working copy if I can.
+(Currently, the TransferScanner does do the naive and possibly expensive
+scan of the git working copy.)
+
+One way would be to do a git diff between the (unmerged) git-annex branches
+of the git repo, and its remote. Parse that for lines that add a key to
+either, and queue transfers. That should work fairly efficiently when the
+remote is a git repository. Indeed, git-annex already does such a diff
+when it's doing a union merge of data into the git-annex branch. It
+might even be possible to have the union merge and scan use the same
+git diff data.
+
+But that approach has several problems:
+
+1. The list of keys it would generate wouldn't have associated git
+   filenames, so the UI couldn't show the user what files were being
+   transferred.
+2. Worse, without filenames, any later features to exclude
+   files/directories from being transferred wouldn't work.
+3. Looking at a git diff of the git-annex branches would find keys
+   that were added to either side while the two repos were disconnected.
+   But if the two repos' keys were not fully in sync before they
+   disconnected (which is quite possible; transfers could be incomplete),
+   the diff would not show those older out of sync keys.
+
+The remote could also be a special remote. In this case, I have to either
+traverse the git working copy, or perhaps traverse the whole git-annex
+branch (which would have the same problems with filenames not being
+available).
+
+If a traversal is done, it should check all remotes, not just
+one. Probably worth handling the case where a remote is connected
+while in the middle of such a scan, so part of the scan needs to be
+redone to check it.
+
 ## done
 
 1. Can use `git annex sync`, which already handles bidirectional syncing.
-- 
cgit v1.2.3


From 2e085c6383f096a58d1e9b52ae457f9491850c7f Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:31:26 -0400
Subject: blog for the day

---
 .../assistant/blog/day_43__simple_scanner.mdwn | 37 ++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 doc/design/assistant/blog/day_43__simple_scanner.mdwn

diff --git a/doc/design/assistant/blog/day_43__simple_scanner.mdwn b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
new file mode 100644
index 000000000..11ee3cca4
--- /dev/null
+++ b/doc/design/assistant/blog/day_43__simple_scanner.mdwn
@@ -0,0 +1,37 @@
+Milestone: I can run `git annex assistant`, plug in a USB drive, and it
+automatically transfers files to get the USB drive and current repo back in
+sync.
+
+I decided to implement the naive scan, to find files needing to be
+transferred. So it walks through `git ls-files` and checks each file
+in turn. I've deferred less expensive, more sophisticated approaches to later.
+
+I did some work on the TransferQueue, which now keeps track of the length
+of the queue, and can block attempts to add Transfers to it if it gets too
+long. This was a nice use of STM, which let me implement that without using
+any locking.
+
+[[!format haskell """
+atomically $ do
+    sz <- readTVar (queuesize q)
+    if sz <= wantsz
+        then enqueue schedule q t (stubInfo f remote)
+        else retry -- blocks until queuesize changes
+"""]]
+
+Anyway, the point was that, as the scan finds Transfers to do,
+it doesn't build up a really long TransferQueue, but instead is blocked
+from running further until some of the files get transferred. The resulting
+interleaving of the scan thread with transfer threads means that transfers
+start fairly quickly upon a USB drive being plugged in, and kind of hides
+the inefficiencies of the scanner, which will most of the time be
+swamped out by the IO-bound large data transfers.
+
+---
+
+At this point, the assistant should do a good job of keeping repositories
+in sync, as long as they're all interconnected, or on removable media
+like USB drives. There's lots more work to be done to handle use cases
+where repositories are not well-connected, but since the assistant's
+[[syncing]] now covers at least a couple of use cases, I'm ready to move
+on to the next phase. [[Webapp]], here we come!
-- 
cgit v1.2.3
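
Two small sketches may help make the notes above concrete. First, the day 43
post quotes only the queue-throttling fragment of the real code, so here is a
minimal, self-contained illustration of the pattern it describes: a scanner
thread walks `git ls-files` and feeds a bounded queue, blocking via STM
`retry` whenever the queue is full, so that slow transfer workers naturally
throttle the scan. The queue bound, the list-backed queue, and the stand-in
transfer action are invented for the example; this is not git-annex's actual
TransferQueue code.

[[!format haskell """
-- Toy model of the scanner/TransferQueue interleaving described above.
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM
import Control.Monad (forM_, forever, replicateM_)
import System.Process (readProcess)

maxQueued :: Int
maxQueued = 10  -- block the scanner once this many items are waiting

-- Add an item, but only while the queue is below maxQueued;
-- retry blocks the caller until a worker shrinks the queue.
enqueueBlocking :: TVar [FilePath] -> FilePath -> STM ()
enqueueBlocking q f = do
    fs <- readTVar q
    if length fs < maxQueued
        then writeTVar q (fs ++ [f])
        else retry

-- Remove one item, blocking while the queue is empty.
dequeueBlocking :: TVar [FilePath] -> STM FilePath
dequeueBlocking q = do
    fs <- readTVar q
    case fs of
        [] -> retry
        (f:rest) -> writeTVar q rest >> return f

main :: IO ()
main = do
    q <- newTVarIO []
    -- A couple of "transfer" workers; the delay stands in for real IO.
    replicateM_ 2 $ forkIO $ forever $ do
        f <- atomically (dequeueBlocking q)
        putStrLn ("transferring " ++ f)
        threadDelay 100000
    -- The naive scan: every file in the working tree, in turn.
    files <- lines <$> readProcess "git" ["ls-files"] ""
    forM_ files $ \f -> atomically (enqueueBlocking q f)
    threadDelay 2000000 -- let the workers drain before this toy exits
"""]]

Second, the TransferScanner notes in the first patch suggest diffing the
local and remote git-annex branches instead of walking the working copy. A
rough sketch of that idea follows; it assumes each key's location log is
stored in the git-annex branch as a `<key>.log` file and that the remote's
branch has been fetched as `refs/remotes/<remote>/git-annex`, and, as the
notes point out, it yields keys with no associated file names.

[[!format haskell """
-- Sketch: keys whose location logs differ between the local and remote
-- git-annex branches are candidates for upload or download.
import Data.List (isSuffixOf)
import System.FilePath (dropExtension, takeFileName)
import System.Process (readProcess)

changedKeys :: String -> IO [String]
changedKeys remote = do
    out <- readProcess "git"
        ["diff", "--name-only", "git-annex",
         "refs/remotes/" ++ remote ++ "/git-annex"]
        ""
    return
        [ dropExtension (takeFileName p)  -- strip ".log" to get the key
        | p <- lines out
        , ".log" `isSuffixOf` p
        ]

main :: IO ()
main = changedKeys "origin" >>= mapM_ putStrLn
"""]]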