From 892f1e6abefefee06dd3d2a3de8e9682f1848d88 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Sun, 22 Jul 2012 23:49:52 -0400
Subject: TransferScanner design thoughts

---
 doc/design/assistant/syncing.mdwn | 53 +++++++++++++++++++++++++++++++++------
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index c8fb9882a..a0e8d9d05 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -3,16 +3,55 @@ all the other git clones, at both the git level and the key/value level.
 
 ## immediate action items
 
-* At startup, and possibly periodically, look for files we have that
-  location tracking indicates remotes do not, and enqueue Uploads for
-  them. Also, enqueue Downloads for any files we're missing.
+* At startup, and possibly periodically, or when the network connection
+  changes, or some heuristic suggests that a remote was disconnected from
+  us for a while, queue remotes for processing by the TransferScanner,
+  to queue Transfers of files it or we're missing.
 * After git sync, identify content that we don't have that is now available
-  on remotes, and transfer. But first, need to ensure that when a remote
+  on remotes, and transfer. (Needed when we have a uni-directional connection
+  to a remote, so it won't be uploading content to us.)
+  But first, need to ensure that when a remote
   receives content, and updates its location log, it syncs that update out.
-* When MountWatcher detects a newly mounted drive, rescan git remotes
-  in order to get ones on the drive, and do a git sync and file transfers
-  to sync any repositories on it.
+
+## TransferScanner
+
+The TransferScanner thread needs to find keys that need to be Uploaded
+to a remote, or Downloaded from it.
+
+How to find the keys to transfer? I'd like to avoid potentially
+expensive traversals of the whole git working copy if I can.
+
+One way would be to do a git diff between the (unmerged) git-annex branches
+of the git repo, and its remote. Parse that for lines that add a key to
+either, and queue transfers. That should work fairly efficiently when the
+remote is a git repository. Indeed, git-annex already does such a diff
+when it's doing a union merge of data into the git-annex branch. It
+might even be possible to have the union merge and scan use the same
+git diff data.
+
+But that approach has several problems:
+
+1. The list of keys it would generate wouldn't have associated git
+   filenames, so the UI couldn't show the user what files were being
+   transferred.
+2. Worse, without filenames, any later features to exclude
+   files/directories from being transferred wouldn't work.
+3. Looking at a git diff of the git-annex branches would find keys
+   that were added to either side while the two repos were disconnected.
+   But if the two repos' keys were not fully in sync before they
+   disconnected (which is quite possible; transfers could be incomplete),
+   the diff would not show those older out of sync keys.
+
+The remote could also be a special remote. In this case, I have to either
+traverse the git working copy, or perhaps traverse the whole git-annex
+branch (which would have the same problems with filesnames not being
+available).
+
+If a traversal is done, should check all remotes, not just
+one. Probably worth handling the case where a remote is connected
+while in the middle of such a scan, so part of the scan needs to be
+redone to check it.
 
 ## longer-term TODO
 
-- 
cgit v1.2.3
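
The branch-diff idea from the TransferScanner section above (diff the two git-annex branches, pick out lines that add a key on either side, queue transfers) could look roughly like the sketch below. It is only an illustration under stated assumptions, not how git-annex implements it: the names `Transfer` and `scan_branch_diff` are hypothetical, and the code assumes each key has a log file whose path ends in `<key>.log` and whose lines have the form `<timestamp>s <1|0> <uuid>`, with `1` meaning the content is present. The input is the text of a unified diff from our git-annex branch to the remote's.

```python
import re
from typing import Iterator, NamedTuple

# Assumed layout of a location-log line inside the diff body:
#   "+<timestamp>s 1 <uuid>"  or  "-<timestamp>s 1 <uuid>"   (1 = present)
LOG_LINE = re.compile(r"^[+-](\S+)\s+([01])\s+(\S+)$")

# Assumed layout of a per-key log path in the file headers: ".../<key>.log"
FILE_HEADER = re.compile(r"^(?:\+\+\+|---) [ab]/(?:.*/)?([^/]+)\.log$")


class Transfer(NamedTuple):
    """Hypothetical queue entry; carries a key only, no filename."""
    direction: str  # "Upload" or "Download"
    key: str


def scan_branch_diff(diff_text: str, our_uuid: str,
                     remote_uuid: str) -> Iterator[Transfer]:
    """Yield transfers implied by presence lines that differ between branches."""
    key = None
    seen = set()
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            key = None  # a new file section starts; forget the previous key
            continue
        header = FILE_HEADER.match(line)
        if header:
            key = header.group(1)
            continue
        if line.startswith(("+++ ", "--- ")):
            continue  # e.g. /dev/null, or a file that is not a key log
        entry = LOG_LINE.match(line)
        if entry is None or key is None:
            continue
        _timestamp, present, uuid = entry.groups()
        if present != "1":
            continue  # the line records that content was dropped, not added
        if uuid == remote_uuid:
            transfer = Transfer("Download", key)  # remote has it
        elif uuid == our_uuid:
            transfer = Transfer("Upload", key)    # we have it
        else:
            continue  # some third repository; not this scan's concern
        if transfer not in seen:
            seen.add(transfer)
            yield transfer
```

Note that the result contains keys only, which is exactly problem 1 in the list above: nothing in the diff says which working-copy filenames those keys belong to. It also sees only lines that differ between the two branches, so it inherits problem 3 as well.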