From bd2b388fd8c668ed6fd031d0ed8a7edf3c7b67ee Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 25 Jul 2012 15:07:41 -0400
Subject: update

---
 doc/design/assistant/syncing.mdwn | 100 ++++++++++++++++++++------------------
 1 file changed, 54 insertions(+), 46 deletions(-)

(limited to 'doc/design/assistant/syncing.mdwn')

diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index f04f20218..3aeb76afc 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -5,53 +5,14 @@ all the other git clones, at both the git level and the key/value level.
 
 * At startup, and possibly periodically, or when the network connection
   changes, or some heuristic suggests that a remote was disconnected from
-  us for a while, queue remotes for processing by the TransferScanner,
-  to queue Transfers of files it or we're missing.
-* After git sync, identify content that we don't have that is now available
+  us for a while, queue remotes for processing by the TransferScanner.
+* Ensure that when a remote receives content, and updates its location log,
+  it syncs that update back out. Prerequisite for:
+* After git sync, identify new content that we don't have that is now available
   on remotes, and transfer. (Needed when we have a uni-directional connection
-  to a remote, so it won't be uploading content to us.)
-  But first, need to ensure that when a remote
-  receives content, and updates its location log, it syncs that update
-  out.
-
-## TransferScanner
-
-The TransferScanner thread needs to find keys that need to be Uploaded
-to a remote, or Downloaded from it.
-
-How to find the keys to transfer? I'd like to avoid potentially
-expensive traversals of the whole git working copy if I can.
-
-One way would be to do a git diff between the (unmerged) git-annex branches
-of the git repo, and its remote. Parse that for lines that add a key to
-either, and queue transfers. That should work fairly efficiently when the
-remote is a git repository. Indeed, git-annex already does such a diff
-when it's doing a union merge of data into the git-annex branch. It
-might even be possible to have the union merge and scan use the same
-git diff data.
-
-But that approach has several problems:
-
-1. The list of keys it would generate wouldn't have associated git
-   filenames, so the UI couldn't show the user what files were being
-   transferred.
-2. Worse, without filenames, any later features to exclude
-   files/directories from being transferred wouldn't work.
-3. Looking at a git diff of the git-annex branches would find keys
-   that were added to either side while the two repos were disconnected.
-   But if the two repos' keys were not fully in sync before they
-   disconnected (which is quite possible; transfers could be incomplete),
-   the diff would not show those older out of sync keys.
-
-The remote could also be a special remote. In this case, I have to either
-traverse the git working copy, or perhaps traverse the whole git-annex
-branch (which would have the same problems with filesnames not being
-available).
-
-If a traversal is done, should check all remotes, not just
-one. Probably worth handling the case where a remote is connected
-while in the middle of such a scan, so part of the scan needs to be
-redone to check it.
+  to a remote, so it won't be uploading content to us.) Note: Does not
+  need to use the TransferScanner, if we get and check a list of the changed
+  files.
 
 ## longer-term TODO
 
@@ -75,6 +36,12 @@ redone to check it.
 * speed up git syncing by using the cached ssh connection for it too
   (will need to use `GIT_SSH`, which needs to point to a command to run,
   not a shell command line)
+* Map the network of git repos, and use that map to calculate
+  optimal transfers to keep the data in sync. Currently a naive flood fill
+  is done instead.
+* Find a more efficient way for the TransferScanner to find the transfers
+  that need to be done to sync with a remote. Currently it walks the git
+  working copy and checks each file.
 
 ## data syncing
 
@@ -99,6 +66,47 @@ reachable remote.
 This is worth doing first, since it's the simplest way to get the basic
 functionality of the assistant to work. And we'll need this anyway.
 
+## TransferScanner
+
+The TransferScanner thread needs to find keys that need to be Uploaded
+to a remote, or Downloaded from it.
+
+How to find the keys to transfer? I'd like to avoid potentially
+expensive traversals of the whole git working copy if I can.
+(Currently, the TransferScanner does do the naive and possibly expensive
+scan of the git working copy.)
+
+One way would be to do a git diff between the (unmerged) git-annex branches
+of the git repo, and its remote. Parse that for lines that add a key to
+either, and queue transfers. That should work fairly efficiently when the
+remote is a git repository. Indeed, git-annex already does such a diff
+when it's doing a union merge of data into the git-annex branch. It
+might even be possible to have the union merge and scan use the same
+git diff data.
+
+But that approach has several problems:
+
+1. The list of keys it would generate wouldn't have associated git
+   filenames, so the UI couldn't show the user what files were being
+   transferred.
+2. Worse, without filenames, any later features to exclude
+   files/directories from being transferred wouldn't work.
+3. Looking at a git diff of the git-annex branches would find keys
+   that were added to either side while the two repos were disconnected.
+   But if the two repos' keys were not fully in sync before they
+   disconnected (which is quite possible; transfers could be incomplete),
+   the diff would not show those older out of sync keys.
+
+The remote could also be a special remote. In this case, I have to either
+traverse the git working copy, or perhaps traverse the whole git-annex
+branch (which would have the same problems with filesnames not being
+available).
+
+If a traversal is done, should check all remotes, not just
+one. Probably worth handling the case where a remote is connected
+while in the middle of such a scan, so part of the scan needs to be
+redone to check it.
+
 ## done
 
 1. Can use `git annex sync`, which already handles bidirectional syncing.
-- 
cgit v1.2.3
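
Note on the patch above: the TransferScanner text says the scanner currently walks the git working copy and checks each file, and that a traversal should check all remotes rather than just one. The following is a minimal Python sketch of that naive scan, not the actual implementation (git-annex itself is written in Haskell). Every name here (`Transfer`, `Direction`, `scan_transfers`, and the `working_copy`, `local_keys`, `remote_keys` inputs) is a hypothetical stand-in chosen for illustration; how presence information is really obtained from the location log is not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    UPLOAD = "upload"
    DOWNLOAD = "download"


@dataclass(frozen=True)
class Transfer:
    direction: Direction
    remote: str
    key: str
    filename: str  # kept so a UI or an exclude rule can refer to the file


def scan_transfers(working_copy, local_keys, remote_keys):
    """Naive scan: visit every file in the working copy, check every remote.

    working_copy: dict mapping filename -> key (hypothetical input)
    local_keys:   set of keys whose content is present locally
    remote_keys:  dict mapping remote name -> set of keys present there
    """
    transfers = []
    for filename, key in sorted(working_copy.items()):
        have_local = key in local_keys
        for remote, present in sorted(remote_keys.items()):
            have_remote = key in present
            if have_local and not have_remote:
                transfers.append(Transfer(Direction.UPLOAD, remote, key, filename))
            elif have_remote and not have_local:
                transfers.append(Transfer(Direction.DOWNLOAD, remote, key, filename))
    return transfers


if __name__ == "__main__":
    files = {"a.txt": "K1", "b.txt": "K2", "c.txt": "K3"}
    for t in scan_transfers(files, {"K1"}, {"origin": {"K2"}, "usb": {"K1", "K2"}}):
        print(t.direction.value, t.remote, t.filename)
```

Because the scan starts from filenames, each queued transfer carries a filename, which is what the diff-based approach in the patch is said to lack. The cost is a full traversal of the working copy, the expense the new TODO item wants to avoid. This sketch also queues a download from every remote that has a missing key; a real scanner would presumably pick a single source.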