field | value | date |
---|---|---|
author | Joey Hess <joey@kitenet.net> | 2012-07-25 23:18:39 -0400 |
committer | Joey Hess <joey@kitenet.net> | 2012-07-25 23:18:39 -0400 |
commit | abe5a73d3f50edc679cd990c0e8e27c36b775d29 | |
tree | 84c14cf012e9dfb9f061428ccfd1f752c7e07c37 /doc/design/assistant/syncing.mdwn | |
parent | 1ffef3ad75e51b7f66c4ffdd0935a0495042e5ae | |
parent | 3a02c7b635fc1017c05874b8a6f54a91a587651d | |
Merge branch 'master' into assistant
Diffstat (limited to 'doc/design/assistant/syncing.mdwn')
-rw-r--r-- | doc/design/assistant/syncing.mdwn | 100 |
1 files changed, 54 insertions, 46 deletions
diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index cc23f786f..4d7d70022 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -5,53 +5,14 @@ all the other git clones, at both the git level and the key/value level.
 
 * At startup, and possibly periodically, or when the network connection
   changes, or some heuristic suggests that a remote was disconnected from
-  us for a while, queue remotes for processing by the TransferScanner,
-  to queue Transfers of files it or we're missing.
-* After git sync, identify content that we don't have that is now available
+  us for a while, queue remotes for processing by the TransferScanner.
+* Ensure that when a remote receives content, and updates its location log,
+  it syncs that update back out. Prerequisite for:
+* After git sync, identify new content that we don't have that is now available
   on remotes, and transfer. (Needed when we have a uni-directional connection
-  to a remote, so it won't be uploading content to us.)
-  But first, need to ensure that when a remote
-  receives content, and updates its location log, it syncs that update
-  out.
-
-## TransferScanner
-
-The TransferScanner thread needs to find keys that need to be Uploaded
-to a remote, or Downloaded from it.
-
-How to find the keys to transfer? I'd like to avoid potentially
-expensive traversals of the whole git working copy if I can.
-
-One way would be to do a git diff between the (unmerged) git-annex branches
-of the git repo, and its remote. Parse that for lines that add a key to
-either, and queue transfers. That should work fairly efficiently when the
-remote is a git repository. Indeed, git-annex already does such a diff
-when it's doing a union merge of data into the git-annex branch. It
-might even be possible to have the union merge and scan use the same
-git diff data.
-
-But that approach has several problems:
-
-1. The list of keys it would generate wouldn't have associated git
-   filenames, so the UI couldn't show the user what files were being
-   transferred.
-2. Worse, without filenames, any later features to exclude
-   files/directories from being transferred wouldn't work.
-3. Looking at a git diff of the git-annex branches would find keys
-   that were added to either side while the two repos were disconnected.
-   But if the two repos' keys were not fully in sync before they
-   disconnected (which is quite possible; transfers could be incomplete),
-   the diff would not show those older out of sync keys.
-
-The remote could also be a special remote. In this case, I have to either
-traverse the git working copy, or perhaps traverse the whole git-annex
-branch (which would have the same problems with filesnames not being
-available).
-
-If a traversal is done, should check all remotes, not just
-one. Probably worth handling the case where a remote is connected
-while in the middle of such a scan, so part of the scan needs to be
-redone to check it.
+  to a remote, so it won't be uploading content to us.) Note: Does not
+  need to use the TransferScanner, if we get and check a list of the changed
+  files.
 
 ## longer-term TODO
 
@@ -75,6 +36,12 @@ redone to check it.
 * speed up git syncing by using the cached ssh connection for it too
   (will need to use `GIT_SSH`, which needs to point to a command to run,
   not a shell command line)
+* Map the network of git repos, and use that map to calculate
+  optimal transfers to keep the data in sync. Currently a naive flood fill
+  is done instead.
+* Find a more efficient way for the TransferScanner to find the transfers
+  that need to be done to sync with a remote. Currently it walks the git
+  working copy and checks each file.
 
 ## misc todo
 
@@ -105,6 +72,47 @@ reachable remote.
 This is worth doing first, since it's the simplest way to get the basic
 functionality of the assistant to work. And we'll need this anyway.
 
+## TransferScanner
+
+The TransferScanner thread needs to find keys that need to be Uploaded
+to a remote, or Downloaded from it.
+
+How to find the keys to transfer? I'd like to avoid potentially
+expensive traversals of the whole git working copy if I can.
+(Currently, the TransferScanner does do the naive and possibly expensive
+scan of the git working copy.)
+
+One way would be to do a git diff between the (unmerged) git-annex branches
+of the git repo, and its remote. Parse that for lines that add a key to
+either, and queue transfers. That should work fairly efficiently when the
+remote is a git repository. Indeed, git-annex already does such a diff
+when it's doing a union merge of data into the git-annex branch. It
+might even be possible to have the union merge and scan use the same
+git diff data.
+
+But that approach has several problems:
+
+1. The list of keys it would generate wouldn't have associated git
+   filenames, so the UI couldn't show the user what files were being
+   transferred.
+2. Worse, without filenames, any later features to exclude
+   files/directories from being transferred wouldn't work.
+3. Looking at a git diff of the git-annex branches would find keys
+   that were added to either side while the two repos were disconnected.
+   But if the two repos' keys were not fully in sync before they
+   disconnected (which is quite possible; transfers could be incomplete),
+   the diff would not show those older out of sync keys.
+
+The remote could also be a special remote. In this case, I have to either
+traverse the git working copy, or perhaps traverse the whole git-annex
+branch (which would have the same problems with filesnames not being
+available).
+
+If a traversal is done, should check all remotes, not just
+one. Probably worth handling the case where a remote is connected
+while in the middle of such a scan, so part of the scan needs to be
+redone to check it.
+
 ## done
 
 1. Can use `git annex sync`, which already handles bidirectional syncing.
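
The working-copy scan that the TransferScanner section describes as the current behaviour (walk the working copy, check each file, consider all remotes) can be sketched roughly as below. This is an illustrative Python sketch only, not git-annex's Haskell implementation. The names `Transfer`, `scan_working_copy`, the `"here"` label, and the dictionary shapes standing in for the working copy and the location log are all assumptions made for the example, as are the placeholder keys.

```python
from typing import Dict, List, NamedTuple, Set


class Transfer(NamedTuple):
    """A queued transfer. Hypothetical record, not a git-annex type."""
    direction: str  # "upload" or "download"
    remote: str
    key: str
    filename: str   # kept so a UI or exclude rule can refer to the file


def scan_working_copy(files: Dict[str, str],
                      locations: Dict[str, Set[str]],
                      remotes: List[str],
                      here: str = "here") -> List[Transfer]:
    """Naive scan: visit every file and compare where its key is present.

    files:     filename -> key, standing in for the git working copy.
    locations: key -> set of repositories known to hold the content,
               standing in for the location log.
    remotes:   every remote to check, not just one.
    """
    queue: List[Transfer] = []
    for filename, key in sorted(files.items()):
        holders = locations.get(key, set())
        if here in holders:
            # We have the content: upload to each remote that lacks it.
            for remote in remotes:
                if remote not in holders:
                    queue.append(Transfer("upload", remote, key, filename))
        else:
            # We lack the content: download once, from the first remote
            # that has it (if any).
            source = next((r for r in remotes if r in holders), None)
            if source is not None:
                queue.append(Transfer("download", source, key, filename))
    return queue


if __name__ == "__main__":
    files = {"a.txt": "KEY-a", "b.txt": "KEY-b", "c.txt": "KEY-c"}
    locations = {"KEY-a": {"here"}, "KEY-b": {"usb"}, "KEY-c": {"here", "usb"}}
    for transfer in scan_working_copy(files, locations, ["usb"]):
        print(transfer)
```

Because the scan starts from filenames, every queued transfer carries one, which is what the branch-diff approach lacks in problems 1 and 2. It also compares full presence state rather than recent changes, so keys that were already out of sync before a disconnect (problem 3) are still found. The cost is the traversal of every file that the design notes want to avoid.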