From f27da7a1cc095dcaf9ce0cc2170fe98d3b050336 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 21 Jun 2012 20:02:00 -0400
Subject: blog for the day and design update

---
 .../blog/day_14__thinking_about_syncing.mdwn | 44 ++++++++++++++++++++++
 doc/design/assistant/syncing.mdwn | 14 +++++--
 2 files changed, 55 insertions(+), 3 deletions(-)
 create mode 100644 doc/design/assistant/blog/day_14__thinking_about_syncing.mdwn

(limited to 'doc')

diff --git a/doc/design/assistant/blog/day_14__thinking_about_syncing.mdwn b/doc/design/assistant/blog/day_14__thinking_about_syncing.mdwn
new file mode 100644
index 000000000..c4a700d13
--- /dev/null
+++ b/doc/design/assistant/blog/day_14__thinking_about_syncing.mdwn
@@ -0,0 +1,44 @@
+Pondering [[syncing]] today. I will be doing syncing of the git repository
+first, and working on syncing of file data later.
+
+The former seems straightforward enough, since we just want to push all
+changes to everywhere. Indeed, git-annex already has a [[sync]] command
+that uses a smart technique to allow syncing between clones without a
+central bare repository. (Props to Joachim Breitner for that.)
+
+But it's not all easy. Syncing should happen as fast as possible, so
+changes show up without delay. Eventually it'll need to support syncing
+between nodes that cannot directly contact one another. Syncing needs to
+deal with nodes coming and going; one example of that is a USB drive being
+plugged in, which should immediately be synced, but network can also come
+and go, so it should periodically retry nodes it failed to sync with. To
+start with, I'll be focusing on fast syncing between directly connected
+nodes, but I have to keep this wider problem space in mind.
+
+One problem with `git annex sync` is that it has to be run in both clones
+in order for changes to fully propagate. This is because git doesn't allow
+pushing changes into a non-bare repository; so instead it drops off a new
+branch in `.git/refs/remotes/$foo/synced/master`. Then when it's run locally
+it merges that new branch into `master`.
+
+So, how to trigger a clone to run `git annex sync` when syncing to it?
+Well, I just realized I have spent two weeks developing something that can
+be repurposed to do that! [[Inotify]] can watch for changes to
+`.git/refs/remotes`, and the instant a change is made, the local sync
+process can be started. This avoids needing to make another ssh connection
+to trigger the sync, so is faster and allows the data to be transferred
+over another protocol than ssh, which may come in handy later.
+
+So, in summary, here's what will happen when a new file is created:
+
+1. inotify event causes the file to be added to the annex, and
+   immediately committed.
+2. new branch is pushed to remotes (probably in parallel)
+3. remotes notice new sync branch and merge it
+4. (data sync, TBD later)
+5. file is fully synced and available
+
+Steps 1, 2, and 3 should all be able to be accomplished in under a second.
+The speed of `git push` making an ssh connection will be the main limit
+to making it fast. (Perhaps I should also reuse git-annex's existing ssh
+connection caching code?)
diff --git a/doc/design/assistant/syncing.mdwn b/doc/design/assistant/syncing.mdwn
index 0813b8b70..56c9692e3 100644
--- a/doc/design/assistant/syncing.mdwn
+++ b/doc/design/assistant/syncing.mdwn
@@ -3,13 +3,21 @@ all the other git clones, at both the git level and the key/value level.
 
 ## git syncing
 
-1. At regular intervals, just run `git annex sync`, which already handles
-   bidirectional syncing.
+1. Can use `git annex sync`, which already handles bidirectional syncing.
+   When a change is committed, launch the part of `git annex sync` that pushes
+   out changes.
+1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
+   another node via `git annex sync`), and run the part of `git annex sync`
+   that merges in received changes, and follow it by the part that pushes out
+   changes (sending them to any other remotes).
+   [The watching can be done with the existing inotify code! This avoids needing
+   any special mechanism to notify a remote that it's been synced to.]
 2. Use a git merge driver that adds both conflicting files, so conflicts never
    break a sync.
 3. Investigate the XMPP approach like dvcs-autosync does, or other ways of
    signaling a change out of band.
-4. Add a hook, so when there's a change to sync, a program can be run.
+4. Add a hook, so when there's a change to sync, a program can be run
+   and do its own signaling.
 
 ## data syncing
 
--
cgit v1.2.3
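
Reviewer's sketch (not part of the patch): the "remotes notice new sync branch and merge it" step rests on detecting that a ref under `.git/refs/remotes` has changed. The Python below is only an illustration of that detection logic under stated assumptions. It compares two snapshots of the directory instead of using inotify, it only sees loose ref files (not entries in `packed-refs`), and the helper names `snapshot_refs`, `changed_refs`, and `is_sync_branch` are hypothetical, not taken from git-annex.

```python
import os


def snapshot_refs(refs_dir):
    """Map each loose ref file under refs_dir (relative path) to the commit id it holds."""
    refs = {}
    for root, _dirs, files in os.walk(refs_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "r") as handle:
                    value = handle.read().strip()
            except OSError:
                # The ref vanished or is being rewritten; skip it this round.
                continue
            refs[os.path.relpath(path, refs_dir)] = value
    return refs


def changed_refs(before, after):
    """Return the sorted ref names that are new or now point at a different commit."""
    return sorted(name for name, value in after.items() if before.get(name) != value)


def is_sync_branch(ref_name):
    """True for refs shaped like <remote>/synced/<branch>, the layout the post describes."""
    parts = ref_name.split(os.sep)
    return len(parts) >= 3 and parts[1] == "synced"
```

A watcher built on this would take a snapshot, wait for a change notification (inotify in the design above), take a second snapshot, and start the merge half of `git annex sync` only if `changed_refs` reports at least one name for which `is_sync_branch` is true. Filtering on the `synced` component is an assumption made here so that ordinary fetches would not trigger a merge.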