summaryrefslogtreecommitdiff
path: root/doc/design/assistant/blog/day_14__thinking_about_syncing.mdwn
blob: 4173fbf777b9ee8ac5b58a17da9e670a808340b1 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
Pondering [[syncing]] today. I will be doing syncing of the git repository
first, and working on syncing of file data later.

The former seems straightforward enough, since we just want to push all
changes to everywhere. Indeed, git-annex already has a [[sync]] command
that uses a smart technique to allow syncing between clones without a
central bare repository. (Props to Joachim Breitner for that.)

But it's not all easy. Syncing should happen as fast as possible, so
changes show up without delay. Eventually it'll need to support syncing
between nodes that cannot directly contact one-another. Syncing needs to
deal with nodes coming and going; one example of that is a USB drive being
plugged in, which should immediately be synced, but network can also come
and go, so it should periodically retry nodes it failed to sync with. To
start with, I'll be focusing on fast syncing between directly connected
nodes, but I have to keep this wider problem space in mind.

One problem with `git annex sync` is that it has to be run in both clones
in order for changes to fully propagate. This is because git doesn't allow
pushing changes into a non-bare repository; so instead it drops off a new
branch in `.git/refs/remotes/$foo/synced/master`. Then when it's run locally
it merges that new branch into `master`. 

So, how to trigger a clone to run `git annex sync` when syncing to it?
Well, I just realized I have spent two weeks developing something that can
be repurposed to do that! [[Inotify]] can watch for changes to
`.git/refs/remotes`, and the instant a change is made, the local sync
process can be started. This avoids needing to make another ssh connection
to trigger the sync, so is faster and allows the data to be transferred
over another protocol than ssh, which may come in handy later.

So, in summary, here's what will happen when a new file is created:

1. inotify event causes the file to be added to the annex, and
   immediately committed.
2. new branch is pushed to remotes (probably in parallel)
3. remotes notice new sync branch and merge it
4. (data sync, TBD later)
5. file is fully synced and available

Steps 1, 2, and 3 should all be able to be accomplished in under a second.
The speed of `git push` making a ssh connection will be the main limit
to making it fast. (Perhaps I should also reuse git-annex's existing ssh
connection caching code?)