doc/design/assistant/syncing.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118

Once files are added (or removed or moved), need to send those changes to
all the other git clones, at both the git level and the key/value level.

## action items

* on-disk transfers in progress information files (read/write/enumerate)
  **done**
* locking for the files, so redundant transfer races can be detected,
  and failed transfers noticed **done**
* transfer info for git-annex-shell **done**
* update files as transfers proceed. See [[progressbars]]
  (updating for downloads is easy; for uploads is hard)
* add Transfer queue TChan
* enqueue Transfers (Uploads) as new files are added to the annex by
  Watcher.
* enqueue Tranferrs (Downloads) as new dangling symlinks are noticed by
  Watcher.
* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
* Poll transfer in progress info files for changes (use inotify again!
  wow! hammer, meet nail..), and update the TransferInfo Map
* Write basic Transfer handling thread. Multiple such threads need to be
  able to be run at once. Each will need its own independant copy of the 
  Annex state monad.
* Write transfer control thread, which decides when to launch transfers.
* At startup, and possibly periodically, look for files we have that
  location tracking indicates remotes do not, and enqueue Uploads for
  them. Also, enqueue Downloads for any files we're missing.
* Find a way to probe available outgoing bandwidth, to throttle so
  we don't bufferbloat the network to death.
* git-annex needs a simple speed control knob, which can be plumbed
  through to, at least, rsync. A good job for an hour in an
  airport somewhere.

## git syncing

1. Can use `git annex sync`, which already handles bidirectional syncing.
   When a change is committed, launch the part of `git annex sync` that pushes
   out changes. **done**; changes are pushed out to all remotes in parallel
1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
   another node via `git annex sync`), and run the part of `git annex sync`
   that merges in received changes, and follow it by the part that pushes out
   changes (sending them to any other remotes).
   [The watching can be done with the existing inotify code! This avoids needing
   any special mechanism to notify a remote that it's been synced to.]  
   **done**
1. Periodically retry pushes that failed.  **done** (every half an hour)
1. Also, detect if a push failed due to not being up-to-date, pull,
   and repush. **done**
2. Use a git merge driver that adds both conflicting files,
   so conflicts never break a sync. **done**
3. Investigate the XMPP approach like dvcs-autosync does, or other ways of
   signaling a change out of band.
4. Add a hook, so when there's a change to sync, a program can be run
   and do its own signaling.

## data syncing

There are two parts to data syncing. First, map the network and second,
decide what to sync when.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really a Arborescence problem.

With the map, we can determine which nodes to push new content to. Then we
need to control those data transfers, sending to the cheapest nodes first,
and with appropriate rate limiting and control facilities.

This probably will need lots of refinements to get working well.

### first pass: flood syncing

Before mapping the network, the best we can do is flood all files out to every
reachable remote. This is worth doing first, since it's the simplest way to
get the basic functionality of the assistant to work. And we'll need this
anyway.

### transfer tracking

* Upload added to queue by the watcher thread when it adds content.
* Download added to queue by the watcher thread when it seens new symlinks
  that lack content.
* Transfer threads started/stopped as necessary to move data.
  (May sometimes want multiple threads downloading, or uploading, or even both.)
	
	type TransferQueue = TChan [Transfer]
	-- add (M.Map Transfer TransferInfo) to DaemonStatus

	startTransfer :: Transfer -> Annex TransferID

	stopTransfer :: TransferID -> IO ()

The assistant needs to find out when `git-annex-shell` is receiving or
sending (triggered by another remote), so it can add data for those too.
This is important to avoid uploading content to a remote that is already
downloading it from us, or vice versa, as well as to in future let the web
app manage transfers as user desires. 

For files being received, it can see the temp file, but other than lsof
there's no good way to find the pid (and I'd rather not kill blindly).

For files being sent, there's no filesystem indication. So git-annex-shell
(and other git-annex transfer processes) should write a status file to disk.

Can use file locking on these status files to claim upload/download rights,
which will avoid races.

This status file can also be updated periodically to show amount of transfer
complete (necessary for tracking uploads).

## other considerations

It would be nice if, when a USB drive is connected,
syncing starts automatically. Use dbus on Linux?

This assumes the network is connected. It's often not, so the
[[cloud]] needs to be used to bridge between LANs.