Once files are added (or removed or moved), those changes need to be sent to
all the other git clones, at both the git level and the key/value level.

## immediate action items

* At startup, and possibly periodically, or when the network connection
  changes, or when some heuristic suggests that a remote was disconnected
  from us for a while, queue remotes for processing by the TransferScanner,
  so it can queue Transfers of files that either the remote or we are missing.
* After git sync, identify content that we don't have that is now available
  on remotes, and transfer. (Needed when we have a uni-directional connection
  to a remote, so it won't be uploading content to us.) 
  But first, need to ensure that when a remote
  receives content, and updates its location log, it syncs that update
  out.

## TransferScanner

The TransferScanner thread needs to find keys that need to be Uploaded
to a remote, or Downloaded from it.

How to find the keys to transfer? I'd like to avoid potentially
expensive traversals of the whole git working copy if I can.

One way would be to do a git diff between the (unmerged) git-annex branches
of the git repo, and its remote. Parse that for lines that add a key to
either, and queue transfers. That should work fairly efficiently when the
remote is a git repository. Indeed, git-annex already does such a diff
when it's doing a union merge of data into the git-annex branch. It
might even be possible to have the union merge and scan use the same
git diff data.
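
A rough sketch of the diff-parsing idea, in Python for brevity. It makes several assumptions that are not spelled out above: that the input is unified diff text such as `git diff` would produce between the two git-annex branches, that location logs are files named `<key>.log` inside hashed subdirectories, and that each log line reads `timestamp presence uuid`, with a presence of `1` meaning the content is there. The name `keys_added` and the sample data are made up for illustration.

```python
import posixpath

def keys_added(diff_text):
    """Map each key to the set of repository uuids that gained it in the diff."""
    added = {}
    current_key = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # Header naming the new side of a file, e.g. "+++ b/abc/def/KEY.log"
            path = line[4:].strip()
            if path.startswith("b/"):
                path = path[2:]
            # Assumption: only logs inside hashed subdirectories are location logs
            if "/" in path and path.endswith(".log"):
                current_key = posixpath.basename(path)[:-len(".log")]
            else:
                current_key = None
        elif line.startswith("+") and current_key is not None:
            fields = line[1:].split()
            # Assumed log line format: timestamp, presence flag, uuid
            if len(fields) == 3 and fields[1] == "1":
                added.setdefault(current_key, set()).add(fields[2])
    return added

SAMPLE_DIFF = """\
diff --git a/abc/def/SHA256-s10--aaaa.log b/abc/def/SHA256-s10--aaaa.log
--- a/abc/def/SHA256-s10--aaaa.log
+++ b/abc/def/SHA256-s10--aaaa.log
@@ -1 +1,2 @@
 1343672820.5s 1 uuid-local
+1343672900.1s 1 uuid-remote
diff --git a/uuid.log b/uuid.log
--- a/uuid.log
+++ b/uuid.log
@@ -1 +1,2 @@
+uuid-remote some description timestamp=1343672900.1s
"""
```

Each key in the result could then be queued for transfer, though as the problems listed next show, it would arrive without a filename.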

But that approach has several problems:

1. The list of keys it would generate wouldn't have associated git
   filenames, so the UI couldn't show the user what files were being
   transferred.
2. Worse, without filenames, any later features to exclude
   files/directories from being transferred wouldn't work.
3. Looking at a git diff of the git-annex branches would find keys
   that were added to either side while the two repos were disconnected.
   But if the two repos' keys were not fully in sync before they
   disconnected (which is quite possible; transfers could be incomplete),
   the diff would not show those older out of sync keys.
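
Problem 3 can be made concrete with a toy example. The key names and sets here are invented; the point is only that a diff covers what changed during the disconnection, not what was already out of sync before it.

```python
def keys_to_upload(local_keys, remote_keys):
    """Keys we have that the remote lacks, found by comparing full key sets."""
    return set(local_keys) - set(remote_keys)

# State when the repos disconnected: the transfer of "B" never finished.
local_before = {"A", "B"}
remote_before = {"A"}

# While disconnected, "C" is added locally.
local_after = local_before | {"C"}

# A diff of the branches only reveals what changed since the disconnection.
diff_only = local_after - local_before

# Comparing complete state also finds the older, still missing key.
full_compare = keys_to_upload(local_after, remote_before)
```

Here `diff_only` contains just `C`, while `full_compare` contains both `B` and `C`.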

The remote could also be a special remote. In this case, I have to either
traverse the git working copy, or perhaps traverse the whole git-annex
branch (which would have the same problems with filenames not being
available).

If a traversal is done, should check all remotes, not just
one. Probably worth handling the case where a remote is connected
while in the middle of such a scan, so part of the scan needs to be
redone to check it.
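
One possible shape for such a traversal, sketched in Python. Everything here is hypothetical: the function name, the `connects_at` parameter (which simulates a remote connecting just before a given file index is reached), and the simplification that only uploads are considered and we hold all content locally.

```python
def scan(files, has_content, remotes, connects_at=None):
    """Walk the working copy once, checking every file against all remotes.

    files: list of (filename, key) pairs.
    has_content: dict mapping remote name to the set of keys it holds.
    connects_at: dict mapping a file index to remotes that connect just
        before that file is processed (a stand-in for real events).
    Returns a list of (filename, key, remote) uploads to queue.
    """
    connects_at = connects_at or {}
    active = list(remotes)
    queued = []
    for i, (filename, key) in enumerate(files):
        for newcomer in connects_at.get(i, []):
            # Redo the part of the scan already completed, for the newcomer only.
            for old_name, old_key in files[:i]:
                if old_key not in has_content.get(newcomer, set()):
                    queued.append((old_name, old_key, newcomer))
            active.append(newcomer)
        for remote in active:
            if key not in has_content.get(remote, set()):
                queued.append((filename, key, remote))
    return queued

FILES = [("a.txt", "K1"), ("b.txt", "K2"), ("c.txt", "K3")]
HAS = {"usb": {"K1"}, "server": {"K2"}}
RESULT = scan(FILES, HAS, ["usb"], connects_at={2: ["server"]})
```

Since the traversal carries filenames along, the queued transfers keep the information the UI and any later exclusion rules would need.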

## longer-term TODO

* Test MountWatcher on LXDE.
* git-annex needs a simple speed control knob, which can be plumbed
  through to, at least, rsync. A good job for an hour in an
  airport somewhere.
* Find a way to probe available outgoing bandwidth, to throttle so
  we don't bufferbloat the network to death.
* Investigate the XMPP approach that dvcs-autosync uses, or other ways of
  signaling a change out of band.
* Add a hook, so when there's a change to sync, a program can be run
  and do its own signaling.
* --debug will show often unnecessary work being done. Optimise.
* This assumes the network is connected. It's often not, so the
  [[cloud]] needs to be used to bridge between LANs.
* Configurability, including only enabling git syncing but not data transfer;
  only uploading new files but not downloading, and only downloading
  files in some directories and not others. See for use cases:
  [[forum/Wishlist:_options_for_syncing_meta-data_and_data]]
* speed up git syncing by using the cached ssh connection for it too
  (will need to use `GIT_SSH`, which needs to point to a command to run,
  not a shell command line)
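
Since `GIT_SSH` has to name a single command, one way to handle the last item is a small wrapper script. This is only a sketch, assuming a POSIX system: the wrapper name and the socket path are made up, while `-S`, `ControlMaster` and `ControlPersist` are standard OpenSSH connection sharing options.

```python
import os
import shlex
import stat

def make_git_ssh_wrapper(directory, socket_path):
    """Write a wrapper script for GIT_SSH that reuses a cached ssh connection.

    Returns the script path and an environment dict to run git with.
    """
    script = os.path.join(directory, "git-ssh-wrapper")  # hypothetical name
    with open(script, "w") as f:
        f.write("#!/bin/sh\n")
        # git runs the wrapper with the host and command as arguments; pass them on.
        f.write('exec ssh -S %s -o ControlMaster=auto -o ControlPersist=yes "$@"\n'
                % shlex.quote(socket_path))
    # Add the owner execute bit so git can run the script directly.
    os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)
    env = dict(os.environ, GIT_SSH=script)
    return script, env
```

The returned `env` would then be passed to whatever launches `git push` or `git fetch`, for example `subprocess.run(["git", "fetch", remote], env=env)`.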

## misc todo

* It would be nice if, when a USB drive is connected, 
  syncing starts automatically. Use dbus on Linux?

## data syncing

There are two parts to data syncing. First, map the network and second,
decide what to sync when.

Mapping the network can reuse code in `git annex map`. Once the map is
built, we want to find paths through the network that reach all nodes
eventually, with the least cost. This is a minimum spanning tree problem,
except with a directed graph, so really an arborescence problem.

With the map, we can determine which nodes to push new content to. Then we
need to control those data transfers, sending to the cheapest nodes first,
and with appropriate rate limiting and control facilities.
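
To give a feel for the arborescence problem, here is a compact Python sketch of the Chu-Liu/Edmonds algorithm, which is a standard technique for it and not something git-annex is known to implement. It computes only the total cost of the cheapest arborescence; the node numbering and edge costs are invented for the example.

```python
def min_arborescence_cost(n, root, edges):
    """Total cost of a minimum spanning arborescence rooted at `root`.

    Nodes are numbered 0..n-1 and edges are (source, target, cost) triples.
    Returns None when some node cannot be reached from the root.
    """
    total = 0
    while True:
        # 1. Pick the cheapest incoming edge for every node except the root.
        cheapest = [None] * n
        parent = [None] * n
        for u, v, w in edges:
            if u != v and v != root and (cheapest[v] is None or w < cheapest[v]):
                cheapest[v], parent[v] = w, u
        cheapest[root] = 0
        if any(c is None for c in cheapest):
            return None
        total += sum(cheapest)
        # 2. Look for cycles among the chosen edges.
        component = [None] * n
        seen_in_walk = [None] * n
        count = 0
        for start in range(n):
            v = start
            while v != root and component[v] is None and seen_in_walk[v] != start:
                seen_in_walk[v] = start
                v = parent[v]
            if v != root and component[v] is None:
                # v lies on a new cycle; mark all its nodes as one component.
                u = parent[v]
                while u != v:
                    component[u] = count
                    u = parent[u]
                component[v] = count
                count += 1
        if count == 0:
            return total  # no cycles, so the chosen edges form the arborescence
        # 3. Contract each cycle into a single node and repeat.
        for v in range(n):
            if component[v] is None:
                component[v] = count
                count += 1
        edges = [(component[u], component[v], w - cheapest[v])
                 for u, v, w in edges if component[u] != component[v]]
        root = component[root]
        n = count

# Hypothetical network: 0 = this repo, 1 and 2 = remotes; costs are made up.
EDGES = [(0, 1, 5), (1, 2, 1), (2, 1, 1), (0, 2, 10)]
```

With these edges the cheapest way to reach both remotes from node 0 is 0→1 then 1→2, for a total cost of 6, even though nodes 1 and 2 each prefer the cheap edge from the other.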

This probably will need lots of refinements to get working well.

### first pass: flood syncing

Before mapping the network, the best we can do is flood all files out to every
reachable remote. This is worth doing first, since it's the simplest way to
get the basic functionality of the assistant to work. And we'll need this
anyway.
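
Flood syncing is simple enough to sketch in a few lines of Python. The names and the `reachable` predicate are hypothetical; a real implementation would feed the result into the Transfer queue.

```python
def flood_transfers(keys, remotes, reachable):
    """Queue an upload of every key to every remote that is reachable.

    reachable: function taking a remote name and returning True or False.
    """
    targets = [r for r in remotes if reachable(r)]
    return [(key, remote) for key in keys for remote in targets]

QUEUE = flood_transfers(["K1", "K2"], ["usb", "server", "offline-box"],
                        reachable=lambda r: r != "offline-box")
```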

## done

1. Can use `git annex sync`, which already handles bidirectional syncing.
   When a change is committed, launch the part of `git annex sync` that pushes
   out changes. **done**; changes are pushed out to all remotes in parallel
1. Watch `.git/refs/remotes/` for changes (which would be pushed in from
   another node via `git annex sync`), and run the part of `git annex sync`
   that merges in received changes, and follow it by the part that pushes out
   changes (sending them to any other remotes).
   [The watching can be done with the existing inotify code! This avoids needing
   any special mechanism to notify a remote that it's been synced to.]  
   **done**
1. Periodically retry pushes that failed.  **done** (every half an hour)
1. Also, detect if a push failed due to not being up-to-date, pull,
   and repush. **done**
2. Use a git merge driver that adds both conflicting files,
   so conflicts never break a sync. **done**
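
The pull-and-repush step from the list above could look roughly like this. It is a sketch with injected `push` and `pull` callables, so nothing here reflects how `git annex sync` is actually structured; `PushRejected` and the fake remote are invented for the example.

```python
class PushRejected(Exception):
    """Raised by push() when the remote has changes we lack (hypothetical)."""

def push_with_retry(push, pull, max_attempts=3):
    """Try to push; after a rejection, pull and try again.

    Returns the number of attempts used, or None if every attempt failed
    (in which case a periodic retry would pick it up later).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            push()
            return attempt
        except PushRejected:
            if attempt < max_attempts:
                pull()  # merge in what the remote has before repushing
    return None

# A fake remote that is ahead of us until we pull once.
state = {"remote_ahead": True, "pulls": 0}

def fake_push():
    if state["remote_ahead"]:
        raise PushRejected("not up-to-date")

def fake_pull():
    state["remote_ahead"] = False
    state["pulls"] += 1

ATTEMPTS = push_with_retry(fake_push, fake_pull)
```

In this run the first push is rejected, one pull brings us up to date, and the second push succeeds.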

* on-disk transfers in progress information files (read/write/enumerate)
  **done**
* locking for the files, so redundant transfer races can be detected,
  and failed transfers noticed **done**
* transfer info for git-annex-shell **done**
* update files as transfers proceed. See [[progressbars]]
  (updating for downloads is easy; for uploads is hard)
* add Transfer queue TChan **done**
* add TransferInfo Map to DaemonStatus for tracking transfers in progress.
  **done**
* Poll transfer in progress info files for changes (use inotify again!
  wow! hammer, meet nail..), and update the TransferInfo Map **done**
* enqueue Transfers (Uploads) as new files are added to the annex by
  Watcher. **done**
* enqueue Transfers (Downloads) as new dangling symlinks are noticed by
  Watcher. **done**
  (Note: Needs git-annex branch to be merged before the tree is merged,
  so it knows where to download from. Checked and this is the case.)
* Write basic Transfer handling thread. Multiple such threads need to be
  able to be run at once. Each will need its own independent copy of the
  Annex state monad. **done**
* Write transfer control thread, which decides when to launch transfers.
  **done**
* Transfer watching has a race on kqueue systems, which causes fast
  transfers that have finished to go unnoticed by the TransferWatcher. That
  in turn prevents the transfer slot from being freed and any further
  transfers from happening. So, this approach is too fragile to rely on for
  maintaining the TransferSlots. Instead, need [[todo/assistant_threaded_runtime]],
  which would allow running something for sure when a transfer thread
  finishes. **done**
* Test MountWatcher on KDE, and add whatever dbus events KDE emits when
  drives are mounted. **done**
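
The transfer queue, the map of transfers in progress, and the limited number of transfer slots fit together roughly as in this Python sketch. It is an analogy only: `queue.Queue` stands in for the TChan, a dict for the TransferInfo Map, and the function and status names are made up. The worker records the outcome itself as soon as `perform` returns or raises, which is the "run something for sure when a transfer thread finishes" property mentioned above.

```python
import queue
import threading

def run_transfers(transfers, perform, slots=2):
    """Run perform(transfer) for each queued transfer, using `slots` threads.

    Returns a dict mapping each transfer to "done" or "failed".
    """
    pending = queue.Queue()
    for t in transfers:
        pending.put(t)
    status = {}
    status_lock = threading.Lock()

    def worker():
        while True:
            try:
                t = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do; this slot is released
            with status_lock:
                status[t] = "running"
            try:
                perform(t)
                outcome = "done"
            except Exception:
                outcome = "failed"
            # Recorded by the worker itself, so a fast transfer cannot be missed.
            with status_lock:
                status[t] = outcome

    threads = [threading.Thread(target=worker) for _ in range(slots)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return status

def fake_perform(transfer):
    """Stand-in for a real transfer; fails for one hypothetical remote."""
    direction, key, remote = transfer
    if remote == "broken":
        raise IOError("transfer failed")

TRANSFERS = [("Upload", "K1", "usb"), ("Download", "K2", "server"),
             ("Upload", "K3", "broken")]
STATUS = run_transfers(TRANSFERS, fake_perform)
```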