The [[syncing]] design assumes the network is connected. But it's often
not in these pre-IPv6 days, so the cloud needs to be used to bridge between
LANs.

## more cloud providers

Git-annex already supports storing large files in 
several cloud providers via [[special_remotes]].
More should be added, such as:

* Google Drive (attractive because it's free, though only 5 GB)
* OpenStack Swift (the future)
* Box.com (it's free, and the current method is hard to set up and somewhat
  shaky; a better method would be to use its API)
* Dropbox? That would be ironic. Via its API, presumably.
* [[Amazon Glacier|todo/special_remote_for_amazon_glacier]]
* [nimbus.io](https://nimbus.io/) Fairly low prices ($0.06/GB);
  REST API; free software

See poll at [[polls/prioritizing_special_remotes]].

## The cloud notification problem

Alice and Bob have repos, and there is a cloud remote they both share.
Alice adds a file; the assistant transfers it to the cloud remote.
How does Bob find out about it?

There are two parts to this problem. Bob needs to find out that there's
been a change to Alice's git repo. Then he needs to pull from Alice's git repo,
or some other repo in the cloud she pushed to. Once both steps are done,
the assistant will transfer the file from the cloud to Bob.

* dvcs-autosync uses Jabber; all repos need to have the same Jabber account
  configured, and send self-messages. An alternative would be to have
  different accounts that join a channel or message each other. That still
  needs account configuration.
* IRC could be used. With a default IRC network and an agreed-upon channel,
  no configuration should be needed. IRC might be harder to get through
  some firewalls, and is prone to netsplits, etc. IRC networks have reasons
  to be wary of bots using them. Only basic notifications could be done over
  IRC, as it has little security.
* When there's an ssh server involved, code could be run on it to notify
  logged-in clients. But this is not a general solution to this problem.
* pubsubhubbub does not seem like an option; its hubs want to pull down
  a feed over http.
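
Whichever transport carries the notification, the two-step flow described above can be sketched as a small in-memory simulation. This is only an illustration: `Repo`, `add_file` and `on_notification` are hypothetical names, plain sets stand in for the cloud remote and for the git repo Alice pushes to, and no real git or git-annex calls are made.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Hypothetical stand-in for a git repository plus its annexed content."""
    name: str
    known_files: set = field(default_factory=set)    # files recorded in git
    present_files: set = field(default_factory=set)  # content held locally


def add_file(repo, cloud, shared_git, filename):
    """Alice's side: record the file, upload its content, push the git change."""
    repo.known_files.add(filename)
    repo.present_files.add(filename)
    cloud.add(filename)        # content goes to the cloud remote
    shared_git.add(filename)   # git metadata goes to a repo Bob can reach


def on_notification(repo, cloud, shared_git):
    """Bob's side, run when a notification arrives.

    Step 1 was the notification itself; step 2 is the pull. Only after
    both can the missing content be fetched from the cloud remote.
    """
    repo.known_files |= shared_git                  # stands in for "git pull"
    wanted = repo.known_files - repo.present_files  # content Bob lacks
    fetched = {f for f in wanted if f in cloud}     # transfer from the cloud
    repo.present_files |= fetched
    return fetched
```

Note that the notification here carries no payload at all; Bob learns what changed only from the pull.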

### jabber TODO

* test with big servers, eg google chat
* Prevent idle disconnection. Probably means sending or receiving pings,
  but would prefer to avoid eg pinging every 60 seconds as some clients do.
* Make the git-annex clients invisible, so a user can use their regular
  account without always seeming to be present when git-annex is logged in.
  See <http://xmpp.org/extensions/xep-0126.html>
* webapp configuration
* After pulling from a remote, may need to scan for transfers, which
  could involve other remotes (ie, S3). Since the remote client is not able to
  talk to us directly, it won't be able to upload any new files to us.
  Need a fast way to find new files, and get them transferring. The expensive
  transfer scan may be needed to get fully in sync, but is too expensive to
  run every time this happens.
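
One possible shape for the fast path in the last item is to look only at paths the pull actually changed, rather than scanning the whole repository. The sketch below assumes the caller can supply a path-to-key mapping from before and after the pull (in a real repository the changed paths might come from something like `git diff --name-only` between the two refs); `files_to_fetch` and its parameters are hypothetical names, not git-annex functions.

```python
def files_to_fetch(old_tree, new_tree, present_keys):
    """Return keys whose content should be queued for download.

    old_tree / new_tree map a file path to the key of its content before
    and after the pull. Only paths that were added or changed are looked
    at, so the cost scales with the size of the pull, not the repository.
    """
    wanted = []
    for path, key in new_tree.items():
        if old_tree.get(path) == key:
            continue            # unchanged by this pull
        if key in present_keys:
            continue            # content is already here
        if key not in wanted:
            wanted.append(key)  # keep first-seen order, no duplicates
    return wanted
```

This misses anything that was already wanted before the pull, which is why the expensive full scan would still be needed occasionally.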

### jabber security

Any data git-annex sends over XMPP will be visible to the XMPP
account's buddies, to the XMPP server, and quite likely to other interested
parties. So it's important to consider the security exposure of using it.

If git-annex sends only a single bit notification, this lets attackers know
when the user is active and changing files, although the assistant's other
syncing activities can somewhat mask this.

As soon as git-annex does anything unlike any other client, an attacker can
see how many clients are connected for a user, fingerprint the ones
running git-annex, and so determine how many clients are running git-annex.

If git-annex sent the UUID of the remote it pushed to, this would let
attackers determine how many different remotes are being used,
and map some of the connections between clients and remotes.
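
To make the difference in exposure concrete, here is a rough sketch of what an observer could tally from notifications they can see. The `(client, remote_uuid)` tuple format and the `analyze` function are assumptions made for illustration only; a `remote_uuid` of `None` stands for a content-free, single bit notification.

```python
from collections import defaultdict


def analyze(observed):
    """Summarize what an observer learns from (client, remote_uuid) pairs."""
    remotes_by_client = defaultdict(set)
    for client, uuid in observed:
        remotes_by_client[client]  # any message at all reveals the client
        if uuid is not None:
            remotes_by_client[client].add(uuid)
    all_remotes = set().union(*remotes_by_client.values())
    return {
        "clients": len(remotes_by_client),
        "remotes": len(all_remotes),
        "map": {c: sorted(r) for c, r in remotes_by_client.items()},
    }
```

With single bit notifications the observer can still count clients, but the remote count stays at zero and the map stays empty; once UUIDs are included, both fill in.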

## storing git repos in the cloud

Of course, one option is to just use GitHub etc. to store the git repo.

Two things can store git repos in Amazon S3:

* <http://gabrito.com/post/storing-git-repositories-in-amazon-s3-for-high-availability>
* <http://wiki.cs.pdx.edu/oss2009/index/projects/gits3.html>

Another option is to not store the git repo in the cloud, but push/pull
peer-to-peer. When peers cannot directly talk to one another, this could be
bounced through something like XMPP.
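
Since XMPP carries XML text, binary git data would need to be encoded and probably split into message-sized pieces before it could be relayed. The following is a minimal sketch of that idea only, assuming base64 encoding and an arbitrary `max_chars` limit; `to_messages` and `from_messages` are hypothetical helpers, and no actual XMPP library or git-annex protocol is implied.

```python
import base64


def to_messages(payload: bytes, max_chars: int = 4096):
    """Split binary git data into numbered, text-safe messages.

    Each message is (sequence number, total count, base64 chunk).
    """
    text = base64.b64encode(payload).decode("ascii")
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def from_messages(messages):
    """Reassemble messages that may have arrived out of order."""
    ordered = sorted(messages)
    total = ordered[0][1]
    if len(ordered) != total or [m[0] for m in ordered] != list(range(total)):
        raise ValueError("missing or duplicate chunk")
    return base64.b64decode("".join(chunk for _, _, chunk in ordered))
```

Anything sent this way is exposed in the same way as the notifications discussed under jabber security, so the data would likely need to be encrypted before encoding.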