laptopfs
========

Benjamin Barenblat, <bbarenblat@gmail.com>  
October 16, 2021


Objective
---------

Build a highly available network file system.


Requirements
------------

laptopfs is a network file system designed for use on laptop home directories.
This means it is

  - a file system, allowing users to manipulate files and directories using
    established working patterns.

  - highly available and partition-resistant, supporting offline operation just
    as well as it supports online operation.

  - secure by default, ensuring confidentiality and integrity whenever data
    leave the laptop, including when they are stored on file servers.

  - designed for multiple concurrent active sessions.

  - disk- and bandwidth-efficient, transferring and storing only the data users
    actually need.

laptopfs is not

  - designed for collaboration. If two clients modify the file system at the
    same time, the server will reject one client’s write; while the unfortunate
    client will likely succeed on a second attempt, this does not scale well and
    should generally be avoided.

  - a completely POSIX-compatible file system. In particular, laptopfs does not
    support hard links, device nodes, or IPC mechanisms (named pipes, sockets,
    etc.).


Out of Scope
------------

Though laptopfs encrypts data whenever they leave the laptop and ensures the
user’s data are encrypted remotely, it does not encrypt data while they are
stored on the local disk. If a user requires this level of security, they should
use the disk-encryption functionality built into their operating system.

laptopfs clients use file system modification times to resolve certain types of
writer–writer races. To ensure that races are resolved in an unsurprising way,
clients should be well-synchronized to a coherent clock source. NTP (or even
SNTP) should be sufficient for this.


Background
----------

Distributed computing has captured both public imagination and corporate
attention for over half a century. The stories of Arthur C. Clarke envision a
distant future where mainframes are shared between entire neighborhoods. Later,
Sun Microsystems promised a distributed future, proclaiming that “The network is
the computer”. So-called distributed operating systems like Plan 9 from Bell
Labs continue to draw attention decades after their development mostly stopped.

However, the promised land of distributed computing has mostly failed to
materialize. Comparing visions against reality, it’s easy to see why: Today’s
world consists not of a vast network of computers communicating at the speed of
light, but rather of a vast archipelago of high-speed, high-reliability network
fabrics joined by comparatively slow, sketchy, and expensive links. It is
telling that even security-conscious corporations have had to make firewall
exceptions for Mosh, a remote terminal protocol that supports low bandwidth and
high latency, and that the world’s most popular “distributed” version control
system, Git, is the one that allows users to work productively with no network
connection at all.

While researchers and developers have made incredible progress building fast and
secure file systems that operate on individual machines and on high-speed
network fabrics, there currently exists no file system that truly embraces the
archipelago model. NFS and AFS can be configured to support low bandwidth and
high latency, but neither allows disconnected operation. Past efforts to add
support for true disconnected operation to existing network file systems have
stalled. Most users thus access their remote files either by creating working
copies on local disk (cf. Dropbox, Unison) or by offloading computing elsewhere
(cf. Google Docs, Overleaf). These solutions satisfy most users’ requirements
but also present considerable drawbacks, ranging from cognitive overhead from
tracking multiple copies to philosophical frustration from using closed-source
software.


Related Work
------------

No past effort to create a network file system that supports disconnected
operation has yet succeeded. Most have definitively failed, crushed under their
own complexity.

[Coda](http://coda.cs.cmu.edu/), a fork of AFS that aims to support disconnected
operation, has been under development since 1987. Dormant for many years, it has
recently received a burst of new activity. Reflecting its origins in the 1980s,
Coda’s code quality and RPC protocol leave something to be desired, though work
is underway to bring it up to modern standards. Despite this work, Coda still
includes significant known security bugs and is not yet suitable for production
deployment on the internet.

[Bazil](https://bazil.org/) was an attempt to create a decentralized,
distributed file system that supported disconnected operation. Though the
project produced a high-quality [Go FUSE library](https://github.com/bazil/fuse),
development on Bazil itself seems to have stalled around 2015.

Robert Mustacchi’s bachelor’s thesis, [StashFS: Generalized Disconnected
Operation](https://cs.brown.edu/research/pubs/theses/ugrad/2010/mustacchi.pdf),
described a FUSE overlay file system that added disconnected operation to _any_
network file system. Because it operates as an overlay file system, StashFS is
somewhat constrained in the assumptions it can make, but it seems fundamentally
sound. Unfortunately, its code is not publicly available, and the author did not
respond to email asking for a copy.

[Dropbox](https://dropbox.com/) and [Google Drive](https://drive.google.com/)
are premier proprietary cloud storage offerings. However, they are
synchronization services, not file systems, which means they are fairly heavy on
disk usage. Furthermore, each uses proprietary server software and RPC
protocols, making them impossible to self-host and difficult to reliably extend.


Detailed Design
---------------

A laptopfs network consists of one server and zero or more clients. The server
holds the entire file system and passes hunks of it to interested clients; it
also notifies connected clients when the file system has changed. The clients
get hunks of the file system from the server, upload new hunks, and present the
contents of the file system to users via FUSE.


### The Server ###

Fundamentally, the laptopfs server implements an RPC-accessible key-value store.
It has a few extra bits to facilitate use of that store as a file system, but
the basic RPC interface to laptopfsd looks an awful lot like the interface to a
hash table: You can request a hunk of data by ID, or you can upload a new hunk of
data. Unlike most hash tables, however, laptopfsd’s key-value store appears
append-only to clients: They cannot delete or modify entries in the store. This prevents
reader–writer races between clients. (Writer–writer races are addressed later.)

Using the append-only key-value store, it’s fairly easy to implement a
copy-on-write file system by differentiating between three types of hunks:

  - A data hunk stores raw data.

  - A file hunk is a hunk containing a serialized vector of IDs, offsets, and
    lengths, each triplet representing a slice of a data hunk. Retrieving those
    data hunks, slicing them appropriately, and concatenating the slices yields
    the contents of the file.

  - A directory hunk is a hunk containing a set of file hunk IDs as well as a
    name, mtime, mode, etc. for each file.
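
To make the file-hunk representation concrete, here is a minimal sketch in Go;
the types and the `fetch` callback are hypothetical stand-ins for whatever a
client actually uses to retrieve and decrypt data hunks.

    // Sketch only: hypothetical types, not part of any existing laptopfs code.
    type HunkID string

    // Slice names one contiguous region of a data hunk.
    type Slice struct {
        DataHunk HunkID // which data hunk to read
        Offset   uint64 // where the region starts within that hunk
        Length   uint64 // how many bytes the region covers
    }

    // assembleFile concatenates the slices listed in a file hunk to recover the
    // file's contents. fetch is assumed to return the decrypted bytes of a data
    // hunk.
    func assembleFile(slices []Slice, fetch func(HunkID) ([]byte, error)) ([]byte, error) {
        var contents []byte
        for _, s := range slices {
            data, err := fetch(s.DataHunk)
            if err != nil {
                return nil, err
            }
            contents = append(contents, data[s.Offset:s.Offset+s.Length]...)
        }
        return contents, nil
    }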

The server tracks the ID of the root directory, and clients can update that ID
using a special RPC. This is the one destructive operation that a client can
perform, and it has several safeguards built into it. Notably, an update-root
RPC does not simply request that the server update the root to a new value; it
requests that the server update the root _from_ an old value _to_ a new value.
If the old value does not match the current root ID, the server will reject the
transaction. This enables a client to detect that its change would have
clobbered another client’s changes, at which point the client can initiate a
merge process and try again.
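
The check behind the update-root RPC is a compare-and-swap. A minimal sketch of
the server side follows, with hypothetical names; the production server would
perform this inside whatever transaction machinery it actually uses.

    import (
        "errors"
        "sync"
    )

    // Sketch only: not the real server implementation.
    type rootTracker struct {
        mu     sync.Mutex
        rootID string // ID of the current root directory hunk
    }

    // updateRoot swaps the root from oldID to newID, rejecting the request if
    // another client has already moved the root. The caller is then expected
    // to merge and retry.
    func (t *rootTracker) updateRoot(oldID, newID string) error {
        t.mu.Lock()
        defer t.mu.Unlock()
        if t.rootID != oldID {
            return errors.New("root has changed since the client last read it")
        }
        t.rootID = newID
        return nil
    }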

Because mutating the file system necessarily involves at least two changes (one
to upload the new root directory and one to set it as the root), the server
allows clients to group mutations into transactions. To avoid round trips during
a transaction, the server requires clients to generate new hunk IDs themselves;
it will reject a transaction that attempts to use an ID already in use.
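
To make the shape of a transaction concrete, a hedged sketch follows; the
design does not dictate how clients pick IDs, so drawing them from a
cryptographically secure random source is only one plausible scheme.

    import "crypto/rand"

    // Sketch only. newHunkID draws a fresh 256-bit random ID; the server will
    // reject the transaction if the ID collides with one already in use.
    func newHunkID() ([]byte, error) {
        id := make([]byte, 32)
        if _, err := rand.Read(id); err != nil {
            return nil, err
        }
        return id, nil
    }

    // A transaction that writes a new file might then contain, in order:
    //
    //   PutHunk(new data hunk)
    //   PutHunk(new file hunk referencing the data hunk)
    //   PutHunk(new directory hunks, leaf to root)
    //   UpdateRoot(old root ID, new root ID)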

Since clients see the store as append-only, the server must garbage-collect it
periodically to avoid unbounded growth. This means the server needs to know the
links between hunks in the file system. To facilitate this – while leaving the
server as oblivious as possible to the actual contents of the file system –
clients encrypt hunks with AEAD (authenticated encryption with associated data),
exposing the parts relevant to garbage collection as cleartext associated data:

  - Data hunks encrypt everything, leaving the associated data region empty.

  - File hunks encrypt data hunk offsets and lengths, leaving the data hunk IDs
    in the associated data region.

  - Directory hunks encrypt file names, mtimes, modes, etc., leaving the file
    hunk IDs in the associated data region.

The server implements a simple mark-and-sweep garbage collection algorithm and
runs it at thresholds configurable by the administrator. During the collection,
the server will reject transactions that mutate the file system, but since this
is strictly more favorable to the client than completely disconnected operation,
clients can cope.
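
A minimal sketch of the mark phase appears below; the server needs only the
referenced-hunk IDs that clients expose in each hunk's associated data, and IDs
are represented as plain strings here for brevity.

    // Sketch only. markLive walks the hunk graph from the current root,
    // following the hunk IDs each hunk exposes in its AEAD associated data,
    // and returns the set of reachable hunk IDs. Anything outside that set may
    // be swept.
    func markLive(root string, refs map[string][]string) map[string]bool {
        live := make(map[string]bool)
        stack := []string{root}
        for len(stack) > 0 {
            id := stack[len(stack)-1]
            stack = stack[:len(stack)-1]
            if live[id] {
                continue
            }
            live[id] = true
            stack = append(stack, refs[id]...)
        }
        return live
    }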

To let clients detect changes, the server allows them to subscribe to updates
of the root ID.

The final RPC interface for the server, in pseudo-gRPC, thus looks like this:

    message Hunk {
      message AssociatedData {
        repeated bytes referenced_hunk_ids
      }

      bytes id
      AssociatedData associated_data [security = signed]
      bytes data [security = signed+encrypted]
    }

    message GetHunkRequest {
      repeated bytes ids
    }

    message GetHunkResponse {
      repeated Hunk hunks  // any requested hunks that exist
    }


    message TransactionOperation {
      message PutHunk {
        Hunk hunk
      }

      message UpdateRoot {
        bytes old_id
        bytes new_id
      }

      oneof op {
        PutHunk put_hunk
        UpdateRoot update_root
      }
    }

    message ExecuteTransactionRequest {
      repeated TransactionOperation ops
    }

    message ExecuteTransactionResponse {}


    message RootIdRequest {
      bool include_hunk
    }

    message RootIdResponse {
      bytes new_root_id
      optional Hunk new_root
    }


    service Laptopfs {
      rpc GetHunk(GetHunkRequest) returns (GetHunkResponse) {}
      rpc ExecuteTransaction(ExecuteTransactionRequest) returns (ExecuteTransactionResponse) {}

      // This returns the current value immediately and then streams updates as
      // they happen.
      rpc RootId(RootIdRequest) returns (stream RootIdResponse) {}
    }


### The Client ###

Since the laptopfs server is oblivious to most data in the system, most
complexity appears in the laptopfs client. The client has three roles: It must
maintain a local disk cache of the state that exists on the server, it must
reveal that state to the user through a FUSE interface, and it must push user
changes back to the server.

The laptopfs client implements its on-disk cache using both a key-value store
(like the server’s) and a log. The key-value store is a pure cache of the server
state; the client never mutates it except in response to an announcement
received from the server. The log, then, stores a record of all mutations that
the client has made but not pushed to the server. This allows the client’s file
system layer to mimic a log-structured file system, a well-understood design
that is both relatively simple and fairly performant.

One set of client worker threads is responsible for maintaining the client’s
store. At startup, the client starts a `RootId` RPC with the server to get the
current ID of the root directory. It then executes a breadth-first search of the
file system, fetching all directory and file hunks. Data hunks are not fetched
in advance – instead, they’re fetched as the user tries to read them. (The
client can apply whatever heuristics it wants to try to make reads as
low-latency as possible, but basic readahead, triggered on open(2), lseek(2),
and read(2), should cover most common cases.) The end result is that while file
access might lag a bit, navigating through the file system itself should always
feel snappy. As the root directory ID changes, the client restarts its
breadth-first search, optionally after waiting for the file system to settle.
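
A sketch of that traversal appears below, under the same caveats;
`fetchAndCache` and `metadataChildren` are hypothetical callbacks standing in
for the client's store and for the hunk IDs exposed in each metadata hunk's
associated data (empty for file hunks, since data hunks are fetched lazily).

    // Sketch only. prefetchMetadata walks the file system breadth-first from
    // the root directory hunk, caching every directory and file hunk it finds.
    // Data hunks are deliberately skipped; they are fetched on demand.
    func prefetchMetadata(
        rootID string,
        fetchAndCache func(id string) error,
        metadataChildren func(id string) ([]string, error),
    ) error {
        queue := []string{rootID}
        for len(queue) > 0 {
            id := queue[0]
            queue = queue[1:]
            if err := fetchAndCache(id); err != nil {
                return err
            }
            children, err := metadataChildren(id)
            if err != nil {
                return err
            }
            queue = append(queue, children...)
        }
        return nil
    }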

As the client receives hunks from the server, it decrypts them and verifies their
integrity; before it uploads new hunks to the server, it encrypts them. Data
hunks are thus stored on disk simply as their contents, while file and directory
hunks have more structure:

    message Timestamp {
      uint64 seconds
      fixed32 nanoseconds
    }

    message FileHunk {
      message HunkSlice {
        uint64 offset
        uint64 length
      }

      repeated HunkSlice slices

      // The actual last time the data in the file changed.
      Timestamp true_modification_time
    }

    message DirectoryHunk {
      message FileInfo {
        message Xattr {
          bytes name
          bytes value
        }

        bytes name
        uint32 mode
        Timestamp mtime
        repeated Xattr xattrs
      }

      repeated FileInfo files

      // The actual last time the data in the directory hunk changed.
      Timestamp true_modification_time
    }
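
Before upload, these structures are serialized and sealed. As a hedged sketch
of that sealing step (key management and nonce handling are glossed over here
and are assumptions of this example, not something the design specifies),
encrypting a hunk body while authenticating its cleartext associated data might
look like:

    import (
        "crypto/aes"
        "crypto/cipher"
        "crypto/rand"
    )

    // Sketch only. sealHunk encrypts a serialized hunk body with 256-bit
    // AES-GCM and authenticates, without encrypting, the associated data the
    // server needs to see (such as the referenced hunk IDs).
    func sealHunk(key, body, associatedData []byte) (nonce, ciphertext []byte, err error) {
        block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
        if err != nil {
            return nil, nil, err
        }
        aead, err := cipher.NewGCM(block)
        if err != nil {
            return nil, nil, err
        }
        nonce = make([]byte, aead.NonceSize())
        if _, err := rand.Read(nonce); err != nil {
            return nil, nil, err
        }
        ciphertext = aead.Seal(nil, nonce, body, associatedData)
        return nonce, ciphertext, nil
    }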

As the user interacts with the FUSE frontend, the client computes a view of the
file system by replaying the log on top of the store and appends entries to the
log corresponding to user changes. All log entries are timestamped; each can be
one of the following operations:

  - WRITE(path, data) (creating the file if it does not yet exist)

  - DELETE(path)

  - RENAME(path, path)

  - SETMTIME(path, timestamp)

  - CHMOD(path, mode)

  - SETXATTR(path, name, value)

The algebra for replaying these changes is based on “last one wins” semantics.
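
A hedged sketch of that replay follows, using a whole-file map as the view for
brevity; the real client works against its hunk store and handles every
operation listed above, not just WRITE and DELETE.

    import (
        "sort"
        "time"
    )

    // Sketch only. A log entry records one not-yet-pushed mutation.
    type logEntry struct {
        when time.Time
        op   string // "WRITE", "DELETE", etc., as listed above
        path string
        data []byte // payload for WRITE; other operations carry other fields
    }

    // replay applies log entries, oldest first, on top of a view computed from
    // the store. Because later entries overwrite earlier ones, the result has
    // "last one wins" semantics.
    func replay(view map[string][]byte, log []logEntry) {
        sort.Slice(log, func(i, j int) bool { return log[i].when.Before(log[j].when) })
        for _, e := range log {
            switch e.op {
            case "WRITE":
                view[e.path] = e.data
            case "DELETE":
                delete(view, e.path)
            // RENAME, SETMTIME, CHMOD, and SETXATTR are handled analogously.
            }
        }
    }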

Every few seconds (with the delay configurable by the user), the client updates
the server on anything that has changed locally. The client first compacts the
log. It then computes the minimally sized transaction that will apply the
changes contained in the log to the store. Finally, it attempts to commit this
transaction to the server. If the transaction succeeds, the client drops the
uploaded changes from the log. If it fails, the client does nothing and tries
again later.
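
A sketch of that push loop appears below; the four function parameters are
hypothetical stand-ins for the client's real compaction, transaction-building,
commit, and log-trimming logic.

    import "time"

    // Sketch only. flushLoop periodically pushes local changes to the server.
    func flushLoop(
        interval time.Duration,
        compactLog func(),
        buildTransaction func() []byte, // minimal transaction applying the log to the store
        commit func(tx []byte) error,
        dropCommitted func(),
    ) {
        for range time.Tick(interval) {
            compactLog()
            tx := buildTransaction()
            if err := commit(tx); err != nil {
                // The server rejected the transaction (for example, because
                // another client moved the root); keep the log and retry on
                // the next tick.
                continue
            }
            dropCommitted()
        }
    }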

Like the server, the client uses a mark-and-sweep garbage collector to prevent
its store from growing arbitrarily large. The client can also drop data to keep
its disk usage small; it uses a simple least-recently-used scheme to decide
which data the user is likely no longer interested in.

laptopfs supports one important operation beyond the standard Unix file system
API: the ability to pin files to local disk. The client maintains a list of
files the user would like to have locally at all times and fetches them whenever
it discovers they have changed. These files are exempt from cache drops.
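
A sketch of the eviction pass, under the same caveats: all names here are
hypothetical, and a real client would track sizes and access times in its
on-disk store rather than in memory.

    import (
        "sort"
        "time"
    )

    // Sketch only. evict drops cached data hunks, least recently used first,
    // until disk usage falls back under budget. Pinned hunks are never dropped.
    func evict(
        lastUsed map[string]time.Time,
        size map[string]int64,
        pinned map[string]bool,
        used, budget int64,
        drop func(id string),
    ) {
        var candidates []string
        for id := range lastUsed {
            if !pinned[id] {
                candidates = append(candidates, id)
            }
        }
        sort.Slice(candidates, func(i, j int) bool {
            return lastUsed[candidates[i]].Before(lastUsed[candidates[j]])
        })
        for _, id := range candidates {
            if used <= budget {
                return
            }
            drop(id)
            used -= size[id]
        }
    }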


### Future Work ###

This design is highly extensible. Provided clients and servers all implement the
RPC interface with the same semantics, multiple client and server
implementations can coexist. The RPC interface is even transport-agnostic, and
one could envision a client that supported multiple transports via plugins.

The client has virtually unlimited latitude to heuristically prefetch and drop
data, and heuristics will be an area of active exploration after the initial
implementation.


Security and Privacy Considerations
-----------------------------------

laptopfs encrypts most data stored in the file system using 256-bit AES-GCM with
a key accessible only to the user. This produces fairly good security and
privacy properties:

  - Integrity: Neither the server nor an active adversary can modify data that
    the user has uploaded. A malicious server could serve old data, but clients
    can detect if it throws away a transaction (the corresponding root update
    will never appear on the RootId stream, and the client will notice).

  - Confidentiality: The file system data are fully confidential, but metadata
    are not. In particular, the server knows the general tree structure of the
    file system and can estimate the number of files in each directory. The
    server does not, however, have access to file names and can only make
    educated guesses at file sizes.

  - Availability: The server or an active adversary can deny service at any
    time. On the other hand, laptopfs is designed to keep working while it is
    disconnected from the server.


Work Estimates
--------------

You can always stop at any time.


TODO: We need a slightly better log interface – this one doesn’t do very well
when you’re trying to change a small part of a very large file (e.g., updating
metadata in a ripped Blu-Ray).