laptopfs
========

Benjamin Barenblat, October 16, 2021

Objective
---------

Build a highly available network file system.

Requirements
------------

laptopfs is a network file system designed for use on laptop home directories. This means it is

- a file system, allowing the users to manipulate files and directories using established working patterns.
- highly available and partition-resistant, supporting offline operation just as well as it supports online operation.
- secure by default, ensuring confidentiality and integrity whenever data leave the laptop, including when they are stored on file servers.
- designed for multiple concurrent active sessions.
- disk- and bandwidth-efficient, transferring and storing only the data users actually need.

laptopfs is not

- designed for collaboration. If two clients modify the file system at the same time, the server will reject one client’s write; while the unfortunate client will likely succeed on a second attempt, this does not scale well and should generally be avoided.
- a completely POSIX-compatible file system. In particular, laptopfs does not support hard links, device nodes, or IPC mechanisms (named pipes, sockets, etc.).

Out of Scope
------------

Though laptopfs encrypts data whenever they leave the laptop and ensures the user’s data are encrypted remotely, it does not encrypt data while they are stored on the local disk. If a user requires this level of security, they should leverage the functionality built into their operating system.

laptopfs clients use file system modification times to resolve certain types of writer–writer races. To ensure that races are resolved in an unsurprising way, clients should be well-synchronized to a coherent clock source. NTP (or even SNTP) should be sufficient for this.

Background
----------

Distributed computing has captured both public imagination and corporate attention for over half a century. The stories of Arthur C. Clarke envision a distant future where mainframes are shared between entire neighborhoods. Later, Sun Microsystems promised a distributed future, proclaiming that “The network is the computer”. So-called distributed operating systems like Plan 9 from Bell Labs continue to draw attention decades after their development mostly stopped.

However, the promised land of distributed computing has mostly failed to materialize. Comparing visions against reality, it’s easy to see why: Today’s world consists not of a vast network of computers communicating at the speed of light, but rather a vast archipelago of high-speed, high-reliability network fabrics joined by comparatively slow, sketchy, and expensive links. It is telling that even security-conscious corporations have had to make firewall exceptions for Mosh, a remote terminal protocol that supports low bandwidth and high latency, and that the world’s most popular “distributed” version control system, Git, is the one that allows users to work productively with no network connection at all.

While researchers and developers have made incredible progress building fast and secure file systems that operate on individual machines and on high-speed network fabrics, there currently exists no file system that truly embraces the archipelago model. NFS and AFS can be configured to support low bandwidth and high latency, but neither allows disconnected operation. Past efforts to add support for true disconnected operation to existing network file systems have stalled.

Most users thus access their remote files either by creating working copies on local disk (cf. Dropbox, Unison) or by offloading computing elsewhere (cf. Google Docs, Overleaf). These solutions satisfy most users’ requirements but also present considerable drawbacks, ranging from cognitive overhead from tracking multiple copies to philosophical frustration from using closed-source software.

Related Work
------------

No past effort to create a network file system that supports disconnected operation has yet succeeded. Most have definitively failed, crushed under their own complexity.

[Coda](http://coda.cs.cmu.edu/), a fork of AFS that aims to support disconnected operation, has been under development since 1987. Dormant for many years, it has recently received a burst of new activity. Reflecting its origins in the 1980s, Coda’s code quality and RPC protocol leave something to be desired, though work is occurring to bring it up to modern standards. Despite this work, Coda still includes significant known security bugs and is not yet suitable for production deployment on the internet.

[Bazil](https://bazil.org/) was an attempt to create a decentralized, distributed file system that supported disconnected operation. Though the project produced a high-quality [Go FUSE library](https://github.com/bazil/fuse), development on Bazil itself seems to have stalled around 2015.

Robert Mustacchi’s bachelor’s thesis, [StashFS: Generalized Disconnected Operation](https://cs.brown.edu/research/pubs/theses/ugrad/2010/mustacchi.pdf), described a FUSE overlay file system that added disconnected operation to _any_ network file system. Because it operates as an overlay file system, StashFS is somewhat constrained in the assumptions it can make, but it seems fundamentally sound. Unfortunately, its code is not publicly available, and the author did not respond to email asking for a copy.

[Dropbox](https://dropbox.com/) and [Google Drive](https://drive.google.com/) are premier proprietary cloud storage offerings. However, they are synchronization services, not file systems, which means they are fairly heavy on disk usage. Furthermore, each uses proprietary server software and RPC protocols, making them impossible to self-host and difficult to reliably extend.

Detailed Design
---------------

A laptopfs network consists of one server and zero or more clients. The server holds the entire file system and passes hunks of it to interested clients; it also notifies connected clients when the file system has changed. The clients get hunks of the file system from the server, upload new hunks, and present the contents of the file system to users via FUSE.

### The Server ###

Fundamentally, the laptopfs server implements an RPC-accessible key-value store. It has a few extra bits to facilitate use of that store as a file system, but the basic RPC interface to laptopfsd looks an awful lot like the interface to a hash table: You can request a hunk of data by ID, or you can upload a new hunk of data. Unlike most hash tables, however, clients see laptopfsd’s key-value store as append-only: They cannot delete or modify entries in the store. This prevents reader–writer races between clients. (Writer–writer races are addressed later.)

Using the append-only key-value store, it’s fairly easy to implement a copy-on-write file system by differentiating between three types of hunks:

- A data hunk stores raw data.
- A file hunk is a hunk containing a serialized vector of IDs, offsets, and lengths, each triplet representing a slice of a data hunk. Retrieving those data hunks, slicing them appropriately, and concatenating the slices yields the contents of the file.
- A directory hunk is a hunk containing a set of file hunk IDs as well as a name, mtime, mode, etc. for each file.

The server tracks the ID of the root directory, and clients can update that ID using a special RPC. This is the one destructive operation that a client can perform, and it has several safeguards built into it. Notably, an update-root RPC does not simply request that the server update the root to a new value; it requests that the server update the root _from_ an old value _to_ a new value. If the old value does not match the current root ID, the server will reject the transaction. This enables a client to detect that its change would have clobbered another client’s changes, at which point the client can initiate a merge process and try again.

Because mutating the file system necessarily involves at least two changes (one to upload the new root directory and one to set it as the root), the server allows clients to group mutations into transactions. To avoid round trips during the transactions, the server requires clients to generate new IDs; it will reject a transaction if the transaction attempts to use an ID already in use.

Since clients see the store as append-only, the server must garbage-collect it periodically to avoid unbounded growth. This means the server needs to know the links between hunks in the file system. To facilitate this – while leaving the server as oblivious as possible to the actual contents of the file system – clients use AEAD to reveal the relevant parts of file and directory hunks.

- Data hunks encrypt everything, leaving the associated data region empty.
- File hunks encrypt data hunk offsets and lengths, leaving the data hunk IDs in the associated data region.
- Directory hunks encrypt file names, mtimes, modes, etc., leaving the file hunk IDs in the associated data region.

The server implements a simple mark-and-sweep garbage collection algorithm and runs it at thresholds configurable by the administrator. During the collection, the server will reject transactions that mutate the file system, but since this is strictly more favorable to the client than completely disconnected operation, clients can cope.

To allow clients to detect changes, the server allows clients to subscribe for updates to the root.

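Before the full RPC surface is spelled out below, here is a minimal sketch, in Go, of how a server might validate and apply a transaction. Every name in it (`Store`, `Hunk`, `Op`, and their fields) is hypothetical and purely illustrative, not part of the laptopfs design; the sketch only demonstrates the two rules described above: client-generated hunk IDs must be fresh, and the root update is a compare-and-swap.

```go
// Sketch only: hypothetical types standing in for the server's store; the
// real server speaks the RPC interface shown below and persists its data.
package server

import (
	"errors"
	"sync"
)

// Hunk mirrors the wire-level hunk: an ID, server-visible associated data
// (the referenced hunk IDs), and an encrypted payload the server never reads.
type Hunk struct {
	ID                string
	ReferencedHunkIDs []string
	Data              []byte
}

// UpdateRoot is a compare-and-swap on the root directory ID.
type UpdateRoot struct {
	OldID, NewID string
}

// Op is one transaction operation; exactly one of Put or Root is non-nil.
type Op struct {
	Put  *Hunk
	Root *UpdateRoot
}

type Store struct {
	mu     sync.Mutex
	hunks  map[string]Hunk
	rootID string
}

func NewStore(rootID string) *Store {
	return &Store{hunks: make(map[string]Hunk), rootID: rootID}
}

var (
	errIDInUse   = errors.New("hunk ID already in use")
	errStaleRoot = errors.New("old root ID does not match current root")
)

// ExecuteTransaction validates every operation before applying any of them,
// so a rejected transaction leaves the store untouched.
func (s *Store) ExecuteTransaction(ops []Op) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	// Validation pass: every uploaded hunk must use a fresh, client-generated
	// ID, and every root update must name the current root as its old value.
	fresh := make(map[string]bool)
	for _, op := range ops {
		switch {
		case op.Put != nil:
			if _, taken := s.hunks[op.Put.ID]; taken || fresh[op.Put.ID] {
				return errIDInUse
			}
			fresh[op.Put.ID] = true
		case op.Root != nil:
			if op.Root.OldID != s.rootID {
				return errStaleRoot
			}
		}
	}

	// Apply pass: from the client's point of view the store is append-only,
	// so the only mutation beyond inserting new hunks is moving the root.
	for _, op := range ops {
		switch {
		case op.Put != nil:
			s.hunks[op.Put.ID] = *op.Put
		case op.Root != nil:
			s.rootID = op.Root.NewID
		}
	}
	return nil
}
```

In this sketch, rejecting the transaction before applying anything is what lets a client treat a failure as a clean signal to merge against the new root and try again.
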
The final RPC interface for the server, in pseudo-gRPC, thus looks like this:

```
message Hunk {
  message AssociatedData {
    repeated bytes referenced_hunk_ids
  }

  bytes id
  AssociatedData associated_data [security = signed]
  bytes data [security = signed+encrypted]
}

message GetHunkRequest {
  repeated bytes ids
}

message GetHunkResponse {
  repeated Hunk hunks  // any requested hunks that exist
}

message TransactionOperation {
  message PutHunk {
    Hunk hunk
  }

  message UpdateRoot {
    bytes old_id
    bytes new_id
  }

  oneof op {
    PutHunk put_hunk
    UpdateRoot update_root
  }
}

message ExecuteTransactionRequest {
  repeated TransactionOperation ops
}

message ExecuteTransactionResponse {}

message RootIdRequest {
  bool include_hunk
}

message RootIdResponse {
  bytes new_root_id
  optional Hunk new_root
}

service Laptopfs {
  rpc GetHunk(GetHunkRequest) returns (GetHunkResponse) {}
  rpc ExecuteTransaction(ExecuteTransactionRequest) returns (ExecuteTransactionResponse) {}

  // This returns the current value immediately and then streams updates as
  // they happen.
  rpc RootId(RootIdRequest) returns (stream RootIdResponse) {}
}
```

### The Client ###

Since the laptopfs server is oblivious to most data in the system, most complexity appears in the laptopfs client. The client has three roles: It must maintain a local disk cache of the state that exists on the server, it must reveal that state to the user through a FUSE interface, and it must push user changes back to the server.

The laptopfs client implements its on-disk cache using both a key-value store (like the server’s) and a log. The key-value store is a pure cache of the server state; the client never mutates it except in response to an announcement received from the server. The log, then, stores a record of all mutations that the client has made but not pushed to the server. This allows the client’s file system bits to mimic the implementation of a log-structured file system, a well-understood design that is both relatively simple and fairly performant.

One set of client worker threads is responsible for maintaining the client’s store. At startup, the client starts a `RootId` RPC with the server to get the current ID of the root directory. It then executes a breadth-first search of the file system, fetching all directory and file hunks. Data hunks are not fetched in advance – instead, they’re fetched as the user tries to read them. (The client can apply whatever heuristics it wants to try to make reads as low-latency as possible, but basic readahead, triggered on open(2), lseek(2), and read(2), should cover the common cases.) The end result is that while file access might lag a bit, navigating through the file system itself should always feel snappy. As the root directory ID changes, the client restarts its breadth-first search, optionally after waiting for the file system to settle.

As the client receives hunks from the server, it decrypts them and verifies their integrity; before it uploads new hunks to the server, it encrypts them. Data hunks are thus stored on disk simply as their contents, while file and directory hunks have more structure:

```
message Timestamp {
  uint64 seconds
  fixed32 nanoseconds
}

message FileHunk {
  message HunkSlice {
    uint64 offset
    uint64 length
  }

  repeated HunkSlice slices

  // The actual last time the data in the file changed.
  Timestamp true_modification_time
}

message DirectoryHunk {
  message FileInfo {
    message Xattr {
      bytes name
      bytes value
    }

    bytes name
    uint32 mode
    Timestamp mtime
    repeated Xattr xattrs
  }

  repeated FileInfo files

  // The actual last time the data in the directory hunk changed.
  Timestamp true_modification_time
}
```

As the user interacts with the FUSE frontend, the client computes a view of the file system by replaying the log on top of the store, and it adds entries to the log corresponding to user changes. Log entries are all timestamped; each can be one of the following operations:

- WRITE(data, path) (creating the file if it does not yet exist)
- DELETE(path)
- RENAME(path, path)
- SETMTIME(path, timestamp)
- CHMOD(path, mode)
- SETXATTR(path, name, value)

The algebra for replaying these changes is based on “last one wins” semantics.

Every few seconds (with the delay configurable by the user), the client updates the server on anything that has changed locally. The client first compacts the log. It then computes the minimally sized transaction that will apply the changes contained in the log to the store. Finally, it attempts to commit this transaction to the server. If the transaction succeeds, the client drops the uploaded changes from the log. If it fails, the client does nothing and tries again later.

Like the server, the client uses a mark-and-sweep garbage collector to prevent its store from growing arbitrarily large. The client can also drop data to keep its disk usage small; it implements a simple least-recently-used scheme to determine when to drop data the user is likely no longer interested in.

laptopfs supports one important operation beyond the standard Unix file system API: the ability to pin files to local disk. The client maintains a list of files the user would like to have locally at all times and fetches them whenever it discovers they have changed. These files are exempt from cache drops.

### Future Work ###

This design is highly extensible. Provided clients and servers all implement the RPC interface with the same semantics, multiple client and server implementations can coexist. The RPC interface is even transport-agnostic, and one could envision a client that supported multiple transports via plugins. The client has virtually unlimited latitude to heuristically prefetch and drop data, and heuristics will be an area of active exploration after the initial implementation.

Security and Privacy Considerations
-----------------------------------

laptopfs encrypts most data stored in the file system using 256-bit AES-GCM with a key accessible only to the user. This produces fairly good security properties:

- Integrity: Neither the server nor an active adversary can modify data that the user has uploaded. A malicious server could serve old data, but clients can detect if it throws away a transaction (as it will not signal a root invalidation after a transaction successfully completes).
- Confidentiality: The file system data are fully confidential, but metadata are not. In particular, the server knows the general tree structure of the file system and can estimate the number of files in each directory. The server does not, however, have access to file names and can only make educated guesses at file sizes.
- Availability: The server or an active adversary can deny service at any time. On the other hand, laptopfs is designed to work with a disconnected server.

Work Estimates
--------------

You can always stop at any time.

TODO: We need a slightly better log interface – this one doesn’t do very well when you’re trying to change a small part of a very large file (e.g., updating metadata in a ripped Blu-Ray).
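
For concreteness, here is a minimal sketch, in Go, of the client log entries and the “last one wins” replay described above. All types and names are hypothetical (the real client replays the log on top of its hunk store, not an in-memory map), but the sketch makes the problem noted in the TODO visible: a WRITE entry carries the file’s entire contents, so changing a few bytes of a large file puts the whole file back into the log.

```go
// Sketch only: hypothetical types modeling the client log and its
// last-one-wins replay; not part of the laptopfs design.
package client

import (
	"sort"
	"time"
)

// Entry is one timestamped log operation. Op selects which fields matter.
type Entry struct {
	Time       time.Time
	Op         string // WRITE, DELETE, RENAME, SETMTIME, CHMOD, or SETXATTR
	Path       string
	NewPath    string    // RENAME target
	Data       []byte    // WRITE: the file's entire new contents
	Mtime      time.Time // SETMTIME
	Mode       uint32    // CHMOD
	XattrName  string    // SETXATTR
	XattrValue []byte    // SETXATTR
}

// File is the replayed view of a single file.
type File struct {
	Data  []byte
	Mode  uint32
	Mtime time.Time
	Xattr map[string][]byte
}

// Replay applies the log, in timestamp order, on top of a base view; the last
// entry to touch a path wins. Note that every WRITE replaces the whole file,
// which is why small edits to very large files are expensive here.
func Replay(base map[string]File, log []Entry) map[string]File {
	view := make(map[string]File, len(base))
	for p, f := range base {
		view[p] = f
	}
	sort.Slice(log, func(i, j int) bool { return log[i].Time.Before(log[j].Time) })
	for _, e := range log {
		switch e.Op {
		case "WRITE":
			f := view[e.Path] // the zero value creates the file if needed
			f.Data = e.Data
			view[e.Path] = f
		case "DELETE":
			delete(view, e.Path)
		case "RENAME":
			if f, ok := view[e.Path]; ok {
				delete(view, e.Path)
				view[e.NewPath] = f
			}
		case "SETMTIME":
			if f, ok := view[e.Path]; ok {
				f.Mtime = e.Mtime
				view[e.Path] = f
			}
		case "CHMOD":
			if f, ok := view[e.Path]; ok {
				f.Mode = e.Mode
				view[e.Path] = f
			}
		case "SETXATTR":
			if f, ok := view[e.Path]; ok {
				// Copy the map so the base view is never mutated.
				xattrs := make(map[string][]byte, len(f.Xattr)+1)
				for k, v := range f.Xattr {
					xattrs[k] = v
				}
				xattrs[e.XattrName] = e.XattrValue
				f.Xattr = xattrs
				view[e.Path] = f
			}
		}
	}
	return view
}
```

Under this model, compacting the log amounts to dropping entries that a later entry makes irrelevant (for example, a WRITE followed by a DELETE of the same path).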