author     Tobias Werth <twerth@google.com>      2016-06-22 15:23:56 +0000
committer  Lukacs Berki <lberki@google.com>      2016-06-23 11:02:59 +0000
commit     590209f9c28e30c3643296a0284a046a9b7e8d36 (patch)
tree       cdf228036ffc1c16e509f06bfd895f0349a7f75b /site
parent     a5f6fa30595e882df7a694b9789dda339362a571 (diff)

Add document on bazel rule pitfalls.

MOS_MIGRATED_REVID=125567367

Diffstat (limited to 'site')

 -rw-r--r--  site/_layouts/contribute.html |   1
 -rw-r--r--  site/contributing.md          |   3
 -rw-r--r--  site/hard_to_write_rules.md   | 214

 3 files changed, 217 insertions, 1 deletion
diff --git a/site/_layouts/contribute.html b/site/_layouts/contribute.html
index 99eb29cc07..b0b29e0868 100644
--- a/site/_layouts/contribute.html
+++ b/site/_layouts/contribute.html
@@ -25,6 +25,7 @@ nav: contribute
 <nav class="sidebar collapse" id="sidebar-nav">
   <ul class="sidebar-nav">
     <li><a href="/contributing.html">Contributing to Bazel</a></li>
+    <li><a href="/hard_to_write_rules.html">Pitfalls</a></li>
     <li><a href="/users.html">Who's Using Bazel</a></li>
     <li><a href="/roadmap.html">Roadmap</a></li>
     <li><a href="/governance.html">Governance</a></li>

diff --git a/site/contributing.md b/site/contributing.md
index 41b0c84001..6ea17d48a5 100644
--- a/site/contributing.md
+++ b/site/contributing.md
@@ -139,7 +139,8 @@ Bazel is organized in several parts:
   you want to add rules, consider using [Skylark](docs/skylark/index.html)
   first.
 * Builtin rules in `com.google.devtools.build.lib.rules` and in
-  `com.google.devtools.build.lib.bazel.rules`.
+  `com.google.devtools.build.lib.bazel.rules`. You might want to read [Why is
+  it so difficult to write Bazel rules?](hard_to_write_rules.html) first.
 * Java native interfaces in `src/main/native`.
 * Various tooling for language support (see the list in the
   [compiling Bazel](#compile-bazel) section).

diff --git a/site/hard_to_write_rules.md b/site/hard_to_write_rules.md
new file mode 100644
index 0000000000..dd341d31ad
--- /dev/null
+++ b/site/hard_to_write_rules.md
@@ -0,0 +1,214 @@
---
layout: contribute
title: Why is it so difficult to write Bazel rules?
---

# Why is it so difficult to write Bazel rules?

We have heard feedback from various people multiple times that Bazel rules are
hard to write. There is no single root cause; it is due to a combination of
historical circumstances and intrinsic complexity in the problem domain. This
document attempts to give a high-level overview of the specific issues that we
believe to be the main contributors.
* Assumption: Aim for Correctness, Throughput, Ease of Use & Latency
* Assumption: Large Scale Repositories
* Assumption: BUILD-like Description Language
* Intrinsic: Remote Execution and Caching are Hard
* Historic: Hard Separation between Loading, Analysis, and Execution is
  Outdated, but still affects the API
* Intrinsic: Using Change Information for Correct and Fast Incremental Builds
  requires Unusual Coding Patterns
* Intrinsic: Avoiding Quadratic Time and Memory Consumption is Hard

## Assumption: Aim for Correctness, Throughput, Ease of Use & Latency

We assume that the build system needs to be first and foremost correct with
respect to incremental builds, i.e., for a given source tree, the output of the
same build should always be the same, regardless of what the output tree looks
like. To a first approximation, this means Bazel needs to know every single
input that goes into a given build step, so that it can rerun that step if any
of the inputs change. There are limits to how correct Bazel can get, as it
leaks some information such as the date/time of the build, and it ignores
certain kinds of changes, such as changes to file attributes. Sandboxing helps
ensure correctness by preventing reads of undeclared input files. Besides the
intrinsic limits of the system, there are a few known correctness issues, most
of which are related to Fileset or the C++ rules, which are both hard problems.
We have long-term efforts to fix these.

The second goal of the build system is high throughput; we are permanently
pushing the boundaries of what can be done within the current machine
allocation for the remote execution service. If the remote execution service
gets overloaded, nobody can get work done.

Ease of use comes next: of multiple correct approaches with the same (or
similar) footprint on the remote execution service, we choose the one that is
easier to use.
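To make the correctness criterion above concrete, a minimal staleness check might record a digest for every declared input of a build step and rerun the step whenever any digest changes. This is only an illustrative sketch; the function names (`digest`, `is_up_to_date`) are invented here and do not reflect Bazel's actual implementation.

```python
import hashlib

def digest(path):
    # Content digest of one declared input file.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def is_up_to_date(declared_inputs, recorded_digests):
    # A build step may be skipped only if the set of declared inputs is
    # unchanged AND every input still has the digest recorded when the
    # step last ran. Otherwise the step must be rerun.
    if set(declared_inputs) != set(recorded_digests):
        return False  # an input was added or removed
    return all(digest(p) == recorded_digests[p] for p in declared_inputs)
```

Note that this only works if the step truly declares every file it reads, which is exactly what sandboxing enforces.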
For the purpose of this document, latency denotes the time it takes from
starting a build to getting the intended result, whether that is a test log
from a passing or failing test, or an error message that a BUILD file has a
typo.

Note that these goals often interact: latency is as much a function of the
throughput of the remote execution service as correctness is relevant to ease
of use.

## Assumption: Large Scale Repositories

The build system needs to operate at the scale of large repositories, where
"large scale" means that the repository does not fit on a single hard drive, so
it is impossible to do a full checkout on virtually all developer machines. A
medium-sized build will need to read and parse tens of thousands of BUILD files
and evaluate hundreds of thousands of globs. While it is theoretically possible
to read all BUILD files on a single machine, we have not yet been able to do so
within a reasonable amount of time and memory. As such, it is critical that
BUILD files can be loaded and parsed independently.

## Assumption: BUILD-like Description Language

For the purpose of this document, we assume a configuration language that is
roughly similar to BUILD files: declarations of library and binary rules and
their interdependencies. BUILD files can be read and parsed independently, and
we avoid even looking at source files whenever we can (except to check for
existence).

## Intrinsic: Remote Execution and Caching are Hard

Remote execution and caching improve build times in large repositories by
roughly two orders of magnitude compared to running the build on a single
machine. However, the scale at which it needs to perform is staggering:
Google's remote execution service is designed to handle a huge number of
requests per second, and the protocol carefully avoids unnecessary round trips
as well as unnecessary work on the service side.
At this time, the protocol requires that the build system knows all inputs to a
given action ahead of time; the build system then computes a unique action
fingerprint and asks the scheduler for a cache hit. If a cache hit is found,
the scheduler replies with the digests of the output files; the files
themselves are addressed by digest later on. However, this imposes restrictions
on Bazel rules, which need to declare all input files ahead of time.

## Historic: Hard Separation between Loading, Analysis, and Execution is Outdated, but still affects the API

Technically, it is sufficient for a rule to know the input and output files of
an action just before the action is sent to remote execution. However, the
original Bazel code base had a strict separation of loading packages, then
analyzing rules using a configuration (command-line flags, essentially), and
only then running any actions. This distinction is still part of the rules API
today, even though the core of Bazel no longer requires it (more details
below).

That means that the rules API requires a declarative description of the rule
interface (what attributes it has, and the types of those attributes). There
are some exceptions where the API allows custom code to run during the loading
phase to compute implicit names of output files and implicit values of
attributes. For example, a `java_library` rule named `foo` implicitly generates
an output named `libfoo.jar`, which can be referenced from other rules in the
build graph.

Furthermore, the analysis of a rule cannot read any source files or inspect the
output of an action; instead, it needs to generate a partial directed bipartite
graph of build steps and output file names that is determined only by the rule
itself and its dependencies.
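The fingerprint-then-lookup step described above can be sketched as follows. This is a simplified model: the names (`action_fingerprint`, `execute_or_hit`) are invented for the sketch, and an in-memory dict stands in for the scheduler's action cache, which in reality is a remote service backed by a content-addressable store.

```python
import hashlib
import json

def action_fingerprint(command, input_digests):
    # The action key covers the command line and the digest of every
    # declared input; changing any input yields a different key.
    payload = json.dumps(
        {"command": command, "inputs": sorted(input_digests.items())},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Stand-in for the remote action cache: fingerprint -> output digests.
ACTION_CACHE = {}

def execute_or_hit(command, input_digests, run_action):
    key = action_fingerprint(command, input_digests)
    if key in ACTION_CACHE:
        return ACTION_CACHE[key]  # cache hit: reuse recorded output digests
    outputs = run_action()        # cache miss: actually run the action
    ACTION_CACHE[key] = outputs
    return outputs
```

The sketch makes the restriction on rules visible: `input_digests` must be complete before the lookup, which is why rules must declare all input files ahead of time.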
## Intrinsic: Using Change Information for Correct and Fast Incremental Builds requires Unusual Coding Patterns

Above, we argued that in order to be correct, Bazel needs to know all the input
files that go into a build step so that it can detect whether that build step
is still up to date. The same is true for package loading and rule analysis,
and we have designed [Skyframe](http://www.bazel.io/docs/skyframe.html) to
handle this in general. Skyframe is a graph library and evaluation framework
that takes a goal node (such as "build //foo with these options") and breaks it
down into its constituent parts, which are then evaluated and combined to yield
this result. As part of this process, Skyframe reads packages, analyzes rules,
and executes actions.

At each node, Skyframe tracks exactly which nodes any given node used to
compute its own output, all the way from the goal node down to the input files
(which are also Skyframe nodes). Having this graph explicitly represented in
memory allows the build system to identify exactly which nodes are affected by
a given change to an input file (including creation or deletion of an input
file), doing the minimal amount of work to restore the output tree to its
intended state.

As part of this, each node performs a dependency discovery process; i.e., each
node can declare dependencies and then use the contents of those dependencies
to declare even further dependencies. In principle, this maps well to a
thread-per-node model. However, medium-sized builds contain hundreds of
thousands of Skyframe nodes, which is far more threads than is feasible with
current Java technology (and, for historical reasons, we are currently tied to
using Java, so no lightweight threads and no continuations).

Instead, Bazel uses a fixed-size thread pool.
However, that means that if a node
declares a dependency that is not yet available, we may have to abort that
evaluation and restart it (possibly in another thread) when the dependency
becomes available. This, in turn, means that nodes should not do this
excessively; a node that declares N dependencies serially can potentially be
restarted N times, costing O(N^2) time. Instead, we aim for up-front bulk
declaration of dependencies, which sometimes requires reorganizing the code, or
even splitting a node into multiple nodes, to limit the number of restarts.

Note that this technology is not currently available in the rules API; instead,
the rules API is still defined using the legacy concepts of loading, analysis,
and execution phases. However, a fundamental restriction is that all accesses
to other nodes have to go through the framework so that it can track the
corresponding dependencies. Regardless of the language in which the build
system is implemented or in which the rules are written (they don't have to be
the same), rule authors must not use standard libraries or patterns that bypass
Skyframe. For Java, that means avoiding java.io.File as well as any form of
reflection, and any library that does either. Libraries that support dependency
injection of these low-level interfaces still need to be set up correctly for
Skyframe.

This strongly suggests avoiding exposing rule authors to a full language
runtime in the first place. The danger of accidental use of such APIs is simply
too big: several Bazel bugs in the past were caused by rules using unsafe APIs,
even though the rules were written by the Bazel team, i.e., by the domain
experts.
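The restart pattern can be made concrete with a toy evaluator; everything here (the `Evaluator` class, the `env` callback, the `MissingDeps` signal) is an illustrative model, not Skyframe's real API. A node function asks its environment for dependency values, and if one is missing, the whole function is aborted and rerun once the dependency has been computed.

```python
class MissingDeps(Exception):
    # Signal that a node requested dependencies that are not built yet.
    def __init__(self, keys):
        self.keys = keys

class Evaluator:
    # Toy Skyframe-like evaluator: node functions receive an `env`
    # callback to read dependency values; a missing dependency aborts
    # the node function, which is restarted after the dependency exists.
    def __init__(self, node_fns):
        self.node_fns = node_fns  # key -> function(env) -> value
        self.values = {}
        self.restarts = {}        # key -> restart count, for illustration

    def evaluate(self, key):
        while key not in self.values:
            def env(dep):
                if dep not in self.values:
                    raise MissingDeps([dep])
                return self.values[dep]
            try:
                self.values[key] = self.node_fns[key](env)
            except MissingDeps as e:
                self.restarts[key] = self.restarts.get(key, 0) + 1
                for dep in e.keys:
                    self.evaluate(dep)  # build the missing dependency first
        return self.values[key]
```

A node that requests its N dependencies one at a time is restarted once per missing dependency, which is where the O(N^2) cost comes from; this is why Bazel aims for up-front bulk declaration instead.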
## Intrinsic: Avoiding Quadratic Time and Memory Consumption is Hard

To make matters worse, apart from the requirements imposed by Skyframe, the
historical constraints of using Java, and the outdatedness of the rules API,
accidentally introducing quadratic time or memory consumption is a fundamental
problem in any build system based on library and binary rules. There are two
very common patterns that introduce quadratic memory consumption (and
therefore quadratic time consumption).

1.  Chains of library rules. Consider a chain of library rules: A depends on
    B, B depends on C, and so on. Then, we want to compute some property over
    the transitive closure of these rules, such as the Java runtime classpath
    or the C++ linker command line for each library. Naively, we might take a
    standard list implementation; however, this already introduces quadratic
    memory consumption: the first library contributes one entry to the
    classpath, the second two, the third three, and so on, for a total of
    1+2+3+...+N = O(N^2) entries.

2.  Binary rules depending on the same library rules. Consider a set of binary
    rules that depend on the same library rules; for example, you might have a
    number of test rules that test the same library code. Let's say that out
    of N rules, half are binary rules and the other half are library rules.
    Now consider that each binary makes a copy of some property computed over
    the transitive closure of library rules, such as the Java runtime
    classpath or the C++ linker command line; for example, it could expand the
    command-line string representation of the C++ link action. N/2 copies of
    N/2 elements is O(N^2) memory.

### Custom Collections Classes to Avoid Quadratic Complexity

Bazel is heavily affected by both of these scenarios, so we introduced a set of
custom collection classes that effectively compress the information in memory
by avoiding the copy at each step.
Almost all of these data structures have
set semantics, so we called the class NestedSet. The majority of the changes
made to reduce Bazel's memory consumption over the past several years were
changes to use NestedSet instead of whatever was previously used.

Unfortunately, usage of NestedSet does not automatically solve all the issues;
in particular, even just iterating over a NestedSet in each rule re-introduces
quadratic time consumption. NestedSet also has some helper methods to
facilitate interoperability with normal collections classes; unfortunately,
accidentally passing a NestedSet to one of these methods leads to copying
behavior and reintroduces quadratic memory consumption.
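The sharing idea behind these classes can be sketched as follows. This is a simplified illustrative model in Python, not Bazel's Java implementation: each rule stores only its direct entries plus shared references to its dependencies' sets, so a chain of N libraries takes O(N) memory in total, and flattening is deferred to the one consumer that actually needs the full list.

```python
class NestedSet:
    # Each instance stores its own direct items and *references* to the
    # transitive sets of its dependencies; nothing is copied when a new
    # rule extends the chain, so memory grows by O(1) per rule.
    def __init__(self, direct=(), transitive=()):
        self.direct = tuple(direct)
        self.transitive = tuple(transitive)

    def to_list(self):
        # Flattening walks the whole graph, so it is O(total size).
        # Calling this once per rule is exactly the quadratic trap the
        # text warns about; call it once, at the final consumer.
        seen, out = set(), []
        def visit(ns):
            for child in ns.transitive:
                visit(child)
            for item in ns.direct:
                if item not in seen:
                    seen.add(item)
                    out.append(item)
        visit(self)
        return out

# A chain of library rules: each new library contributes one jar and one
# shared reference to the previous set.
chain = None
for i in range(3):
    chain = NestedSet(direct=["lib%d.jar" % i],
                      transitive=() if chain is None else (chain,))
```

With a naive list, the same chain would store 1+2+3 entries across the three rules; here each rule stores exactly one entry and one pointer, and `chain.to_list()` produces the full classpath only on demand.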