Improve analysis of the number of extra `==` operations, which was overly complicated, slightly incorrect.

The old analysis viewed it as birthday attack, which asks how often there are multiple values in the probe same probe sequence with the same H2. In their own words, this analysis "breaks down" at around `n = 12`. Instead we can answer a simpler question, which is, if a probe sequence examines `k` objects that are not what we are looking for, what's the number of calls `==`? The expectation is simply `k/128`. PiperOrigin-RevId: 446518063 Change-Id: Ie879bd4f6c97979822bc9d550b9e2503b1418c78
author: Abseil Team <absl-team@google.com> 2022-05-04 12:04:37 -0700
committer: Copybara-Service <copybara-worker@google.com> 2022-05-04 12:05:20 -0700
commit: 384a25d5e19228ceb7641676aefd58c4ca7a9568 (patch)
tree: d0757888b1415bd2d3bd3f0ea8c027323fe26b2d /absl/container/internal/raw_hash_set.h
parent: 02934b492135d7c9c40fecf8e7873380b46c2223 (diff)
1 files changed, 27 insertions, 31 deletions
diff --git a/absl/container/internal/raw_hash_set.h b/absl/container/internal/raw_hash_set.h
index 046a6939..b69a77f0 100644
--- a/absl/container/internal/raw_hash_set.h
+++ b/absl/container/internal/raw_hash_set.h
@@ -95,14 +95,21 @@
 // Storing control bytes in a separate array also has beneficial cache effects,
 // since more logical slots will fit into a cache line.
 //
+// # Hashing
+//
+// We compute two separate hashes, `H1` and `H2`, from the hash of an object.
+// `H1(hash(x))` is an index into `slots`, and essentially the starting point
+// for the probe sequence. `H2(hash(x))` is a 7-bit value used to filter out
+// objects that cannot possibly be the one we are looking for.
+//
 // # Table operations.
 //
-// The key operations are `insert`, `find`, and `erase_at`; the operations below
+// The key operations are `insert`, `find`, and `erase`.
 //
-// `insert` and `erase` are implemented in terms of find, so we describe that
-// one first. To `find` a value `x`, we compute `hash(x)`. From `H1(hash(x))`
-// and the capacity, we construct a `probe_seq` that visits every group of
-// slots in some interesting order.
+// Since `insert` and `erase` are implemented in terms of `find`, we describe
+// `find` first. To `find` a value `x`, we compute `hash(x)`. From
+// `H1(hash(x))` and the capacity, we construct a `probe_seq` that visits every
+// group of slots in some interesting order.
 //
 // We now walk through these indices. At each index, we select the entire group
 // starting with that index and extract potential candidates: occupied slots
@@ -112,32 +119,21 @@
 // next probe index. Tombstones effectively behave like full slots that never
 // match the value we're looking for.
 //
-// The `H2` bits ensure that if we perform a ==, a false positive is very very
-// rare (assuming the hash function looks enough like a random oracle). To see
-// this, note that in a group, there will be at most 8 or 16 `H2` values, but
-// an `H2` can be any one of 128 values. Viewed as a birthday attack, we can use
-// the rule of thumb that the probability of a collision among n choices of m
-// symbols is `p(n, m) ~ n^2/2m. In table form:
-//
-//  n |  p(n) |  n |  p(n)
-//  0 | 0.000 |  8 | 0.250
-//  1 | 0.004 |  9 | 0.316
-//  2 | 0.016 | 10 | 0.391
-//  3 | 0.035 | 11 | 0.473
-//  4 | 0.062 | 12 | 0.562
-//  5 | 0.098 | 13 | 0.660
-//  6 | 0.141 | 14 | 0.766
-//  7 | 0.191 | 15 | 0.879
-//
-// The rule of thumb breaks down at around `n = 12`, but such groups would only
-// occur for tables close to their max load factor. This is far better than an
-// ordinary open-addressing table, which needs to perform an == at every step of
-// the probe sequence. These probabilities don't tell the full story (for
-// example, because elements are inserted into a group from the front, and
-// candidates are =='ed from the front, the collision is only effective in
-// rare cases e.g. another probe sequence inserted into a deleted slot in front
-// of us).
-//
+// The `H2` bits ensure when we compare a slot to an object with `==`, we are
+// likely to have actually found the object.  That is, the chance is low that
+// `==` is called and returns `false`.  Thus, when we search for an object, we
+// are unlikely to call `==` many times.  This likelyhood can be analyzed as
+// follows (assuming that H2 is a random enough hash function).
+//
+// Let's assume that there are `k` "wrong" objects that must be examined in a
+// probe sequence.  For example, when doing a `find` on an object that is in the
+// table, `k` is the number of objects between the start of the probe sequence
+// and the final found object (not including the final found object).  The
+// expected number of objects with an H2 match is then `k/128`.  Measurements
+// and analysis indicate that even at high load factors, `k` is less than 32,
+// meaning that the number of "false positive" comparisons we must perform is
+// less than 1/8 per `find`.
+
 // `insert` is implemented in terms of `unchecked_insert`, which inserts a
 // value presumed to not be in the table (violating this requirement will cause
 // the table to behave erratically). Given `x` and its hash `hash(x)`, to insert
author	Abseil Team <absl-team@google.com>	2022-05-04 12:04:37 -0700
committer	Copybara-Service <copybara-worker@google.com>	2022-05-04 12:05:20 -0700
commit	384a25d5e19228ceb7641676aefd58c4ca7a9568 (patch)
tree	d0757888b1415bd2d3bd3f0ea8c027323fe26b2d /absl/container/internal/raw_hash_set.h
parent	02934b492135d7c9c40fecf8e7873380b46c2223 (diff)