summaryrefslogtreecommitdiff
path: root/absl/container/internal/raw_hash_set.h
diff options
context:
space:
mode:
Diffstat (limited to 'absl/container/internal/raw_hash_set.h')
-rw-r--r--absl/container/internal/raw_hash_set.h58
1 files changed, 27 insertions, 31 deletions
diff --git a/absl/container/internal/raw_hash_set.h b/absl/container/internal/raw_hash_set.h
index 046a6939..b69a77f0 100644
--- a/absl/container/internal/raw_hash_set.h
+++ b/absl/container/internal/raw_hash_set.h
@@ -95,14 +95,21 @@
// Storing control bytes in a separate array also has beneficial cache effects,
// since more logical slots will fit into a cache line.
//
+// # Hashing
+//
+// We compute two separate hashes, `H1` and `H2`, from the hash of an object.
+// `H1(hash(x))` is an index into `slots`, and essentially the starting point
+// for the probe sequence. `H2(hash(x))` is a 7-bit value used to filter out
+// objects that cannot possibly be the one we are looking for.
+//
// # Table operations.
//
-// The key operations are `insert`, `find`, and `erase_at`; the operations below
+// The key operations are `insert`, `find`, and `erase`.
//
-// `insert` and `erase` are implemented in terms of find, so we describe that
-// one first. To `find` a value `x`, we compute `hash(x)`. From `H1(hash(x))`
-// and the capacity, we construct a `probe_seq` that visits every group of
-// slots in some interesting order.
+// Since `insert` and `erase` are implemented in terms of `find`, we describe
+// `find` first. To `find` a value `x`, we compute `hash(x)`. From
+// `H1(hash(x))` and the capacity, we construct a `probe_seq` that visits every
+// group of slots in some interesting order.
//
// We now walk through these indices. At each index, we select the entire group
// starting with that index and extract potential candidates: occupied slots
@@ -112,32 +119,21 @@
// next probe index. Tombstones effectively behave like full slots that never
// match the value we're looking for.
//
-// The `H2` bits ensure that if we perform a ==, a false positive is very very
-// rare (assuming the hash function looks enough like a random oracle). To see
-// this, note that in a group, there will be at most 8 or 16 `H2` values, but
-// an `H2` can be any one of 128 values. Viewed as a birthday attack, we can use
-// the rule of thumb that the probability of a collision among n choices of m
-// symbols is `p(n, m) ~ n^2/2m. In table form:
-//
-// n | p(n) | n | p(n)
-// 0 | 0.000 | 8 | 0.250
-// 1 | 0.004 | 9 | 0.316
-// 2 | 0.016 | 10 | 0.391
-// 3 | 0.035 | 11 | 0.473
-// 4 | 0.062 | 12 | 0.562
-// 5 | 0.098 | 13 | 0.660
-// 6 | 0.141 | 14 | 0.766
-// 7 | 0.191 | 15 | 0.879
-//
-// The rule of thumb breaks down at around `n = 12`, but such groups would only
-// occur for tables close to their max load factor. This is far better than an
-// ordinary open-addressing table, which needs to perform an == at every step of
-// the probe sequence. These probabilities don't tell the full story (for
-// example, because elements are inserted into a group from the front, and
-// candidates are =='ed from the front, the collision is only effective in
-// rare cases e.g. another probe sequence inserted into a deleted slot in front
-// of us).
-//
+// The `H2` bits ensure when we compare a slot to an object with `==`, we are
+// likely to have actually found the object. That is, the chance is low that
+// `==` is called and returns `false`. Thus, when we search for an object, we
+// are unlikely to call `==` many times. This likelyhood can be analyzed as
+// follows (assuming that H2 is a random enough hash function).
+//
+// Let's assume that there are `k` "wrong" objects that must be examined in a
+// probe sequence. For example, when doing a `find` on an object that is in the
+// table, `k` is the number of objects between the start of the probe sequence
+// and the final found object (not including the final found object). The
+// expected number of objects with an H2 match is then `k/128`. Measurements
+// and analysis indicate that even at high load factors, `k` is less than 32,
+// meaning that the number of "false positive" comparisons we must perform is
+// less than 1/8 per `find`.
+
// `insert` is implemented in terms of `unchecked_insert`, which inserts a
// value presumed to not be in the table (violating this requirement will cause
// the table to behave erratically). Given `x` and its hash `hash(x)`, to insert