-rw-r--r-- | doc/connection-backoff.md | 57 |
1 files changed, 19 insertions, 38 deletions
diff --git a/doc/connection-backoff.md b/doc/connection-backoff.md
index 47b71f927b..7094e737c5 100644
--- a/doc/connection-backoff.md
+++ b/doc/connection-backoff.md
@@ -8,58 +8,39 @@ requests) and instead do some form of exponential backoff.
 We have several parameters:
  1. INITIAL_BACKOFF (how long to wait after the first failure before retrying)
  2. MULTIPLIER (factor with which to multiply backoff after a failed retry)
- 3. MAX_BACKOFF (Upper bound on backoff)
- 4. MIN_CONNECTION_TIMEOUT
+ 3. MAX_BACKOFF (upper bound on backoff)
+ 4. MIN_CONNECT_TIMEOUT (minimum time we're willing to give a connection to
+    complete)
 
 ## Proposed Backoff Algorithm
 
 Exponentially back off the start time of connection attempts up to a limit of
-MAX_BACKOFF.
+MAX_BACKOFF, with jitter.
 
 ```
 ConnectWithBackoff()
   current_backoff = INITIAL_BACKOFF
   current_deadline = now() + INITIAL_BACKOFF
-  while (TryConnect(Max(current_deadline, MIN_CONNECT_TIMEOUT))
+  while (TryConnect(Max(current_deadline, now() + MIN_CONNECT_TIMEOUT))
          != SUCCESS)
     SleepUntil(current_deadline)
     current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
-    current_deadline = now() + current_backoff
-```
-
-## Historical Algorithm in Stubby
-
-Exponentially increase up to a limit of MAX_BACKOFF the intervals between
-connection attempts. This is what stubby 2 uses, and is equivalent if
-TryConnect() fails instantly.
+    current_deadline = now() + current_backoff +
+      UniformRandom(-JITTER * current_backoff, JITTER * current_backoff)
 
 ```
-LegacyConnectWithBackoff()
-  current_backoff = INITIAL_BACKOFF
-  while (TryConnect(MIN_CONNECT_TIMEOUT) != SUCCESS)
-    SleepFor(current_backoff)
-    current_backoff = Min(current_backoff * MULTIPLIER, MAX_BACKOFF)
-```
-
-The grpc C implementation currently uses this approach with an initial backoff
-of 1 second, multiplier of 2, and maximum backoff of 120 seconds. (This will
-change)
 
-Stubby, or at least rpc2, uses exactly this algorithm with an initial backoff
-of 1 second, multiplier of 1.2, and a maximum backoff of 120 seconds.
+With specific parameters of
+MIN_CONNECT_TIMEOUT = 20 seconds
+INITIAL_BACKOFF = 1 second
+MULTIPLIER = 1.6
+MAX_BACKOFF = 120 seconds
+JITTER = 0.2
 
-## Use Cases to Consider
+Implementations with pressing concerns (such as minimizing the number of wakeups
+on a mobile phone) may wish to use a different algorithm, and in particular
+different jitter logic.
 
-* Client tries to connect to a server which is down for multiple hours, eg for
-  maintenance
-* Client tries to connect to a server which is overloaded
-* User is bringing up both a client and a server at the same time
-  * In particular, we would like to avoid a large unnecessary delay if the
-    client connects to a server which is about to come up
-* Client/server are misconfigured such that connection attempts always fail
-  * We want to make sure these don’t put too much load on the server by
-    default.
-* Server is overloaded and wants to transiently make clients back off
-* Application has out of band reason to believe a server is back
-  * We should consider an out of band mechanism for the client to hint that
-    we should short circuit the backoff.
+Alternate implementations must ensure that connection backoffs started at the
+same time disperse, and must not attempt connections substantially more often
+than the above algorithm.
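
For readers who want to experiment with the new `ConnectWithBackoff()` pseudocode, here is a minimal Python sketch of the same loop using the parameter values added by this change. It is an illustration under stated assumptions, not the gRPC implementation: `try_connect` is a hypothetical callable that attempts one connection and gives up at an absolute deadline, and the `now`, `sleep`, and `rng` arguments are injection points invented here so the loop can be driven by a fake clock.

```python
import random
import time

# Parameter values taken from the "+" lines of the diff (all in seconds,
# except MULTIPLIER and JITTER, which are unitless).
MIN_CONNECT_TIMEOUT = 20.0
INITIAL_BACKOFF = 1.0
MULTIPLIER = 1.6
MAX_BACKOFF = 120.0
JITTER = 0.2


def next_backoff(current_backoff):
    """Grow the backoff geometrically, capped at MAX_BACKOFF."""
    return min(current_backoff * MULTIPLIER, MAX_BACKOFF)


def jittered(backoff, rng=random):
    """Add a uniform offset in [-JITTER * backoff, +JITTER * backoff]."""
    return backoff + rng.uniform(-JITTER * backoff, JITTER * backoff)


def connect_with_backoff(try_connect, now=time.monotonic,
                         sleep=time.sleep, rng=random):
    """Call try_connect(deadline) until it returns True.

    try_connect is a hypothetical callable (an assumption of this sketch):
    it makes one connection attempt, abandons it at the absolute time
    `deadline`, and returns True on success.
    Returns the number of attempts made.
    """
    current_backoff = INITIAL_BACKOFF
    current_deadline = now() + INITIAL_BACKOFF
    attempts = 1
    # Each attempt gets at least MIN_CONNECT_TIMEOUT to complete.
    while not try_connect(max(current_deadline, now() + MIN_CONNECT_TIMEOUT)):
        # SleepUntil(current_deadline): wait only if the deadline is still ahead.
        sleep(max(0.0, current_deadline - now()))
        current_backoff = next_backoff(current_backoff)
        current_deadline = now() + jittered(current_backoff, rng)
        attempts += 1
    return attempts
```

Note that the jitter is applied to the deadline only, so `current_backoff` itself still grows as 1, 1.6, 2.56, ... up to the 120 second cap. Passing a seeded `random.Random` instance as `rng` makes a run reproducible.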