Exponential Backoff

Summary: A common failure in distributed systems is a server with a rate limit or with no limit but begins failing due to load. A standard solution is to retry after waiting a small time, increasing that time after each failure. We create a macro to handle this waiting and retrying.

A few days ago I wrote about a high-level way of handing intermittent errors, particularly in a distributed system. The way was simplistic: when you get an error, try again, up to a few errors. A slightly more nuanced approach is to back off before you try again. Each time there's an error, you wait longer, until some maximum time is reached.

The problem

Let's say you're hitting a service with a rate limit. That rate limit could be enforced or implicit^1. You've got lots of computers hitting it, and it's impossible to coordinate. No matter how hard you try to keep under that rate limit (and you should try), you will eventually break the limit. Retrying immediately when the server is too busy will actually make the problem worse. You will give it yet another request to deny. At the same time, it might be hard to distinguish "I'm too busy right now" from "I'm never going to recover".

The solution

I don't know what it's really called. I call it Exponential Backoff. It's also easy to turn into a separate routine:

    (defn exponential-backoff [time rate max f]
      (if (>= time max) ;; we're over budget, just call f
        (f)
        (try
          (f)
          (catch Throwable t
            (Thread/sleep time)
            (exponential-backoff (* time rate) rate max f)))))

This one has the same structure as try-n-times but will sleep before recursing. When it recurses, the time is multiplied by the rate. And when the last wait is more than the max, it will try one more time. Failures from that last try will propagate.

How to use it

Same as with try-n-times:

    (exponential-backoff 1000 2 10000
      #(http/get "http://rate-limited.com/resource"
                 {:socket-timeout 1000
                  :conn-timeout   1000}))

This will retry after waiting 1 second (1000 ms) the first time, then double it (the 2) each time. When it waits 10 seconds, it won't retry any more.

Slightly more useful

Ok, so I don't use this exactly. What I use is slightly more complicated. I've found that I often can tell if it's a rate limiting problem if I look at the exception. So, let's pass it a predicate to check.

    (defn exponential-backoff [time rate max p? f]
      (if (>= time max) ;; we're over budget, just call f
        (f)
        (try
          (f)
          (catch Throwable t
            (if (p? t)
              (do
                (Thread/sleep time)
                (exponential-backoff (* time rate) rate max p? f))
              (throw t))))))

This one only recurses if the predicate returns true on the exception. Let's service mentions "queue capacity" in the body of the HTTP response when it's too busy:

    (exponential-backoff 1000 2 10000
      (fn [t] ;; the predicate
        (and (instance? clojure.lang.ExceptionInfo t)
             (re-find #"queue capacity" (:error (ex-data t)))))
      #(http/get "http://rate-limited.com/resource"
                 {:socket-timeout 1000
                  :conn-timeout   1000}))

You can be more selective about your backoff.

A Macro

Well, here's an example macro. It's got a bunch of defaults.

    (defmacro try-backoff [[time rate max p?] & body]
      `(exponential-backoff (or ~time 1000) ;; defaults!
                            (or ~rate 2)
                            (or ~max 10000)
                            (or ~p? (constantly true))
                            (fn [] ~@body)))

Here's how you use it:

    (try-backoff []
      (println "trying!")
      (do-some-stuff))

Also, add it to your Clojure Emacs config for better formatting, because this one wants the args on the firs t line:

    (put-clojure-indent 'try-backoff 1)

This tells Emacs to make the second argument ((println "trying!")) one indentation in, instead of directly under the first ([]).

Warning

All of the try3 warnings apply. The stuff you're doing inside needs to be idempotent!

Conclusion

This pattern is another cool, reusable component to help build reliability into a distributed system. Small, intermittent failures are pervasive. And a common form of error is a server being too busy. Being able to handle this type of error quickly and systematically is going to make your life easier.

Though Clojure does not have specific solutions to distributed systems problems, coding them up is short and straightforward. If you're interested in learning Clojure, I suggest you check out LispCast Introduction to Clojure. It's a video course that uses animation, storytelling, and exercises to install Clojure into your brain.


  1. meaning the server can only handle so many jobs at once, and the behavior is undefined at that point