Large-scale software systems have dependencies, and dependencies can (and will) fail for any number of reasons: sometimes due to transient network issues, sometimes because they are simply overloaded. For the former, we, as software engineers, can attempt to mask failures by retrying our requests, increasing the odds that the dependency eventually responds. But should we keep retrying when a dependency falls offline because it is overloaded? When systems are under pressure, our retries exacerbate the problem, potentially prolonging the outage and making it harder for the dependency to recover.
As such, we should design our client software to be good citizens and consider the following before implementing retry logic:
- Backing off retries exponentially
- Setting a maximum timeout on each request
- Adding jitter to each retry
Overview
Earlier this month, millions of cloud computing customers were impacted by an Amazon Web Services (AWS) outage in us-east-1 (Virginia) that lasted several hours. These customers included Alexa, Ring, and Disney Plus; even Amazon's retail side failed to fulfill actual deliveries.
In general, after large-scale events, Amazon publishes a post mortem, a document sharing lessons learned with partners and customers. In this particular write-up, we learn that the outage could have lasted much longer had their (software/network) clients misbehaved:
“Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events.”
– AWS
Retrying Requests
Cost of retries
We mentioned above that systems can mask dependency failures by retrying requests. However, retry logic comes at a cost: it can further burden an already overloaded system.
Exponential Retries, Timeouts, Jitter
“When failures are caused by overload, retries that increase load can make matters significantly worse.” This can lead to what is known as load amplification. Imagine a call chain that is nested five layers deep: if each layer makes five attempts, the bottom layer can receive 5^5 = 3,125 times the original traffic!
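To make the arithmetic concrete, here is a tiny back-of-the-envelope sketch; the `amplification` helper and its parameters are purely illustrative:

```python
def amplification(layers: int, attempts_per_layer: int) -> int:
    """Worst-case number of requests reaching the bottom layer when every
    layer independently makes `attempts_per_layer` attempts per call."""
    return attempts_per_layer ** layers

print(amplification(layers=5, attempts_per_layer=5))  # 3125
```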
Exponential Backoff
One common technique for reducing the amount of wasted work is exponential backoff. For example, you send a request at T0; if it fails, you wait 1 second and retry; if that retry fails, you send another at the 2-second mark, then at 4 seconds, then at 8 seconds, and so on. In short, the retries are staggered, with the wait doubling each time.
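As a rough illustration of the idea (not code from any particular library; `send_request`, `max_attempts`, and `base_delay` are hypothetical names), an exponential backoff loop in Python might look like this:

```python
import time

def call_with_backoff(send_request, max_attempts=5, base_delay=1.0):
    """Retry a failing call, doubling the wait before each new attempt:
    1s, 2s, 4s, 8s, ..."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)
```

Note that if thousands of clients all fail at the same moment, this schedule has them all waking up and retrying at the same moment too, which is exactly what jitter addresses next.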
Jitter
Although exponential backoff is a great start, it is not a silver bullet, especially when many clients (perhaps hundreds or thousands) back off and retry at exactly the same time. For this reason, each client should introduce jitter: a small random delay that adds variance so not all requests arrive at the server at the same moment. For example, if each client adds a random jitter between 0 and 1 second, then one client might send its second request at 1.25 seconds, another at 1.75 seconds, and so on.
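Extending the previous sketch, one simple way to add jitter (an illustrative choice, not prescribed by the AWS write-up) is to tack a uniformly random 0-1 second onto each backoff delay:

```python
import random
import time

def call_with_backoff_and_jitter(send_request, max_attempts=5, base_delay=1.0):
    """Same doubling schedule as before, plus 0-1 second of random jitter
    so a crowd of clients does not retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```

The AWS architecture blog post referenced below explores variations on this theme, such as drawing the entire delay at random rather than only adding a small offset.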
Timeouts
Finally, in addition to exponential backoff and jitter, we should set an upper bound on both 1) the maximum time each request may take and 2) the maximum number of retries. Otherwise, clients can sit there spinning indefinitely, wasting both time and resources.
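Putting backoff, jitter, a retry cap, and a per-request timeout together, a minimal standard-library sketch might look like the following; the URL, the 2-second timeout, and the 4-attempt cap are arbitrary placeholder values:

```python
import random
import time
import urllib.request

MAX_ATTEMPTS = 4        # upper bound on the number of attempts
REQUEST_TIMEOUT = 2.0   # seconds before a single attempt is abandoned

def fetch(url):
    for attempt in range(MAX_ATTEMPTS):
        try:
            # timeout= bounds how long one attempt may block on the socket
            with urllib.request.urlopen(url, timeout=REQUEST_TIMEOUT) as resp:
                return resp.read()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # give up and percolate the failure upward
            time.sleep(2 ** attempt + random.uniform(0, 1))
```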
Example: HubSpot Python binding (no timeout)
You may be surprised to find that a library you use implements none of the above techniques. While libraries often allow the caller (you) to supply timeout values, they frequently default to no timeout at all, giving the endpoint an unbounded amount of time to serve a request. Take HubSpot's Python binding, which lets you interact with HubSpot's API: if you step through the source code, you'll find that the underlying socket ultimately has no timeout configured. That means if the HubSpot API slows to a crawl, your client application can potentially sit there indefinitely.
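If a client library exposes no timeout knob at all, one blunt workaround (a process-wide setting assumed here, not something HubSpot's client documents) is Python's default socket timeout, described in the socket documentation referenced below:

```python
import socket

# Any socket created after this call without an explicit timeout of its own,
# including sockets opened deep inside third-party clients, will raise a
# timeout error after 10 seconds instead of blocking forever.
# The 10-second value here is an arbitrary example.
socket.setdefaulttimeout(10.0)
```

Whether a given library actually honors this depends on whether it later overrides the socket's timeout, so it is worth checking the library's source before relying on it.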
Summary
Whether your client libraries interact with your own backend or a third-party service, you want them to behave like good citizens, especially when the systems they depend on are overloaded. By leveraging exponential backoff, random jitter, and maximum timeout settings, your clients can balance masking failures against generating unnecessary work. And by setting maximum timeout values, your client-side libraries and applications can percolate timeouts up to the user instead of sitting there spinning indefinitely.
References
- Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region – https://aws.amazon.com/message/12721/
- Timeouts, retries and backoff with jitter – https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/?did=ba_card&trk=ba_card
- Exponential Backoff And Jitter | AWS Architecture Blog – https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- socket — Low-level networking interface — Python 3.10.1 documentation – https://docs.python.org/3/library/socket.html?highlight=socket#socket.getdefaulttimeout
- setsockopt(3): set socket options – Linux man page – https://linux.die.net/man/3/setsockopt
- GitHub – HubSpot/hubspot-api-python: HubSpot API Python Client Libraries for V3 version of the API – https://github.com/HubSpot/hubspot-api-python