How to Retry with Class

Highly distributed applications that consist of lots of small services talking among themselves are getting more and more popular, and that, in my opinion, is a good thing. But this architectural style brings with it a new class of problems that are less common in monolithic applications. Consider what happens when a service needs to send a request to another service, and this second service happens to be temporarily offline, or too busy to respond. If one little service goes offline at the wrong time, that can create a domino effect that can, potentially, take your entire application down.

In this article I'm going to show you techniques that can give your application some degree of tolerance for failures in dependent services. The basic concept is simple: we make the assumption that in most cases these failures are transient, so then when an operation fails, we just repeat it a few times, until it hopefully succeeds. Sounds easy, right? But as with most things, the devil is in the details, so keep reading if you want to learn how to implement a robust retry strategy.

For the purposes of this article, let's assume we have a distributed application built with microservices, where each microservice exposes a REST API. One of the microservices provides user services to the rest of the system, and one of the important functions this microservice performs is to validate the tokens sent by clients along with their requests. Any microservice that accepts requests from external clients needs to pass the tokens it receives to the user service at the /users/me URL so that they are validated. When a token is determined to be valid, the user service returns the user resource that owns that token to the caller. When the token cannot be validated, an appropriate HTTP response is returned, most likely a 401 response to indicate that the token is invalid.

The following Python function is a simple wrapper that makes the request to the user service using requests:

import requests

# USER_SERVICE_URL is the base URL of the user service, defined elsewhere

def get_user_from_token(token):
    """Authenticate the user. Raises HTTPError on error."""
    r = requests.get(USER_SERVICE_URL + '/users/me',
                     headers={'Authorization': 'Bearer ' + token})
    r.raise_for_status()
    return r.json()['user']

Microservices that need to validate tokens can simply invoke the function above to get the user that corresponds to the token they were given. If the token is invalid, then there's going to be an exception raised, and that will prevent the request from running. If you were to write your microservices in Flask (an excellent choice, if I'm allowed to say), you could put the token validation in a before_request handler:

@app.before_request
def before_request():
    auth = request.headers.get('Authorization', '').split()
    if len(auth) != 2 or auth[0] != 'Bearer':
        abort(401)
    g.user = get_user_from_token(auth[1])

Here I extract the token from the Authorization header, and if the token gets validated, I leave the user it belongs to stored in g.user, so that the API endpoint can have access to it.
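
To make that last part concrete, here is a minimal sketch of an endpoint that relies on the handler above. The /api/profile route is made up just for illustration, and I'm assuming the user resource returned by the user service is JSON-serializable:

from flask import jsonify

@app.route('/api/profile')
def profile():
    # g.user was stored by the before_request handler above
    return jsonify(g.user)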

The Naive Approach to Retries

So let's now consider what happens when the user service needs to be upgraded to a new release. Because this is a distributed system, the user service can be upgraded individually, while the other microservices continue to run normally. Let's assume it takes about ten seconds for the user service to stop, load new code, make any necessary database upgrades and restart. Ten seconds is a pretty low upgrade downtime by most standards, but still, unless specific measures are taken, the entire system will be unable to validate tokens and thus will be rejecting all requests that are sent by clients during those ten seconds that the upgrade procedure lasts.

Luckily, there is a much better solution. Instead of failing requests that cannot be authenticated, the server can stall those requests for some time, hoping that whatever internal situation caused the user service to fail gets resolved soon. This is actually the best solution for both sides. For the client there is no visible failure, just a slight delay. And for us on the server side, we are still able to upgrade or perform maintenance on services as needed, without fear of affecting clients that are actively using our services.

So how do we make the application wait for a service that is not responding? It's really simple: if we get a failure from the service, we literally wait, or sleep, for a bit, and then repeat the request. We can repeat the request several times, hoping it will eventually succeed. The retry logic can be incorporated into the get_user_from_token() function:

import time
import requests

# these are the HTTP status codes that we are going to retry
# 429 - too many requests (rate limited)
# 502 - bad gateway
# 503 - service unavailable
RETRY_CODES = [429, 502, 503]

def _get_user_from_token(token):
    r = requests.get(USER_SERVICE_URL + '/users/me',
                     headers={'Authorization': 'Bearer ' + token})
    r.raise_for_status()
    return r.json()['user']

def get_user_from_token(token):
    """Authenticate the user. Raises HTTPError on error."""
    # run the request, and catch HTTP errors
    try:
        return _get_user_from_token(token)
    except requests.HTTPError as exc:
        # if the error is not retryable, re-raise the exception
        if exc.response.status_code not in RETRY_CODES:
            raise
        # save the error, since Python clears "exc" when this block ends
        error = exc
    # retry the request up to 10 times
    for attempt in range(10):
        time.sleep(1)  # wait a bit between retries
        try:
            return _get_user_from_token(token)
        except requests.HTTPError as exc:
            # once again, if the error is not retryable, re-raise
            # else, stay in the loop
            if exc.response.status_code not in RETRY_CODES:
                raise
    # if we got out of the loop that means all retries failed
    # in that case we give up, and re-raise the original error
    raise error

In this second version, I moved the actual request logic into a _get_user_from_token() auxiliary function (note the underscore prefix), and then in the main get_user_from_token() function I implemented a retry loop that will re-issue the authentication request up to 10 times if the initial request failed, waiting a second between attempts. If any of the retries succeed, then the caller of the function will not even know there were failures, which is great, as it keeps the handling of the error localized to this function.

The comments in the code should help you understand all the details, but I think it is interesting to note that I'm not blindly retrying all errors; instead I selectively retry requests that come back with just a few whitelisted status codes. REST APIs return status codes that are indicative of all sorts of different results, and some of those do not really make sense to retry. Codes in the 200-299 range are all success codes, the 300-399 codes are redirects, 400-499 are client errors, and 500-599 are server errors. From all of these, I selected the three that I consider likely to succeed on a retry. I really do not want to waste time retrying errors that have no hope of ever succeeding. In this example I'm going to retry the status codes 429 (which results from rate limiting), and the 502 and 503 pair, which are both common responses from proxy servers when the target service is offline (for example while undergoing an upgrade). Obviously the errors that deserve a retry can vary from application to application, so these need to be evaluated for each project.

With this improvement, our code is much more tolerant of failures in the dependent service, and we've achieved it with a single, localized change. The rest of the application, and most importantly our clients, are completely unaware of the retry logic that transparently benefits them.

You may think we have reached a happy ending for this article, but in fact, there are many ways to improve the retry mechanism I just presented. We are just getting started!

Retry Churn

Let's say this application we are adding retries to is a fairly large application, with lots of clients. For the sake of an example, let's assume that on average, the user service receives 100 requests per second, but has been provisioned to be able to handle up to 200 if needed. Using the code from the previous section, let's simulate what the ten seconds of downtime look like from the point of view of the number of requests flying through the system, and how many of them succeed and fail:

[Chart: requests per second during the ten second outage, using fixed one second retries]

In this chart, blue indicates successful requests, while red indicates failures which will need to be retried later. The service went offline for ten seconds starting at the zero time mark, so between 0 and 10 all requests are red. As you can see, this looks pretty bad. By the time the service was getting ready to come back online, the requests and retries had snowballed into an amazing 1000 per second, and new requests kept coming at the normal rate, so in the first second after coming back online the service had 1100 requests in its queue. That caused it to be pegged at its maximum of 200 requests per second for another ten seconds to catch up with the backlog, while constantly receiving new requests at the average 100 per second.
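
If you want to reproduce these numbers yourself, here is a rough toy model of the fixed one second retry loop. This is not the simulation script I used for the charts (that one is linked in the conclusion); the arrival rate, capacity and downtime constants below are just the numbers assumed in this section:

ARRIVAL_RATE = 100       # new client requests per second
CAPACITY = 200           # requests the service can handle per second
DOWNTIME = range(0, 10)  # seconds during which the service is offline

def simulate(total_seconds=25, retry_delay=1):
    retries = {}  # maps a second to the number of retries scheduled for it
    for now in range(total_seconds):
        demand = ARRIVAL_RATE + retries.pop(now, 0)
        served = 0 if now in DOWNTIME else min(demand, CAPACITY)
        failed = demand - served
        if failed:
            # failed requests come back after the fixed retry delay
            when = now + retry_delay
            retries[when] = retries.get(when, 0) + failed
        print('{:2d}s  demand={:5d}  served={:3d}  failed={:5d}'.format(
            now, demand, served, failed))

simulate()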

So while having a retry strategy helps with the robustness of the application, our current solution leaves a lot to be desired in terms of abuse of our limited resources. In the next section we'll look at a different retry strategy, which has the cool sounding name of exponential backoff.

The Exponential Backoff Algorithm

An obvious solution to the problem of having too many retries is to not be as aggressive with them. I could change the sleep statement from one second to, say, five seconds, and that will drastically reduce the request traffic during the upgrade. But while five seconds might be reasonable for our fictional user service, it may still be too much for another service that has longer downtimes. Basically I would be forced to fine-tune the retry loop individually for each service, according to what I know about how long and how often the service is expected to be unable to answer requests. As you can imagine, this can be hard to get right for every service.

The alternative solution that is typically used is based on an algorithm called Exponential Backoff. The idea is that the sleep amount for each successive retry is increased by some factor, so that the longer the target service is offline, the more spaced out the retries get. Going back to our example user service, when the service fails to respond, I can sleep for a second like I did before, but if the retry fails, then I sleep for two seconds before a new attempt. And if I get yet another failure, then I sleep four seconds before the next one, and so on.

A formula commonly used to calculate the sleep time for a given attempt is the following:

sleep_time = (2 ^ attempt_number) * base_sleep_time

The 2 in this formula is the factor, and it can be changed to another number if necessary, so for example, using a 3 will cause the sleep time to triple with each retry instead of doubling. Another variant seen in some implementations is to add a maximum sleep time, to make sure delays do not get too long when there are several retries:

sleep_time = min((2 ^ attempt_number) * base_sleep_time, maximum_sleep_time)

To add exponential backoff to the retry loop I presented earlier, I just need to use one of the formulas above to calculate the sleep time:

base_sleep_time = 1

def get_user_from_token(token):
    # ...
    for attempt in range(10):
        time.sleep(pow(2, attempt) * base_sleep_time)
        # ...
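
If you prefer the capped variant of the formula, the change is equally small. The sketch below applies the min() from the second formula; the 30 second maximum_sleep_time is an arbitrary value I picked just for illustration:

base_sleep_time = 1
maximum_sleep_time = 30  # arbitrary cap, tune per service

def get_user_from_token(token):
    # ...
    for attempt in range(10):
        time.sleep(min(pow(2, attempt) * base_sleep_time,
                       maximum_sleep_time))
        # ...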

Let's see how the chart from the previous section changes if we implement exponential backoff retries:

[Chart: requests per second during the same outage, using exponential backoff retries]

And this looks much better. At the worst time, the request backlog gets to 400 requests, a big difference compared to the previous 1100. It's also interesting to note that while in the first case the service recovered at around the 20 second mark, in this case it is less clear when the recovery is complete, since the growing retry intervals spread the retries over a longer period of time. But we can clearly see that with fixed retries the service had to run at its maximum capacity for 10 seconds to catch up, while with exponential backoff the service started to get some air to breathe right around the 18 second mark, two seconds earlier.

So using exponential backoff is overall better, even though some requests will take longer to complete. But did you notice how blocky the chart is? That is a common problem with this algorithm: it tends to group retries around certain times instead of spreading them out evenly. This is even more noticeable at the 14 second mark, where the number of requests handled dipped a bit below the 200 cap because there were not enough retries at that time to take advantage of all the available resources.

Adding Some Jitter

A very good option to help distribute retries better is to add some randomness to the sleep times. One common solution is to add a random component to the sleep time determined by the exponential backoff algorithm. In the following example, the backoff time is randomly increased by up to 25%.

from random import random

base_sleep_time = 1

def get_user_from_token(token):
    # ...
    for attempt in range(10):
        time.sleep(pow(2, attempt) * base_sleep_time * (1 + random() / 4))
        # ...

You can see in the chart below that some of the blockiness of the exponential backoff is, in fact, smoothed out with this technique:

[Chart: requests per second with exponential backoff plus up to 25% random jitter]

Another, even simpler approach is to use the sleep time obtained from the backoff algorithm as a maximum, and randomize the sleep between zero and that time:

from random import random

base_sleep_time = 1

def get_user_from_token(token):
    # ...
    for attempt in range(10):
        time.sleep(pow(2, attempt) * base_sleep_time * random())
        # ...

This may seem counterintuitive, but as you can see in the chart below, this technique makes for even smoother curves, at the cost of some increase in the request accumulation during the downtime:

[Chart: requests per second with exponential backoff and fully randomized sleep times]

Which of the two sleep randomizer functions is better? That is really not easy to say, and in fact, choosing between just these two options isn't even fair, since there are many more ways to randomize the sleep time, which are also likely to yield relatively similar curves.
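
To give just one example of those other ways, a variant sometimes called "decorrelated jitter" (described in a well known AWS engineering blog post on backoff and jitter, not something covered above) bases each sleep time on the previous one instead of on the attempt number. A rough sketch of how that could look in our retry loop:

from random import uniform

base_sleep_time = 1
maximum_sleep_time = 30  # arbitrary cap, as before

def get_user_from_token(token):
    # ...
    sleep_time = base_sleep_time
    for attempt in range(10):
        time.sleep(sleep_time)
        # each new sleep is a random value between the base time and
        # three times the previous sleep, capped at the maximum
        sleep_time = min(uniform(base_sleep_time, sleep_time * 3),
                         maximum_sleep_time)
        # ...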

Conclusion

I hope you liked this discussion on retries and found it interesting. If you are wondering how I generated the data for the charts that I presented in this article, here is a small Python script I wrote that runs simulations with the different algorithms: https://gist.github.com/miguelgrinberg/ec97989c0569a873a3dca95882cae196. You are welcome to play with it and find other possible retry solutions.

Do you have other methods to deal with retries? Let me know below in the comments!

8 comments
  • #1 TomS said

    Thank you for your post! Just in time:)! I have a question - maybe you could help. I've been struggling with somehow similar case:

    My Flask webapp (2 processes and 5 threads) within one application delivers a few microservices (one microservice checks external resource using 'requests'). In case of temporary unavailability (when external resource is not available), just like in the example, loop with time.sleep is used. But it means that this whole webapp is blocked when number of clients > 10 (because time.sleep blocks execution of the code, right?). Would the pattern from the post be applicable in mentioned case? Or different approach should be considered to avoid blocks (and handle [of course in case of unavailability -> without success] as many clients as possible without timeouts)?

    Thank you!

  • #2 Miguel Grinberg said

    @TomS: If you have this external resource that can be a sort of bottleneck for your entire application when it goes offline, then I'd say it is problematic to contact it while handling a request from a client. A better solution is to make your public facing APIs asynchronous, so that they don't block waiting for this dependency to respond. If the client sends an asynchronous request that returns immediately, you can then put the code that contacts this external resource in a background task, and then you are free to implement the retry solutions presented in this article.

  • #3 Pete Forman said

    I had occasion to implement a retry strategy. In my case it was because the resources I wanted were via an ad hoc tunnel which took a few seconds to set up with no easy way to monitor when it was ready. I started rolling my own retries before coming across exactly what I needed in requests.

    import requests
    from requests.adapters import HTTPAdapter
    from requests.packages.urllib3.util.retry import Retry

    [...]
    self.session = requests.Session()
    retries = Retry(connect=8, backoff_factor=0.125) # 32 s total
    self.session.mount('http://', HTTPAdapter(max_retries=retries))

  • #4 TomS said

    @Miguel Grinberg: Ok, thank you for your answer. So the example ( https://www.youtube.com/watch?v=tdIIJuPh3SI&feature=youtu.be&t=6427 ) from your presentation to handle asynchronous request using Celery (with retry solutions inside the task) would be proper approach?
    Thank you!

  • #5 Miguel Grinberg said

    @Pete: Right, this isn't actually requests, but one of its dependencies, urllib3. I have used it as well, and also recommend it. The Retry class has many options; for example, it allows you to selectively retry certain status codes, as I do with my own implementation in this article. Documentation: http://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.retry.Retry.
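
    For example, a minimal sketch that combines Pete's session setup with the same status codes I whitelisted in the article might look like this (assuming a urllib3 version that accepts the status_forcelist argument):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retries = Retry(total=10, backoff_factor=1,
                    status_forcelist=[429, 502, 503])
    session = requests.Session()
    session.mount('http://', HTTPAdapter(max_retries=retries))
    session.mount('https://', HTTPAdapter(max_retries=retries))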

  • #6 Miguel Grinberg said

    @TomS: Right, using Celery would be a good way to offload those potentially long and blocking operations to a background job, where you can add retries.

  • #7 Vincent said

    Miguel,
    Thanks for your awesome site. I've been everywhere and nowhere had all the info I needed to start working with Flask. My project is now almost ready for production thanks to you! I just got your book for Christmas to top it all off. Nice work.

  • #8 Brandon Griffin said

    Thanks for writing this! It helped solidify some things in my brain, and the graphs are really helpful!
