How to Handle Failures in a Complex Microservices Architecture
Asaf Halili, Tue Dec 15 2020, 8 min
It first hit me when I spent an entire day (with my colleague, thanks Natasha!) patching messages because of a bug in one of our services: we had to change the way we handle failures.
Handling failures is always important, but it’s a must-have when a product handles sensitive data (especially financial transactions). It also becomes much harder in the world of microservices, since a failure can happen in any of the services and even in their dependencies.
In a nutshell, our unique “data-ingestion” flow is an event-driven flow that receives events, processes them and saves them to our data stores. We need to handle failures in those services to ensure the completeness and correctness of our data.
Diagram - Failure in a Service
There are many possible reasons for failures, from an outage of a third-party service (API, DB, etc.) to a hardware failure, to the “classic” software bug (after all, software developers are humans too, aren’t we? 🤔).
In this post, I’ll tell you about my journey to implement a solution for handling a large and diverse set of errors in the microservices world.

What’s All the Fuss About? Just Retry

“Retrying” is one of the go-to solutions when handling errors. Many libraries and frameworks have built-in retrying capabilities, and it has become a standard in the software industry.
The message failed to send to the queue? No problem, let’s send it again. However, it’s not that straightforward to implement a reliable and robust retrying mechanism.
Let’s list the required attributes of a reliable and robust retrying solution:
  • Smart - It should retry in a smart way. If we retry immediately, the failure might not be solved yet. In addition, some errors aren’t automatically recoverable and so require human intervention—we should also consider these.
  • Persistent - In the world of microservices and CI/CD, services can restart and a new version can be deployed at any time. If a service restarts after it has performed a failed request, the retry should be recovered after the service starts.
  • Customizable - It should retry on certain errors, but not on others, and we want to be able to configure the rules.
  • Pluggable - It must be easy and quick to add the solution to a new microservice.
Now that we have seen that retrying isn’t simple or straightforward, let’s see how other companies deal with the problem.

Should I Reinvent the Wheel?

No one wants (or has the time) to reinvent the wheel.
With that in mind, I started my research, trying to determine how other companies solve the problem. At first, I didn’t find much. Some companies use a simple retrying mechanism with some type of delay. 
Other companies have a backup to their persistence layers, so if their DB is down, they’ll write data to a text file instead, and later, manually or automatically, will save that data to the DB.
Another solution is to save the data to a data lake (which is essential regardless) in several locations in the flow, and if there is a severe software bug, the issue will be fixed first and then the data will be re-sent to the correct service.
None of these solutions satisfied our needs. We needed a solid solution that better answered the desired capabilities mentioned above.
So I continued to research, until I came across a great article by the Uber Engineering team: Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka. In the article, the team describes how they manage the reprocessing of their data using Kafka topics and dead-letter queues. At first glance, it looked like Uber’s idea would fit many of our needs, so I started to design a solution based on it.
Before I dive into the design, however, I should explain our needs in more detail, and for that let me tell you a bit about our product and architecture.

A Bit About Oribi

Oribi’s product is a marketing analytics tool for websites. Basically you install our script on your website, and we collect a lot of data about your visitors.
Via our web UI, you can see the data, query it in different ways, and gain insights to improve your conversion rate.
We work in a microservices architecture, with Kubernetes as an orchestration platform. All of our services are Dockerized and most of them are written in Java and Spring Boot.
We have two main flows:
  • Ingestion flow - An Apache Kafka-based, event-driven flow that receives events, processes them, and saves them to our data stores.
  • Query flow - A REST API-based flow that serves the data from our data stores to our application.
A failure handling mechanism is crucial to our ingestion flow services.
Now that you have an idea of what our architecture looks like, let’s get to the interesting part. 😎

Uber’s Idea

If you didn’t read the Uber article I mentioned above (and please do), I’ll try to sum up the main parts of their idea.
Let’s assume we have a service A, which receives events via HTTP calls, processes them and saves them to a DB. 
Service A
One day, during one of our peak hours, the DB is down, so service A is failing at saving data to it.
What now? First, our error handling solution should identify that there is an error. Then, it needs to grab the relevant context of the error (in this example, the handled HTTP request). 
After that, it should send that context to some persistence layer, for later re-execution of the service flow. Since at Oribi we’re heavy users of Kafka, we chose to use Kafka as our persistence layer for the solution (this means that in Kafka we’ll save the HTTP requests that failed to process).
In the case of service A, we’ll create the following Kafka topics:
  • service_a_topic_retry_1
  • service_a_topic_retry_2
  • ...
  • service_a_topic_dlq
When one of the HTTP requests fails, the request context (request body, query params, etc.) is sent as a Kafka message to the first retry topic. For the re-execution part, a consumer polls the first retry topic and re-processes each message. If that processing fails too, the message is sent to the second topic, and so forth, until it reaches the dlq (“dead-letter queue”) topic for manual analysis. (The number of retry topics in this example is arbitrary; it can be any number.)
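To make the routing concrete, here is a minimal sketch of how the chain of topics could be built and walked. The RetryTopics class and its method names are hypothetical (they are not part of our JAR or of Kafka); only the topic naming scheme comes from the list above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds the retry/DLQ topic chain and routes a failed message onward. */
final class RetryTopics {
    private final List<String> chain = new ArrayList<>();

    RetryTopics(String baseTopic, int retryCount) {
        if (retryCount < 1) {
            throw new IllegalArgumentException("retryCount must be >= 1");
        }
        // base_retry_1 ... base_retry_N, followed by the terminal base_dlq
        for (int i = 1; i <= retryCount; i++) {
            chain.add(baseTopic + "_retry_" + i);
        }
        chain.add(baseTopic + "_dlq");
    }

    /** Topic for a request that failed in the original (non-retry) flow. */
    String first() {
        return chain.get(0);
    }

    /** Topic that follows the one a failed message was consumed from. */
    String next(String currentTopic) {
        int index = chain.indexOf(currentTopic);
        if (index < 0) {
            throw new IllegalArgumentException("unknown topic: " + currentTopic);
        }
        if (index == chain.size() - 1) {
            throw new IllegalStateException("the DLQ topic is terminal");
        }
        return chain.get(index + 1);
    }

    /** All topics in order, ending with the DLQ. */
    List<String> all() {
        return List.copyOf(chain);
    }
}
```

With new RetryTopics("service_a_topic", 2), a message that fails in service_a_topic_retry_2 is routed to service_a_topic_dlq, where it waits for manual analysis.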
Let’s see how the solution I described applies to service A:
Diagram - Services with Retrying Mechanism
The polling duration of each consumer is determined by the polling duration of the previous topic and a multiplier. The idea is to implement an exponential backoff, which increases the chances that the failure will already be resolved by the next attempt.
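As a rough illustration of the backoff arithmetic, the sketch below derives each level's delay from the previous one and a multiplier. The Backoff class, the one-second base, and the multiplier of 2 are assumptions chosen for the example, not our production values.

```java
import java.time.Duration;

/** Hypothetical helper: delay before the consumer of retry topic N re-processes a message. */
final class Backoff {
    private Backoff() {}

    static Duration delayFor(Duration base, double multiplier, int retryLevel) {
        if (retryLevel < 1) {
            throw new IllegalArgumentException("retryLevel starts at 1");
        }
        Duration delay = base; // delay of the first retry topic
        for (int level = 2; level <= retryLevel; level++) {
            // each topic waits (previous topic's delay) * multiplier
            delay = Duration.ofMillis(Math.round(delay.toMillis() * multiplier));
        }
        return delay;
    }
}
```

With a base of one second and a multiplier of 2, the first three retry topics would wait 1, 2, and 4 seconds respectively.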

Design and Implementation

As explained earlier, we’re looking to match a set of attributes that are crucial for a reliable and robust retrying solution. Let’s tackle them one by one:
  • Smart - To answer this, we use exponential backoff retrying. We also have dead letter queue topics to allow for the manual analysis of errors.
  • Persistent - We use Apache Kafka as our persistence layer.
  • Customizable - The standard way to handle errors in Java is by using exceptions. We can control which errors should be retried and which shouldn’t by throwing exceptions for the errors that should be retried and by catching the others.
  • Pluggable - It’s pretty obvious that our solution needs to be a separate JAR that we’ll add as a dependency to our services. Regarding the integration, after a few discussions on the subject and since our Spring Boot setup is using annotations heavily, we decided to implement our own annotation for the purpose.
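To illustrate the “Customizable” point, here is a small sketch of a handler that lets a recoverable error propagate (so a retry mechanism wrapped around the method would see it) while catching an error that retrying can’t fix. The EventHandler class, its fields, and the choice of exception types are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler: thrown exceptions are retried, caught ones are not. */
final class EventHandler {
    final List<String> saved = new ArrayList<>();
    final List<String> rejected = new ArrayList<>();
    boolean storeAvailable = true; // toggled to simulate a data store outage

    /** Imagine the retry mechanism wrapped around this method. */
    public void handle(String event) {
        try {
            validate(event);
        } catch (IllegalArgumentException e) {
            // A malformed event will never succeed, so we catch the error here
            // and the retry mechanism never sees it.
            rejected.add(event);
            return;
        }
        // An outage is recoverable: the exception propagates and triggers a retry.
        save(event);
    }

    private void validate(String event) {
        if (event == null || event.isBlank()) {
            throw new IllegalArgumentException("empty event");
        }
    }

    private void save(String event) {
        if (!storeAvailable) {
            throw new IllegalStateException("data store unavailable");
        }
        saved.add(event);
    }
}
```

The rule of thumb in this sketch: if another attempt could plausibly succeed, let the exception escape the method; otherwise handle it in place.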
In short, I developed a JAR that implements a method annotation. Any of our services that adds the JAR as a dependency can put the annotation on top of a method and configure two Beans, and the method will automatically use the retrying solution.
Let’s see an example:
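@Bean
public KafkaPersistencyConfig kafkaConfig() {
        return KafkaPersistencyConfig.builder() // ... Kafka settings ...
                .build();
}

@Bean
public RetryingConfig retryingConfig() {
        return RetryingConfig.builder() // ... retrying settings ...
                .build();
}

// The retry annotation from the JAR goes on top of this method.
public void methodToRetryOnFailure(String arbitraryVariable) {
        throw new RuntimeException("Exception Occurred");
}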
Voila, the retrying mechanism works.
How cool is that? :)

How It’s Done

Our implementation relies heavily on Spring AOP (“Spring Aspect-Oriented Programming”). AOP is a cool idea, as it enables us, the developers, to think about code execution flows from a different perspective.
In this case, it allowed us to easily implement a method annotation. When the code identifies that a function is using our annotation, it starts the whole mechanism, which means listening for any exceptions that are thrown from the function and starting consumers for each retrying topic.
When the annotated function executes, the annotation code listens to it and is ready to catch any exceptions thrown from it. In the case above, it received an exception, so it sends the function parameters to the first retry topic. When any of the consumers consume a message, it invokes the annotated function with the message content and in case it fails, it sends the message to the next retry topic, or eventually to the DLQ topic.
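The flow above can be sketched without Spring or Kafka. The simplified simulation below replaces the retry topics with in-memory queues and skips the backoff delays entirely; InMemoryRetrier and its methods are hypothetical names, and the real implementation uses an aspect and Kafka consumers instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for the annotation, the retry topics and the DLQ. */
final class InMemoryRetrier {
    private final Consumer<String> target;                // the "annotated" method
    private final List<Deque<String>> retryQueues = new ArrayList<>();
    private final Deque<String> dlq = new ArrayDeque<>();

    InMemoryRetrier(Consumer<String> target, int retryLevels) {
        this.target = target;
        for (int i = 0; i < retryLevels; i++) {
            retryQueues.add(new ArrayDeque<>());
        }
    }

    /** Plays the role of the aspect: run the method, capture the argument on failure. */
    void invoke(String argument) {
        try {
            target.accept(argument);
        } catch (RuntimeException e) {
            queueAfter(-1).add(argument);                  // first retry queue
        }
    }

    /** Plays the role of the consumers: one pass over each retry level, in order. */
    void drainOnce() {
        for (int level = 0; level < retryQueues.size(); level++) {
            Deque<String> queue = retryQueues.get(level);
            int pending = queue.size();
            for (int i = 0; i < pending; i++) {
                String argument = queue.poll();
                try {
                    target.accept(argument);
                } catch (RuntimeException e) {
                    queueAfter(level).add(argument);       // next level, or the DLQ
                }
            }
        }
    }

    private Deque<String> queueAfter(int level) {
        int next = level + 1;
        return next < retryQueues.size() ? retryQueues.get(next) : dlq;
    }

    int pendingAt(int level) {
        return retryQueues.get(level).size();
    }

    List<String> deadLetters() {
        return new ArrayList<>(dlq);
    }
}
```

With two retry levels, a method that always throws ends up with its argument in the dead-letter list after one drain pass, while a method that recovers by its third attempt leaves the dead-letter list empty.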
This implementation has some limitations:
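  • Due to a limitation of Spring AOP, which works through proxies, only public methods in Spring Beans that are invoked from other Beans are supported. A call from one method to another within the same Bean bypasses the proxy, so it isn’t intercepted.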
  • Only methods with a single parameter are supported.
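The first limitation is easier to see with a plain JDK dynamic proxy, which is one of the proxying mechanisms Spring AOP can use. The sketch below is not Spring code; Ingestor, PlainIngestor, and ProxyDemo are hypothetical names made up for the illustration.

```java
import java.lang.reflect.Proxy;
import java.util.List;

interface Ingestor {
    void ingest(String event);
    void ingestAll(List<String> events);
}

final class PlainIngestor implements Ingestor {
    @Override
    public void ingest(String event) {
        // process a single event
    }

    @Override
    public void ingestAll(List<String> events) {
        for (String event : events) {
            ingest(event); // a call on "this": it never goes through the proxy
        }
    }
}

final class ProxyDemo {
    private ProxyDemo() {}

    /** Wraps the target in a proxy that records the name of every intercepted call. */
    static Ingestor wrap(Ingestor target, List<String> intercepted) {
        return (Ingestor) Proxy.newProxyInstance(
                Ingestor.class.getClassLoader(),
                new Class<?>[] { Ingestor.class },
                (proxy, method, args) -> {
                    intercepted.add(method.getName());
                    return method.invoke(target, args);
                });
    }
}
```

Calling ingestAll on the wrapped object records only the ingestAll call; the inner ingest calls run on the target directly and are invisible to the proxy. This is why an annotated method has to be called from another Bean to be retried.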


Since implementing the solution I’ve described, we’re much more confident when our system has failures, the system requires fewer manual interventions when it fails, and most importantly, we sleep well at night!
I hope this post helped you by providing you with some new ideas for handling failures in your applications.
Feel free to comment below with any questions/thoughts :-)