How to prepare your system to scale (OR why auto-scaling is not enough)
Eynav Mass, Wed Apr 21 2021, 10 min
Scale. We tend to dismiss it as just a buzzword, until we reach the point when it has taken over our engineering team’s nights, weekends and thoughts. Scaling is usually the negative effect of positive business growth. And as engineers, our goal is to be enablers of business growth, so we need to prepare our systems to accommodate increased loads—without issue or regression.
The question is, how does one best prepare for scale? How can we proactively prepare our systems to handle a wishful, yet expected, future load?
There are multiple, in fact endless, directions to go in when it comes to architectural changes to a system, especially when a system is large with many components. Since scaling is part of business growth, a common approach is to outline a plan for scaling, keeping the expected growth of an engineering team in mind. But if resources are short, it can be hard to address multiple adjustment directions at once, and so one must choose what to focus on, or rather which changes to “bet” on, in order to “win” the upcoming load challenge.
In this post I will share some considerations that are not often discussed, yet are not trivial when you begin to think about preparing for scaling. These considerations are taken from our own experiences at Oribi, where we have confronted scaling of all kinds.
What does scaling look like?
Usually when you think of scaling, you imagine a graph with a peaceful line that, at some point, lurches upward in a vast jump. This is a scale peak. A scale peak results from a major change at a business; it can be driven by one large customer, or by a sudden business success that drives up usage.
[An example of a scale peak we had at Oribi]
But scaling has another shape, in which the system experiences a slowly growing load: not sudden, but a solidly increasing pace of usage. This kind of scaling can be tricky, and is quite similar to the boiling frog experiment. Your decision-making sensors aren’t triggered by the steady increase, and you can easily find yourself at 90% system load without ever noticing.
[An example of steadily increasing usage that eventually resulted in a load at Oribi]
“Get ready for tomorrow today” (Jvongard)
Well, that's easy to say but hard to implement, because:
Your system is already large and/or complicated, and making big changes takes time.
Or conversely, your system is still small or immature, and thinking “big” may feel irrelevant at your current stage of growth. Perhaps you tend to invoke YAGNI and ignore the internal voice that calls for making changes to help “tomorrow’s” system rather than “today’s”.
You are short on resources. Having been in the tech industry for more than 15 years now, I must say that I have never felt wealthy in engineering resources. That’s just how it is; there will never be a peaceful time to plan ahead for scaling.
You lack knowledge. You have the will to change and adjust the system, but you are not sure how.
Oribi and scaling
As mentioned, at Oribi we have experienced many different types of scaling. We have always aimed to be well prepared for scaling, yet we have still had production surprises from time to time. Looking back on those surprises, most of the time they happened because the preparation actions we had taken were in some way trivial, or were “quick wins.” The most valuable actions have been the ones that went deeper into our system infrastructure, and those cannot be taken once production is already in the midst of a scaling crisis.
There are three topics that you should think seriously about as you prepare for scaling, and they definitely shouldn’t be dismissed with the “You ain't gonna need it” excuse: elasticity, batching, and starvation. Try to think about your system with respect to each topic, analyzing its current status and considering how complicated it would be to improve upon. The more complicated the changes are, the more you should consider them ahead of time, since their cost will grow exponentially once your load has become large.
Elasticity
Map your system. If at this point you do not have a decent diagram of your architecture, creating one is a great first step.
Locate the components that can be easily or even automatically scaled out. These are the parts of the system that you should be less worried about. Such auto-scaling components should (theoretically) act as expected while scaling, and if they don’t auto-scale for some reason, you should be able to implement a manual workaround to scale them out.
The parts of the system that you should really be worried about are the nonelastic components. Usually these are databases or data stores, message queues, network components or even badly planned services in your architecture. When trying to find these components in your system, you should consider the ability of each one to not only scale with compute resources (CPU), but also with memory, storage, I/O, concurrency and the network.
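As a sketch of this mapping exercise, the toy inventory below flags every dimension along which a component cannot scale. The component names and their capabilities are illustrative, not a real architecture:

```python
from dataclasses import dataclass

DIMENSIONS = {"cpu", "memory", "storage", "io", "concurrency", "network"}

@dataclass
class Component:
    name: str
    scales_on: set  # dimensions this component can scale along

# Hypothetical inventory -- names and capabilities are illustrative.
components = [
    Component("api-gateway", set(DIMENSIONS)),   # stateless: fully elastic
    Component("worker-pool", {"cpu", "memory", "concurrency"}),
    Component("postgres-main", {"cpu"}),         # storage and I/O are fixed
    Component("kafka-cluster", {"cpu", "storage"}),
]

# A component is nonelastic in every dimension it cannot scale along;
# those are the parts to worry about ahead of time.
for c in components:
    gaps = DIMENSIONS - c.scales_on
    if gaps:
        print(f"{c.name}: worry about {sorted(gaps)}")
```

Even a crude table like this makes the nonelastic components, and their specific bottleneck dimensions, visible before the load arrives.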
For example, during one of our recent scale peaks, our system was eventually slowed by our message queue. We use Apache Kafka, and we had checked it and assumed it was ready for scaling. We had successfully practiced adding brokers and rebalancing the cluster, and we had also prepared a scaling playbook.
But since reality is always full of surprises, our scaling situation produced a peak that brought our Kafka brokers to 85% CPU load. We duly followed our scaling playbook and added brokers to scale the Kafka cluster out. The next action in the playbook was to rebalance the data among all the Kafka brokers, including the newly added ones, in order to spread the load equally across the cluster. The issue was that, as with most message queues, such a rebalancing itself requires more CPU. And since our Kafka cluster was already at 85% CPU usage, the rebalancing would have brought it to 100% load, and with that to a major failure.
What next? At this point the cluster was “choking,” but we still had to act by scaling it out as planned. I will skip the full story of how we eventually resolved this nearly “deadlocked” situation, since the topic deserves its own dedicated post :)
The best approach, of course, is to prevent such situations by building the needed capabilities into your system architecture in advance. Unfortunately (as in our story), you may find yourself short on resources and need to plan workarounds instead of adding more system capability. But be careful about counting on such workarounds as you map your architecture’s elasticity. If you assume that a component can scale out, take into account the limitations of the scaling-out action itself, and list it among your planned actions.
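The Kafka incident above can be generalized into a simple playbook guard: before triggering a rebalance, check that every broker has enough CPU headroom to absorb the cost of the rebalance itself. This is a minimal sketch; the 20% rebalance cost and 95% ceiling are assumed numbers, not measured ones.

```python
# Assumed extra CPU fraction a partition rebalance consumes, and the
# ceiling we never want the cluster to cross. Both values are illustrative.
REBALANCE_CPU_COST = 0.20
MAX_SAFE_CPU = 0.95

def safe_to_rebalance(broker_cpu_loads):
    """Return True only if every broker can absorb the rebalance cost."""
    return all(load + REBALANCE_CPU_COST <= MAX_SAFE_CPU
               for load in broker_cpu_loads)

# At ~85% load, a rebalance would push brokers past the ceiling: blocked.
print(safe_to_rebalance([0.85, 0.82, 0.86]))  # False
# With plenty of headroom, the playbook step is allowed to proceed.
print(safe_to_rebalance([0.60, 0.55, 0.62]))  # True
```

The point is that the scaling action itself consumes the very resource you are trying to relieve, so the playbook must account for its cost up front.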
Batching
I can see how batching could be considered a luxury when planning architectural changes to support scaling: if you don’t implement batch processes at an early stage of your architecture, adding them later may not be straightforward. As a result, batching tends to be the area that gets neglected, particularly when planning for the short term rather than the long term (remember YAGNI?).
Still, the cost and complication of changing your system architecture to work with batch processes can pay off. To benefit from a good batching process, you must plan it well ahead of time; otherwise, batching ends up being the very last magic trick you pull from your hat in the middle of a scaling crisis.
If your system has parts that work in a streaming mode, enable them to also function via batch processing. For example, databases with large numbers of writes are good candidates for switching from streaming writes to batched writes.
Why? When load on your system increases, streaming is harder to control and more complicated to limit. Back to the database-write example above: scaling database I/O is often not possible, or may require downtime. Writing in batches, on the other hand, helps you control the load on the database and protect it from total failure or unavailability.
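As a minimal sketch of the idea, the buffer below turns a stream of single-row writes into batched flushes, triggered by size or age. `flush_fn` stands in for whatever bulk-write call your database client provides; the thresholds are illustrative.

```python
import time

class BatchWriter:
    """Buffer single-row writes and flush them as one bulk write."""

    def __init__(self, flush_fn, max_rows=500, max_age_s=2.0):
        self.flush_fn = flush_fn    # e.g. a bulk-insert call on your DB client
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def write(self, row):
        self.buffer.append(row)
        too_big = len(self.buffer) >= self.max_rows
        too_old = time.monotonic() - self.last_flush >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one bulk write instead of N writes
            self.buffer = []
        self.last_flush = time.monotonic()

# Usage: collect flushed batches in a list instead of a real database.
batches = []
writer = BatchWriter(batches.append, max_rows=3)
for i in range(7):
    writer.write(i)
writer.flush()  # drain the remainder
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The size and age knobs are exactly the control you lack in pure streaming: under load you can raise `max_rows` to trade latency for fewer, larger writes, instead of letting the database absorb every event individually.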
Starvation
Have you seen the movie “The Lion King”? Most software systems are not a reflection of its “circle of life” theory. By that I mean that the average system does not act like a food chain in which each species (read: user) has enough resources for itself. Rather, the common system architecture has several parts that are at risk of starvation, a situation in which one user, or just a few, eventually commands most of the system’s resources.
When starvation is combined with scale, the parts at risk of starvation will be the first to get blocked by those few users once the system is loaded. For example, if your application queries a database directly, a user action on the dashboard that requires a large amount of data, or a very heavy computation, can put a high load on the database directly. Adding that load on top of an already growing load will result in insufficient resources and an unhealthy state that you cannot control.
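One hedge against this is a per-user cap on heavy work. The sketch below rejects a user's queries once that user already has too many in flight, so a single tenant cannot command the whole database; the names and the limit of 2 are illustrative.

```python
import threading

class PerUserLimiter:
    """Cap the number of in-flight heavy operations per user."""

    def __init__(self, limit=2):
        self.limit = limit
        self.active = {}              # user_id -> in-flight count
        self.lock = threading.Lock()

    def try_acquire(self, user_id):
        with self.lock:
            if self.active.get(user_id, 0) >= self.limit:
                return False          # reject or queue instead of starving others
            self.active[user_id] = self.active.get(user_id, 0) + 1
            return True

    def release(self, user_id):
        with self.lock:
            self.active[user_id] -= 1

limiter = PerUserLimiter(limit=2)
print(limiter.try_acquire("big-customer"))    # True
print(limiter.try_acquire("big-customer"))    # True
print(limiter.try_acquire("big-customer"))    # False -- capped
print(limiter.try_acquire("small-customer"))  # True -- unaffected
```

Rejecting (or queueing) the third concurrent heavy query from one tenant is a far healthier failure mode than letting that tenant push the shared database into unavailability for everyone.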
If you are in the early stages of business, a starvation situation can sound like a faraway possibility. However, you should challenge your mind to imagine the effects of scale on your system, leaving the starvation possibility on the table. You can never predict when it will really hit your system.
Your scaling weapon: observability
Personally, I don't agree with the mantra “you can never have enough monitoring.” In reality, you don’t need monitoring on your whole system; rather, you need the best sensors for detecting system health. Observability is not about tracking every change in a system, it is about alerting you when the system has reached a bad state that you can and should act upon.
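In that spirit, a useful sensor alerts on a sustained bad state rather than on every blip. A minimal sketch, with an assumed 90% threshold and three-sample window:

```python
def should_alert(samples, threshold=0.9, window=3):
    """Fire only when the last `window` load samples all exceed `threshold`."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

# A couple of isolated spikes: not an actionable state, no alert.
print(should_alert([0.5, 0.95, 0.6, 0.92]))   # False
# Load has settled above the threshold: this is the state to act upon.
print(should_alert([0.7, 0.92, 0.94, 0.93]))  # True
```

This is the difference between a sensor and mere monitoring: the alert encodes a condition you have already decided requires action, instead of reporting every fluctuation.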
As mentioned earlier, one reason it is so hard to prepare for scale is lack of resources, a reality that can sometimes defeat even the best practices I have covered in this post. This makes sense, and it is common. But if you cannot “afford” the architectural changes, you can at least insist on adding observability tools and processes, so that you can act in time when problems do arise.
Looking back on all the scaling experience we have had at Oribi, I actually take back my earlier statement that “scale is usually the negative effect of positive business growth.” In fact, with a good combination of tools and teamwork, scale can result in a better system, more stable user flows, an improved engineering culture, and extended knowledge.
So as your system evolves, try to forensically examine each major scaling event using the points above: elasticity, batching, and starvation. Count those scaling events as positive experiences that help you build a great process and, with the right observability methods, enable you to continuously plan ahead for your next scaling experience.