Production Support Caseload: From Burden to Growth Opportunity
Eynav Mass, Sun Jan 17 2021, 12 min
Being technical people, we see everything through engineering binoculars: we usually consider the effects of increased scale mainly with respect to our system architectures. We figure out how infrastructure can handle the load better, we analyze the components that are acting as bottlenecks, and so on. But if we zoom out a bit, we see that increased scale affects many more parts of an organization: when incoming traffic increases, or the daily usage numbers rise, it’s a sign that the number of users has grown as well.
And growing the number of users results, at some point, in an increased number of production support cases. Usually an organization has support tiers to help reduce the load, but some of it still makes its way to the engineering team's tier.
Several months ago, we started “feeling” a disruptive number of support cases, all requiring the help and involvement of the R&D team. We had a defined process for dealing with production support cases that had worked well up until then, but this time we struggled to digest the increasing number of cases. For a while we thought we were experiencing peak usage and that it would soon decrease. But after two challenging sprints where we missed our goals, we came to understand that the load was here to stay.
Sometimes I hear engineers complain about production support issues, since fixing an edge-case bug or helping with an uncommon usage of an app are not as exciting as redesigning a system or implementing a new shiny feature. At Oribi, though, we see the situation differently: we know that our users are an integral part of our product’s success. When we observe an increased load of user requests, we happily conclude that we now have more users that are using our product. It’s even better than that—we have more users for our product that are engaged and that aim to succeed with it.
The goal with an extra large number of production support cases is to manage them right, and manage them continuously. In this post I will share how we transformed the support process from a disruption into a learning experience and cross-organizational improvement opportunity.
“Data beats emotions.” (Sean Rad)
The first step we took was to move emotions out and bring analytical thinking in. This should be an obvious action for engineers since most of our thinking is analytically based. But we are human after all, and sometimes it’s easier to waste energy on frustration rather than take actionable steps.
At the first stage, we decided on the period of time that was relevant for us to analyze, and we gathered together the incoming cases along with their various parameters. This was an easy step since we did have an organized process and the information for each case was well documented.
When we had an organized bank of case information, it was a good time to stop and ask ourselves—what are the main organizational areas we would like to analyze with this project? It was important to choose at least two focus areas that affected the production support of the company (and customers). So, we decided to focus on the following:
Support process: We wanted to understand which parts of the process worked well and which required improvement in order to be effective. This may sound like a non-engineering related topic, but as R&D is the “last stop” in the support process flow, we wanted to minimize the number of cases sent to us. A good way of minimizing the R&D part of the flow is to ensure that the prior tiers work well and effectively. We also wanted to learn about the escalation flow and how cases were moving from one tier to another.
Product: We aspire to have a state-of-the-art product, which is by all means a matter of an excellent user experience: ease of use, great performance, beautiful UI, helpful product flows and more. By examining the product aspects of the increased caseload, our goal was to make sure that our product behavior was aligned with the required user experience. On topics where the product was not aligned, we wanted to make sure that our roadmap plans matched the desired changes and addressed customers’ “pain.” This may also seem to be a non-engineering topic, and it is. The engineering team should focus on the product roadmap. If they spend a lot of time on support issues that differ from the current product goals, this is an important red flag to follow and analyze.
Now we had a respectable number of support cases along with clear focus areas we wanted to analyze. The next step was to decide on a set of criteria, which would allow us to analyze each case according to the focus areas (support process and product). Each criterion was a question we wanted to ask and learn from.
Here are some examples of criteria that we decided on:
Criterion questions related to the support process:
Was the case resolved by sharing information with the user, or did it require a code/system change?
Was the case investigated immediately or did it require more data from the prior support tier?
Did the priority given to the case reflect the urgency of it? For example, if the priority was “high,” was the user blocked from using the product?
How much time was spent on the case by each tier?
How much time did it take until a full resolution?
Criterion questions related to the product:
Which main feature was related to the support case?
Was the reported issue part of a wider issue? If so, what was the issue?
What was the urgency of the case? Did it block the user from using the product, did it block the user from using some features of the product, or did it affect the user experience without blocking the product?
Was the case reported more than once?
Was the case resolved by sharing information with the user, or did it require a code/system change? (This question was asked in relation to the support process, but it had value for the product as well, since it helped us learn where we could invest in documenting features or in improving the product’s ease of use.)
What was next?
We used a scoring method, according to the defined focus areas: product score and support process score. The product score represented the severity of the case: the higher the score, the higher the severity and the more consideration it required. The support process score represented the effectiveness of our process for handling the case flow: a high score meant the process had room for improvement.
Survey time :)
Now that we had the methodology, it was time to fill in the data: for every support case from the previous three months, we answered the selected questions and calculated a score. It might seem like this was a time-consuming task, but it was distributed throughout the team and was a pleasant activity.
Our goal was to isolate cases with a high product score and with a high support process score—and to derive conclusions accordingly. The score was up to the team to define; they had to think deeply about each possible answer to the question, and also consider how the scoring of individual answers would affect the overall score.
Example of scoring:
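To make the idea concrete, here is a minimal sketch of what such a scoring scheme could look like. The questions, answer values, and point weights below are hypothetical, not Oribi's actual rubric:

```python
# Hypothetical scoring sketch: each criterion question maps an answer
# to points; a case's score is the sum over the relevant criteria.

# Illustrative weights -- a real rubric would be tuned by the team.
PRODUCT_CRITERIA = {
    # Did the case block the user fully, partially, or affect UX only?
    "blocking_level": {"full": 3, "partial": 2, "ux_only": 1},
    # Was the same issue reported more than once?
    "reported_repeatedly": {True: 2, False: 0},
}

PROCESS_CRITERIA = {
    # Was the case resolved by sharing information, or a code change?
    "resolution_type": {"code_change": 0, "information": 2},
    # Did resolution stall waiting for data from a prior tier?
    "needed_more_data": {True: 2, False: 0},
}

def score(case, criteria):
    """Sum the points of each answered criterion question."""
    return sum(points[case[q]] for q, points in criteria.items())

case = {
    "blocking_level": "full",
    "reported_repeatedly": True,
    "resolution_type": "information",
    "needed_more_data": False,
}

product_score = score(case, PRODUCT_CRITERIA)   # 3 + 2 = 5
process_score = score(case, PROCESS_CRITERIA)   # 2 + 0 = 2
```

A case like this one would surface on both axes: severe for the product (it fully blocked a user, repeatedly) and a process smell (it was closed with information that better documentation might have provided).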
In addition to the scored questions, we also had open questions, which provided us with a “big picture” view of the product and process. For example, they helped us to summarize the number of cases per related feature.
From data to decisions
Now that the data was analyzed and aggregated into conclusions, it was time to turn the conclusions into actions. This was my favorite step, drawing the lines between the dots, zooming out to see the bigger picture.
So for this step we considered all of the collected data and scores. Analyzing the cases in this way created a clear picture of the “pain” related to the increased support caseload: how it affected customers, the Engineering team, the Product team, the Sales team and the Customer Service department. Now we were able to ask several valuable questions, like:
Which are the most frequent cases? We increased the priority of these cases and translated them into tasks for the relevant team.
Which are the most severe case types? The cases with the highest scores were those that totally blocked customers from using the product. We added monitoring to detect these types of cases before the customer senses them. And where case resolution required engineering, we added it to our short-term plans.
Which case-handling support flow steps could we improve? These were the cases with the highest support flow scores. For example, we learned where we could invest more time on documentation to avoid support cases that were simply information requests.
When a case could not be handled and required more information, we learned that we should improve the escalation process from tier to tier. Such a case required more time and more resources, and caused more annoyance to the customer in the form of repeated questions. All of this could be avoided, though, with better data collection at the first stage by the initial support tier.
We categorized the cases according to their related product features. This helped us to understand how they were distributed among our product flow use cases. It also helped us to understand usage trends of the product better, and helped us determine the location of major pain points.
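A per-feature tally takes only a few lines once each case records its related feature. A sketch, with illustrative field names and data:

```python
from collections import Counter

# Illustrative case records -- in practice these come from the
# documented support-case bank described earlier.
cases = [
    {"feature": "funnels", "blocked_user": True},
    {"feature": "funnels", "blocked_user": False},
    {"feature": "email_reports", "blocked_user": False},
]

# Distribution of cases per product feature.
per_feature = Counter(c["feature"] for c in cases)

# The most frequent feature is a candidate major pain point.
top_feature, count = per_feature.most_common(1)[0]
```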
Issues that mostly harmed the user experience but were complicated to solve were turned into “Big Rocks” on our roadmap plan and were given a clear timeline for resolution, one that we could share with the customer as well.
Now that the data was aggregated and analyzed, it was time to translate the information into actionable insights for each department in the organization: Engineering, Product, Customer Service and Sales.
As I am a true believer in teamwork, I didn’t create any actionable items ahead of time. Instead, I summarized the conclusions that pertained to each department and held several brainstorming sessions to discuss the results. Thus decisions were made together, in a way that aligned all teams on the improvements.
“Good decisions come from experience. Experience comes from making bad decisions.” (Mark Twain)
If you have read all the way to here, you're probably wondering how it’s going for us now. Well, we have made great progress but there is still room for improvement. The process I have described is not a one-time effort, but rather an ongoing observational task that we have adopted and that we run every three months. When all is running smoothly we skip the analysis and set a checkpoint to review the process again three months later. How do we know that all is running smoothly? As I quoted earlier—“data beats emotions”—so a good way of concluding the process I have described is by setting clear KPIs to measure ongoing effectiveness. The KPIs should reflect the lessons learned from the data analysis, since they must be set to measure the goals we are trying to achieve.
Here are several examples of KPIs that we have set up:
Time spent on a case from the time it is opened until it is resolved
Time spent on a case by each support tier
Number of cases that are closed by sharing information/documentation
Cases that are opened on regression bugs
Cases that are moved to an earlier tier due to lack of information
Number of cases related to performance issues
Number of cases with priority marked “high”
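Once each case carries timestamps and outcome fields, KPIs like these fall out of the same case data directly. A sketch, with hypothetical field names and records:

```python
from datetime import datetime

# Hypothetical case records carrying the fields the KPIs need.
cases = [
    {"opened": datetime(2021, 1, 4, 9, 0),
     "resolved": datetime(2021, 1, 4, 15, 30),
     "closed_by_information": True,
     "priority": "high"},
    {"opened": datetime(2021, 1, 5, 10, 0),
     "resolved": datetime(2021, 1, 7, 10, 0),
     "closed_by_information": False,
     "priority": "low"},
]

# KPI: average time from open to full resolution, in hours.
hours = [(c["resolved"] - c["opened"]).total_seconds() / 3600
         for c in cases]
avg_resolution_hours = sum(hours) / len(hours)

# KPI: cases closed by sharing information/documentation.
info_closed = sum(c["closed_by_information"] for c in cases)

# KPI: cases with priority marked "high".
high_priority = sum(c["priority"] == "high" for c in cases)
```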
In addition to the KPIs, we have defined a cross-organizational SLA for the production support flow. You may be wondering why we didn’t have this defined earlier. The answer is that we did have an informal desired SLA for support flow, but it was hard to follow and too fluid to measure. By understanding that we needed a well-defined SLA, we encouraged the company’s teams to rethink the support flow—together. The result was a better, measurable SLA that everyone believes in and is engaged with.
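A measurable SLA can be expressed as explicit per-tier thresholds checked against the same case data. The tier names and hour limits below are invented for illustration:

```python
# Hypothetical SLA thresholds per support tier, in hours.
SLA_HOURS = {"tier1": 8, "tier2": 24, "engineering": 72}

def breaches(case_durations):
    """Return the tiers where a case exceeded its SLA.

    case_durations: mapping of tier -> hours the case spent there.
    """
    return [tier for tier, hours in case_durations.items()
            if hours > SLA_HOURS.get(tier, float("inf"))]

# A case that sat 30 hours in tier2 breaches the 24-hour SLA.
breaches({"tier1": 2, "tier2": 30})  # -> ["tier2"]
```

Writing the SLA down as data like this is what makes it followable: it can be checked automatically per case rather than remembered informally.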
Let’s review the whole process we underwent to address the increased scale of our production support caseload:
We started with data, collecting information covering around three months of production support cases. Then we set a methodology for analyzing the cases, with the goal of understanding the general support caseload from various vantage points. In order to gain a sense of the main barriers, we defined a helpful scoring mechanism to analyze the cases. At that point we added up all of the scoring and calculated data, drew conclusions, and then encouraged the organization to take actionable steps to improve each part. To ensure that we hit our goals, we agreed on measurable KPIs (which are revisited periodically).
The main takeaway from this post is to always “listen” carefully for disruptions to an engineering team’s work, especially when the source is production support cases. Increased support caseload isn’t felt just by the engineering team but rather by the whole company. Thus fixes for the problem should be accompanied with relevant plans for supporting the company’s growing scale.
With an organized cross-team process for analyzing your production support caseload, you can learn valuable information related to product, engineering, architecture, customer success methods and sales flows. The goal is to translate your customers’ problems into a team-wide improvement process. Once you sense these signals, don’t lose control and just follow—take the reins—it’s your opportunity to make a real impact.