Note: this article is part 4 of a series called Accelerated Velocity. This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.
“If a product or system chokes and it’s not being monitored, will anyone notice?” Unlike the classic thought experiment, this tech version has a clear answer: yes. Users will notice, customers will notice, and eventually your whole business will notice.
No one wants their first sign of trouble to be customer complaints or a downturn in the business, so smart teams invest in developing “situational awareness.” What’s that? Simply put, situational awareness is the result of having access to the tools, data and information needed to understand and act on all of the moving factors relating to the “situation.” This term is often used in the context of crisis situations or other fast-paced, high-risk endeavors, but it applies to business and network operations as well.
Product development teams most definitely need situational awareness. The product managers and development leads need to know what their users are doing and how their systems are performing in order to make wise decisions – for example, whether the next iteration should focus on features, scale or stability. Sadly, these same product teams often see the tracking and monitoring that is needed for developing situational awareness as “nice-to-haves” or something to be added when the mythical “someday” arrives.
The result? Users having good or bad experiences and no one knowing either way. Product strategy decisions being made on individual bias, intuition and incomplete snippets of information. Not good.
Sun Tzu put it succinctly:
“If you know neither the enemy nor yourself, you will succumb in every battle.”
Situational awareness is a huge topic, so in this series I’m going to limit my focus to data collection (tracking and monitoring) and insights (analytics and visualization) at the product team level. For the purposes of this series I’ll define “tracking” as the data and tools that show what users/customers are doing and “monitoring” as the data and tools that focus on system stability and performance. Likewise I’ll use “analytics” to refer to tools that facilitate the conversion of data into usable intelligence and “visualization” to refer to the tools for making that intelligence available to the right people at the right time. I’ll cover monitoring in this article and tracking in a later article.
At Bonial in 2014 there was a feeling that things were fine – the software systems seemed to be reasonably stable and the users appeared happy. Revenue was strong and the few broad indicators viewed by management seemed healthy. Why worry?
From a system stability and product evolution perspective it turns out there was plenty of reason to worry. While some system-level monitoring was in place, there was little visibility into application performance, product availability or user experience. Likewise our behavioral tracking was essentially limited to billing events and aggregated results in Google Analytics. Perhaps most concerning: one of the primary metrics we had for feature success or failure was app store ratings. Hmmm.
I wasn’t comfortable with this state of affairs. I decided to start improving situational awareness around system health, so I worked with Julius, our head of operations, to lay out a plan of attack. We already had Icinga running at the system level as well as DataDog and Site24x7 running on a few applications, but they didn’t consistently answer the most fundamental question: “are our users having a good experience?”
So we took some simple steps like adding new data collectors at critical points in the application stack. Since full situational awareness requires that the insights be available to the right people at the right time, we also installed large screens around the office that showed a real-time stream of the most important metrics. And then we looked at them (a surprisingly challenging final step).
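To make the idea of a data collector at a critical point in the stack a little more concrete, here is a minimal sketch in Python. It is not the tooling we actually used; the `RequestCollector` class and its method names are hypothetical, and a real setup would ship these numbers to a monitoring backend rather than hold them in memory. It simply wraps a user-facing call, records how long it took and whether it failed, and exposes the two numbers that speak most directly to user experience: error rate and tail latency.

```python
import math
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class RequestCollector:
    """Hypothetical in-process collector for one user-facing endpoint."""

    latencies_ms: list = field(default_factory=list)
    errors: int = 0

    @contextmanager
    def observe(self):
        """Time the wrapped block and count it as an error if it raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.latencies_ms.append(elapsed_ms)

    def error_rate(self):
        """Fraction of observed calls that raised an exception."""
        total = len(self.latencies_ms)
        return self.errors / total if total else 0.0

    def percentile(self, pct):
        """Nearest-rank percentile of the recorded latencies, in ms."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
        return ordered[index]
```

A handler would wrap its work in `with collector.observe():`, and a dashboard or alert could then read `collector.error_rate()` and `collector.percentile(95)` on a schedule.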
The initial results weren’t pretty. With additional visibility we discovered that the system was experiencing frequent degradations and outages. In addition, we were regularly killing our own systems by overloading them with massive online marketing campaigns (for which we coined the term “Self Denial of Service”, or SDoS). Our users were definitely not having the experience we wanted to provide.
(A funny side note: with the advent of monitoring and transparency, people started to ask: “why has the system become so unstable?”)
We had no choice but to respond aggressively. We set up more effective alerting schemes as well as processes for handling alerts and dealing with outages. Over time, we essentially set up a network operations center (NOC) with the primary responsibility of monitoring the systems and responding immediately to issues. Though exhausting for those in the NOC (thank you), it was incredibly effective. Eventually we transferred responsibility for incident detection and response to the teams (“you build it, you run it”), who then carried the torch forward.
Over the better part of the next year we invested enormous effort into triaging the immediate issues and then making design and architecture changes to fix the underlying problems. This was very expensive as we tapped our best engineers for this mission. But over time daily issues became weekly became monthly. Disruptions became less frequent and planning could be done with reasonable confidence as to the availability of engineers. Monitoring shifted from being an early warning system to a tool for continuous improvement.
As the year went on, the stable system freed up our engineers to work on new capabilities instead of responding to outages. This in turn became a massive contributor to our accelerated velocity. Subsequent years were much the same: with continued investment in both awareness and tools for response, we now confidently set and measure aggressive SLAs. Our regular investment in this area massively reduced disruption. We would never have been able to get as fast as we are today had we not made this investment.
We’ve made a lot of progress in situational awareness around our systems, but we still have a long way to go. Despite the painful journey we’ve taken, it boggles my mind that some of our teams still push monitoring and tracking down the priority list in favor of “going fast”. And we still have blind spots in our monitoring and alerting that allow edge-case issues – some very painful – to remain undetected. But we learn and get better every time.
Some closing thoughts:
- Ensuring sufficient situational awareness must be your top priority. Teams can’t fix problems that they don’t know about.
- Monitoring is not an afterthought. SLAs and associated monitoring should be a non-functional requirement (NFR) for every feature and project.
- Don’t allow pain to persist – if there’s a big problem, invest aggressively in fixing it now. If you don’t, you’ll just compound the problem and demoralize your team.
- Lead by example. Know the system better than anyone else on the team.
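
As an illustration of the point about treating SLAs as a requirement for every feature, here is a small sketch of what an SLA written down as something checkable might look like. The `AvailabilitySla` class, the feature name and the 99.9% target in the usage note are all hypothetical and are not our actual numbers; the request counts are assumed to come from whatever monitoring is already in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySla:
    """Hypothetical availability SLA attached to a single feature."""

    feature: str
    target: float  # required fraction of successful requests, e.g. 0.999

    def availability(self, total_requests, failed_requests):
        """Observed fraction of successful requests in the window."""
        if total_requests == 0:
            # No traffic means nothing failed; treat the window as healthy.
            return 1.0
        return (total_requests - failed_requests) / total_requests

    def is_met(self, total_requests, failed_requests):
        """True if observed availability reaches the agreed target."""
        return self.availability(total_requests, failed_requests) >= self.target
```

For example, `AvailabilitySla("search", 0.999).is_met(10000, 5)` passes, while the same SLA with 20 failures out of 10,000 requests does not. The value of writing it this way is less the arithmetic than the fact that the target exists before the feature ships.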
In case you’re interested, here are some of the workhorses of our monitoring suite: