Accelerated Velocity: Creating an Architectural Runway

Most startups are, by necessity and by design, minimalistic when it comes to feature development.  They build their delivery stack (web site or API), a few tools needed to manage delivery (control panel, CMS) and then race to market and scramble to meet customer requests.  Long term architecture thinking is often reduced to a few hasty sketches and technical debt mitigation is a luxury buried deep in the “someday” queue. 

At some point success catches up and the tech debt becomes really painful.  Engineers spend crazy amounts of time responding to production issues, time they could have spent developing new capabilities.  New features take longer and longer to implement.  The system collapses under new load.  At this point tweaks won’t save the day.  An enterprise architecture strategy and runway are needed.

What is an architecture runway?  In short, it’s a foundational set of capabilities, aligned to the big-picture architecture strategy, that enables rapid development of new features.  (SAFe describes it well here.)  In plain English – it’s investing in foundational capabilities so features come faster.

The anchor of the architecture runway is, of course, the architecture itself.   I’m not going to wade into the dogmatic debate about “what is software architecture”; rather, I’ll simply state that a good architecture creates and maintains order and adaptability within a complex system.  The architecture itself should be guided by a strategy and long-term view on how the enterprise architecture will evolve to meet the needs of the business in a changing market and tech-space.   

In developing an architecture strategy and runway, architects should start with the current state. At the very least, create a simple diagram that gives context to everyone on the team as to what pieces and parts are in the system and how they play together.   Once the “as is” architecture is identified and documented, the architects can roll up their sleeves and develop the “to be” picture, identify the gaps between the two states, and then develop a strategy for moving towards the “to be”.  The strategy can be divided into discrete epics / projects, and construction of the runway can begin.
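As a rough illustration of the gap-identification step, the "as is" and "to be" pictures can be treated as two inventories of capabilities and compared with plain set operations. This is only a minimal sketch: the capability names and the `gap_analysis` helper are hypothetical and are not taken from any real architecture diagram.

```python
# Hypothetical capability inventories; the names are illustrative only.
as_is = {"web delivery", "mobile API", "content publishing", "batch jobs"}
to_be = {
    "web delivery",
    "mobile API",
    "content publishing",
    "monitoring",
    "data pipeline",
    "REST APIs",
}


def gap_analysis(current, target):
    """Compare two capability sets and group the differences."""
    return {
        "build": sorted(target - current),   # in the target state, missing today
        "keep": sorted(target & current),    # present in both states
        "retire": sorted(current - target),  # present today, absent from the target
    }


if __name__ == "__main__":
    for action, capabilities in gap_analysis(as_is, to_be).items():
        print(f"{action}: {', '.join(capabilities)}")
```

Each entry in the "build" group is a candidate epic for the runway. A real gap analysis would also weigh effort, risk and business priority, which this sketch leaves out.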

Bonial’s Architecture Runway

Success had caught up to Bonial in 2014.  Given the alternative, I think everyone would agree that that’s the right problem to have, but it was a problem nonetheless.  The majority of the software was packaged into a single, huge executable called “Portal3,” which contained all of the business logic for the web sites, mobile APIs, content publishing system and a couple dozen batch jobs.  There were a few ancillary systems for online marketing and some assorted scripts, but they were largely “rogue” projects which didn’t add to the overall enterprise coherence.  While this satisfied the immediate needs and had fueled impressive early growth and business success, it wasn’t ready for the next phase.

One of my first hires at Bonial was Al Villegas, an experienced technologist whom I asked to focus on enterprise architecture.  He was a great fit as he had the right mix of broad systems perspective and a roll-up-his-sleeves / lead-from-the-front mentality.  He and I collaborated on big-picture “as-is” and “to-be” diagrams that highlighted the full spectrum of enterprise domains and showed clearly where we needed to invest going forward.   Fortunately we version and save the diagrams, so here are the originals:

Original 2014 “As Is” High Level Enterprise Architecture
Original “To Be” 2015 High Level Enterprise Architecture

These pictures served several purposes: (1) they gave us an anchor point for defining and prioritizing long-term platform initiatives, (2) they let us identify the domains that were misaligned, underserved or needed the most work, and (3) they gave every engineer additional context as they developed their solutions on a day-to-day basis.

Then the hard work started.  We would have loved to do everything at once, but given the realities of resource constraints and business imperatives we had to prioritize which runways to develop first.  As described in other articles of this series, we focussed early on our monitoring frameworks and breaking up the monolith.  In parallel we also started a multi-phase, long-term initiative to overhaul our tracking architecture and data pipelines.  Later we moved our software and data platforms to AWS in phases and adopted relevant AWS IaaS and SaaS capabilities, often modifying or greatly simplifying elements of the architecture in the process.  Across the span of this period, we continually refined and improved our APIs, moving to a REST-based, event-driven micro-services model from the dedicated/custom approach previously used. We also invested in an SDLC runway, building tools on top of the already mature devops capabilities to further accelerate the development process. 

The end result is a massive acceleration effect.  For example, we recently implemented a first release of a complex new feature involving sophisticated machine-learning personalization algorithms, new APIs and major UI changes across iOS, Android and web.  The implementation phase was knocked out in a couple of sprints.  How?  In part because the cross-functional team had available a rich toolbox of capabilities that had been laid down as part of the architecture runway: REST APIs, a flexible new content publishing system, a massive data-lake with realtime streaming, a powerful SDLC / staging system that made spinning up new production systems easy, etc.  The absence of any of these capabilities would have added immensely to the timeline.

The architecture continues to evolve.  We’ve recently added realtime machine learning and AI capabilities as well as integrations with a number of external partners, both of which have extended the architecture and brought both new capabilities and new (and welcome) challenges.  We are continually updating the “as is” picture, adapting the architecture strategy to match the needs of the business, and investing into new runway.

And the cycle continues.

Closing Thoughts

  • Companies should start with a simple single solution – that’s fine, it’s important to live to fight another day.  But eventually you’ll need a defined architecture and runway.
  • Start with a “big picture” to give everyone context and drill down from there.
  • Don’t forget the business systems: sales force automation, order management, CRM, billing, etc.  As much as everyone likes to focus on product delivery, it’s the enterprise systems that run the business.
  • Create a long-term architectural vision to help guide the big, long-term investments.

Accelerated Velocity: Enabling Independent Action

Inefficiency drives me crazy.  It’s like fingernails on a chalkboard.  When I’m the victim of an inefficient process, I can’t help but stew on the opportunity costs and become increasingly annoyed.  This sadly means I’m quite often annoyed, since inefficiency seems to be the natural rest state for most processes.

There are lots of reasons why inefficiency is the norm, but in general they fall into one of the following categories:

1) Poor process design

2) Poor process execution

3) Entropy and chance

4) External dependencies

The good news in software development is that Lean/agile best practices and reference implementations cover process design (#1).  Process execution (#2) can likewise be helped by hiring great people and following agile best practices.  Entropy (#3) can’t, by definition, be eliminated but the effects can be mitigated by addressing the others effectively.

Which leaves us with the bane of efficient processes and operations: dependencies (#4). 

Simply put, a dependency is anything that needs to happen outside of the process/project in question in order for the process/project to proceed or complete.  For example, a software project team may require an API from another team before it can finish its feature.  Likewise a release may require certification by an external QA team before going to production.  In both cases, the external dependency is the point where the process will likely get stuck or become a bottleneck, often with ripple effects extending throughout the system.  The more dependencies, the more chances for disruption and delay.
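A bit of back-of-the-envelope arithmetic shows why the number of dependencies matters so much. The sketch below rests on a simplifying assumption that is not from the text: each external dependency is delivered on time independently, with the same probability. The function name and the 90% figure are purely illustrative.

```python
def on_time_probability(p_each: float, n_dependencies: int) -> float:
    """Chance that every dependency lands on time.

    Assumes the dependencies are independent and equally reliable,
    which is a simplification made for illustration only.
    """
    return p_each ** n_dependencies


if __name__ == "__main__":
    # Illustrative assumption: each dependency is on time 90% of the time.
    for n in (1, 3, 5, 10):
        print(f"{n:>2} dependencies: {on_time_probability(0.9, n):.0%} chance of no delay")
```

Under these assumptions, a project with five dependencies that are each 90% reliable has only about a 59% chance of proceeding without a delay, and with ten dependencies the chance falls to roughly 35%.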

So how does one reduce the impact of dependencies?

The simplest way is to remove the dependencies altogether.  Start by forming teams that are self-contained, aligned behind the same mission, and ideally report to the same overall boss.  Take, for example, the age-old divisions between product, development, QA, and operations.  If these four groups report to different managers with different agendas, then the only reasonable outcome will be pain.  So make it go away!  Put them all on the same team. Get them focussed on the same goals.  Give them all a stake in the overall success.

Second, distribute decision making and control.  Any central governance committee will be a chokepoint, and should only exist when (a) having a chokepoint is the goal, or (b) the stakes are so high that there are literally no other options.  Otherwise push decision-making into the teams so that there is no wait time for decisions.  Senior management should provide overall strategic guidance and the teams should make tactical decisions.  (SAFe describes it well here.)

In 2014, Bonial carried a heavy burden of technical and organizational dependencies and the result was near gridlock. 

At the time, engineering was divided into five teams (four development teams and one ops team), and each team had integrated QA and supporting ops.  So far, so good.  Unfortunately, the chokepoints in governance and the technical restrictions imposed by a shared, monolithic code-base effectively minimized independent action for most of the teams, resulting in one, large, inter-connected mega-team.

There was a mechanism known as “the roadmap committee” which was nominally responsible for product governance, but in practice it had little to do with roadmap and more to do with selective project oversight.  One of the roadmap committee policies held that nothing larger than a couple of days was technically allowed to be done without a blessing from this committee, so even relatively minor items languished in queues waiting for upcoming committee meetings.   

What little did make it through the committee ran directly into the buzzsaw of the monolith.  Nearly all Bonial software logic was embedded in a single large executable called “Portal3”.  Every change to the monolith had to be coordinated with every other team to ensure no breakage.  Every release required a full regression test of every enterprise system, even for small changes to isolated components.   This resulted in a 3-4 day “release war-room” every two weeks that tied down both ops and the team unfortunate enough to be on duty. 

It was painful.  It was slow.  Everyone hated it.

We started where we had to – on the monolith.  Efforts had been underway for a year or more to gradually move functionality off of the beast, but it became increasingly clear with each passing quarter that the “slow and steady” approach was not going to bear fruit in a timeframe relevant to mere mortals. So our lead architect, Al, and I decided on a brute force approach: we assembled a crack team which took a chainsaw to the codebase, broke it up into reasonably sized components, and then put each component back together. Hats off to the team that executed this project – wading through a spaghetti of code dependencies with the added burden of Grails was no pleasant task.  But in a few months they were done and the benefits were felt immediately.

The breakup of the monolith enabled the different teams to release independently, so we dropped the “integrated release” process and each team tested and released on their own.  The first couple of rounds were rough but we quickly hit our stride.  Overall velocity immediately improved upon removing the massive waste of the dependent codebase and labor-intensive releases.
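To make the difference concrete, here is a minimal sketch of how the testing scope shrinks once components declare their dependencies explicitly. In a monolith every change implies retesting everything. With separated components, only the changed component and whatever depends on it need attention. The component names and the graph are hypothetical and do not describe Bonial’s actual systems.

```python
from collections import deque


def components_to_retest(changed, dependents):
    """Return the changed component plus everything that transitively depends on it."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical graph: each key maps to the components that depend on it.
dependents = {
    "content-api": ["web-site", "mobile-api"],
    "web-site": [],
    "mobile-api": [],
    "batch-jobs": [],
}

if __name__ == "__main__":
    print(sorted(components_to_retest("content-api", dependents)))
    print(sorted(components_to_retest("batch-jobs", dependents)))
```

In this toy graph, a change to the batch jobs touches nothing else, so the owning team can test and release on its own, while a change to a shared API still pulls in its consumers.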

The breakup of the monolith also untethered the various team roadmaps, so around this time we aligned teams fully behind discrete areas of the business (“value streams” in SAFe parlance). We pushed decision making into the teams/streams, which became largely responsible for the execution of their roadmap with guidance from the executive team.  The “roadmap committee” was disbanded and strategic planning was intensified around the quarterly planning cycle.   It was, and still is, during the planning days each quarter that we identify, review and try to mitigate the major dependencies between teams.  This visibility and awareness across all teams of the dependency risk is critical to managing the roadmap effectively.

Eventually we tried to take it to the next level – integrating online marketing and other go-to-market functions into vertically aligned product teams – but that didn’t go so well.  I’ll save that story for another day.

The breakup of the monolith and distribution of control probably had the biggest positive impact in unleashing the latent velocity of the teams.  The progress was visible.  As each quarter went by, I marveled at how much initiative the teams were showing and how this translated into increased motivation and velocity. 

To be sure, there were bumps and bruises along the way.  Some product and engineering leaders stepped up and some struggled.  Some teams adapted quickly and some resisted.  Several people left the team in part because this setup required far more initiative and ownership than they were comfortable with.  But in fairly short order this became the norm and our teams and leaders today would probably riot if I suggested going back to the old way of doing things.

Some closing thoughts:

  • Organize teams for self-sufficiency and minimal skill dependencies
  • Minimize or eliminate monoliths and shared ownership
  • Keep the interface as simple, generic and flexible as possible when implementing shared systems (e.g. APIs or backend business systems) 
  • Build transparency about dependencies and manage them closely