Rules of Effective Recruiting

The proverbial needle in a haystack…

I’ve frequently been asked variations of the question: “How does one hire the right people?”  While I by no means have a perfect record in this area, over time I’ve developed some heuristics that generally let me recruit good people (or at least avoid not-so-good people).

Rule 1: Hire for the Future

The typical hiring process resembles detective work, with the central focus being a deep forensic dive into what a person has done in the past.  In the end, one is left with a bucketful of self-reported “facts” and a gut feeling about whether the person might be a good fit. This is a lousy way to recruit.

Why?  Simple – you’re looking in the wrong direction.  When recruiting you should be trying to uncover what the person will do for you in the future, not what they’ve done in the past.

That isn’t to say that the past has no relevance.  One should use past performance and experience as a rough indicator of potential future performance, but only if past performance can be truly known.  Most of the time you only have the candidate’s word and whatever you can read between the lines on the resume. (Ironically this plays into the hands of people who have learned to interview well but are otherwise incompetent.  Beware – there’s more of them than you might think.)

So focus on the future.  How? By asking the candidate to think and solve problems.  As information workers this is what we do every day, so you should see how they do it.  Give them case studies and assess how they approach the problem. Observe how they deal with ambiguity (another thing we deal with every day).  Get a feel for their analytical skills. See how they deal with challenges thrown at them – are they confident and collaborative, or do they bristle and push back?

Some challenges I’ve used in the past for engineering candidates include:

  • Design a [insert design challenge here – usually something like an online game or other complex scenario].  Show me on a blackboard.
  • Here’s a [insert business problem here].  How would you go about developing a solution?  Show me on a blackboard.
  • You’re starting a company tomorrow and you head up engineering.  How will you set up your SDLC?  Show me on a blackboard.
  • The system is struggling and here are the symptoms [insert symptoms].  How would you go about identifying and resolving the root cause?  Show me on a blackboard.
  • Here’s some data representing [insert dataset description].  What does the data tell you about the situation?  How would you solve the problem?

One amusing anecdote – I was asked to meet with a candidate for a senior engineering role.  It was mostly a courtesy interview – the team was enamored of his magnetic personality and was ready to hire him.  My job was nominally to “sell” him on our company.  But just for good measure I wanted to get a sense of his analytical ability.  As we were meeting in a lounge without any blackboards, I made up an abstract question rather than a specific engineering-domain question: “How would you turn over that coffee cup sitting in front of you without touching it with any part of your body?”  I thought this was a softball question – in formulating it I’d already come up with several ideas ranging from crazy to practical.  He just stared at me.  And stared.  And stared until it was long past the point of being uncomfortable.  Eventually he said he wasn’t prepared for that kind of question and had no answer.  We didn’t hire him.  (Note: I’m not a fan of “brain teaser” interviews so this isn’t how I normally operate, but it’s telling when someone completely freezes over a simple abstract problem.)

Rule 2: Never Settle

The most expensive thing any company can do is hire unqualified or incompatible people.  Take Germany, for example – you spend 3 months recruiting, then another 3 months waiting for the person to end their old contract.  Then 2-3 months pass before you realize they may not be a good fit. Another few months of remediation go by before you let them go. Then another 3 months of recruiting… Your bad hiring choice basically leaves you without a person in that role for a year or more.

Never settle.  Period. Hire only people as good or better than the average person on the broader team already.  It’s way better to have an empty seat than a seat filled (or unfilled) with a problem.

Rule 3: Recruit for Talent, Not Skills

Any smart, motivated, analytical person can learn specific tools or languages.  Not everyone who knows specific tools or languages is smart, motivated or analytical.  So weight your decision heavily toward raw talent over specific skills when hiring a permanent team member.  This means assessing their general problem-solving ability and broad understanding of core computer science concepts.  It’s fine to also look at their specific skills, but ask yourself whether gaps in any of those skills really should drive a “no-go” decision.  Some of the very best engineers I’ve worked with or hired knew very little about the domain or specific programming languages at the start, but ramped up quickly and became superstars as a result of their raw talent.

Of course, if you need a specific skill now – knowledge of a particular software package, for example – then by all means make that required.  In that case, though, you should consider bringing on a contractor or freelancer until you have the right permanent team member ready to take on the role.

Rule 4: Key Indicators

Not all factors should be considered equal when evaluating a candidate.  Being human, we tend to overweight “personality,” “self-confidence” and other soft attributes.  These may be important, but they are often weak indicators of future performance and team fit. You need to be intentional about down-weighting those factors and staying focused on the critically important attributes.  These include:

  • Accountability.  People who lack accountability and ownership will fail.  My experience is that low ownership correlates almost 100% with a bad outcome.  Avoid these people like the plague. How do you gauge accountability? Ask questions about projects or other initiatives that didn’t go so well and see if they take some of the responsibility.  Ask what they’d do in the future to get a different outcome and listen for whether they share responsibility for the improvements. Role-play a difficult challenge that offers plenty of excuses and see if they take one.  Be creative.
  • Problem solving.  Creative analytical abilities are the foundation of any problem solver, and problem solving is what we do.  To do this you just need to observe them solving problems. Put them on a blackboard and give them open-ended questions.   
  • Integrity.  If you can’t trust them 100% then you definitely don’t want them on the team.  This is tough since only truly incompetent people will obviously lie to you, but you can ask questions about their resume and drill down a bit in areas where you have expertise to see if they represent themselves accurately.  You can ask questions multiple times with subtle variations to see if the answer changes. You can always verify references or facts listed in the resume. 
  • “Fire in the Belly”.  You want people who will show up on day 1 and apply themselves 100%.  This is sometimes tough to discover in interviews, so it’s often something you have to assess early during onboarding.
  • Respect/Arrogance.  Will they treat their teammates with respect or will they lord their excellence over everyone?  You don’t want them on your team if the latter. No one likes working with assholes. Most likely they’ll defer to you, the hiring manager, but watch how they interact with the receptionist when they arrive, waitstaff when you take them to lunch, or anyone else not involved in the process.  You can also get a glimpse of this if you challenge them hard during an analytical exercise and see how they respond under pressure. 

These five (and I’m sure there are more) are strong “fail” signals.  If the person is lacking any of these attributes it’s highly likely they’ll fail as a colleague.  

The inverse, however, is not necessarily true.  Someone who possesses these attributes may still not be a great team member, but at least they have a good head start.  So use these as a filter gate.

Rule 5: Group Think

So, who decides if a candidate is right for the team?  One common approach is to have a hiring manager make the call.  It’s simple. It empowers the hiring manager. But in the end it’s a bad idea.

Why?  For several reasons, but I’ll focus on two.  First, we all suffer from biases. We tend to overweight certain attributes (e.g. those most like us) and we each have our pet peeves.  We also see what we see in the frame of our past experience. All of this together means we’re susceptible to making the hiring decision on the wrong factors.

Second, the reality is that we’re hiring for the company as much as we’re hiring for a specific team.  Engineers move teams all the time, whether for their own career growth or because the mission changes. You need people who are not only a good fit for their team but for the organization as a whole.

You can solve both problems by not going it alone.  Get several people involved, including people from outside the team.  Group interviews are fine to save time, but make sure each interviewer has specific things they’re looking for.  At the end of the interview cycle take a confidence vote – I’d argue that you want unanimous agreement at 4 or better out of 5; otherwise you should probably pass on the candidate.
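As a sketch, the voting gate just described might look like this in code (the 1-to-5 scale, the 4-or-better bar and the unanimity requirement come from the text; the function name and signature are mine):

```python
def hire_decision(votes, threshold=4):
    """Unanimous confidence gate: every interviewer must rate the
    candidate at `threshold` or better (on a 1-5 scale) to proceed."""
    if not votes:
        raise ValueError("need at least one interviewer vote")
    return all(v >= threshold for v in votes)

# A single lukewarm 3 is enough to pass on the candidate.
print(hire_decision([5, 4, 4]))   # True – hire
print(hire_decision([5, 3, 5]))   # False – pass
```

The point of encoding it this way is that one skeptical interviewer blocks the hire, which is exactly the “never settle” posture from Rule 2.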

Accelerated Velocity: Doing the Right Things

Note: this article is the last part of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

So far in this series I’ve shared thoughts on how to do things right – how to leverage best practices and develop skilled practitioners to get excellent results.  Doing things right, however, doesn’t mean you’re doing the right things; it could mean you’re just doing the wrong things much faster.

The hard truth is that doing things right is easier than doing the right things.  The path to the former takes hard work but is relatively clear and straightforward.  The path to doing the right things is considerably more opaque and mysterious.  Just compare the number of books and blogs describing how to build software vs what to build with software to see the impressive gap between the two.

I’ve spent most of my career working to do both.  My primary responsibilities as an engineering leader have been to ensure the team is working effectively and efficiently.  But in my various executive and consulting roles I’ve had both the opportunity and obligation to be a thought leader in the areas of business, product and platform strategy.  Through these roles I’ve developed a deep respect for the challenges and upsides of choosing the right path.  I’ve also learned that an engineering leader who isn’t concerned with the question of “are we working on the right things” is doing their team a huge disservice.

There isn’t a formula or cookbook I’ve discovered that guarantees success, but I’ve found several ingredients which radically improve your chances of doing the right things as an engineering team.  We do all of these – some better than others – at Bonial.

Data / Situational Awareness

You can’t make good decisions about where to invest if you don’t know what’s going on with your systems or your users.  In a previous article I discussed at length why this is critical and how Bonial developed situational awareness around system performance and stability. 

It’s just as important to know your users.  Note that I didn’t say, “know what your users are doing”.  That’s easy and only tells part of the story.  What you really want to know is “why” they are doing what they’re doing and, if possible, what they “want” to do in the future.  That’s tough and requires a multi-faceted approach.

For this you’ll want both objective and subjective data to create a complete picture.  Objective data will come from event tracking and visualization (e.g. Google Analytics or home-grown data platforms like the Kraken at Bonial).  Subjective data will come from usability studies, user interviews, app reviews, etc.  Combined, this data and intelligence should enable you to paint a pretty good picture of the user.

As with most things this too has its limits.  Data is inherently backward-looking.  It will tell you what users have done and what they have liked, but extrapolating that into the future is a tricky exercise.  Even talking to users about the future doesn’t help much since they are notoriously bad at predicting how their perspective will change when faced with new paradigms.   

So treat your data as guidance and not gospel, and constantly update the guidance.  Run experiments based on hypotheses derived from the historical data and challenge them with new data.  If the experiment is sound and validates the hypothesis you can move forward with relative confidence. 
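To illustrate the “experiment, then challenge with new data” loop above, here is a minimal significance check for a conversion experiment.  The two-proportion z-test is a standard technique, not something the text prescribes, and the sample numbers and ~95% confidence threshold are my own illustrative assumptions:

```python
from math import sqrt

def experiment_validates(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test: does variant B beat control A
    at roughly 95% confidence (one-sided)?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z > z_crit

# 20% vs 26% conversion on 1,000 users each: hypothesis validated.
print(experiment_validates(200, 1000, 260, 1000))  # True
```

A gate like this is what turns “trust the data” into practice: a hypothesis only graduates to a roadmap decision once the new data clears the bar.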

When in doubt, trust the data. 

ROI Focus

Building things for fun is everyone’s dream, and many teams succumb to the temptation.  Some succeed; most fail.  Considering return on investment (ROI) can help avoid this trap.  Teams that are ROI-focused ask themselves how the R&D investment will be paid back and, hopefully, also show that the payback was realized.  The desired result is a focus on those things that have the potential to matter most.

Great, right?

Maybe; there are pitfalls.  Modeling ROI is not easy, and the models themselves can be overly simple or (too often) complete crap.  The inverse is also true – people can spend so much time on the modeling that any benefit to velocity is lost.  It takes practice and time to find the right balance.

Some of the toughest ROI choices involve comparing features against non-functional requirements (NFRs) like stability, performance and technical debt.   An easy solution is to not beat your head against this “apples to oranges” problem; instead, give each team a fixed “time budget” for managing technical debt and investing in the architecture runway.  This will create some push-back in the short term (especially among product owners who want more capacity for features), but in the long term everyone will appreciate the increased velocity you’ll realize from making regular investments.  At Bonial we ask teams to allocate roughly 40% of their capacity to rapid response, technical debt reduction and architecture runway development.  That may seem like a lot, but if it makes the other 60% 7x faster, everyone wins.
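The capacity-budget arithmetic above works out as follows.  The 40%/60% split and the 7x figure are from the text; the function and the break-even calculation are my own framing:

```python
def effective_output(invest_fraction, speedup):
    """Relative output when `invest_fraction` of capacity goes to tech debt
    and runway work, and the remaining capacity runs `speedup` times faster."""
    return (1 - invest_fraction) * speedup

baseline = effective_output(0.0, 1.0)   # 1.0: no investment, no speedup
invested = effective_output(0.4, 7.0)   # 0.6 * 7 = 4.2x the baseline

# At a 40% budget, any speedup above 1/(1-0.4) ≈ 1.67x already pays off.
break_even = 1 / (1 - 0.4)
```

In other words, the investment doesn’t need anything close to a 7x payoff to be worthwhile: even a modest ~1.7x speedup on the remaining capacity breaks even.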

In the end, treat ROI as a guideline.  I think you’ll find that the simple act of asking people to think in these terms will elevate the conversations and make some tough decisions easier.


Business Context

The more people who know your business, the better.  Your engineers, testers, data scientists, operations specialists and designers each make dozens or hundreds of decisions a day, small and large, that affect the business.  Most of these decisions require them to extrapolate details from general guidance.  If they don’t understand the business – or, more specifically, the “why” behind the guidance – then there’s a good chance they’ll miss the mark on the details.

So take the time to explain the “why” of decisions.  Educate your people on business fundamentals.  Share numbers.  Answer their questions.  And, most important, be honest even if there’s bad news to share.  It’s better that they are armed with difficult facts than confused with half-truths and spin.  You’ll be surprised at how many people will respond positively to the respect you show them by being honest.

Calling Bullshit

Some companies work under a model in which engineering is expected to meekly follow orders from whoever is driving the product strategy.  This is foolish to the point of being reckless.  Some of the smartest people and most analytical thinkers in your company are in the R&D organization.  Why cut that collective IQ out of the equation?

Smart companies involve the engineering teams in ideation as well as implementation.  The best companies go one step further – they give engineering implicit control over what they build.  Product managers or other stakeholders have to convince engineering of their ideas; there is no dictatorial power.

Some may fear that this leads to a situation where the product authority becomes powerless or marginalized.  While I’ve seen a number of product teams that were largely side-lined, it was never because they weren’t given enough authority – it was because they didn’t establish themselves as relevant.  Good, competent product managers need to win over the engineers and stakeholders with demonstrated competence. 

At Bonial, the product team has the responsibility for prioritizing the backlog, but the engineering team has the responsibility for committing to and delivering the work.  This split gives a subtle but implicit veto to the engineering team.  Most of the time the teams are in sync, but at times the engineers call “bullshit” and refuse to accept work – usually due to an unclear ROI or a clear conflict with stated goals.  This creates some short-term tension, but over the long term it leads to healthy relationships between capable product managers and engaged engineering teams.

People who Think Right

My mentor used to say that, “some people think right, and some don’t.”  What he meant was that some people have a knack for juggling ambiguity; when faced with a number of possible choices, they are more likely than not to pick one of the better ones.  People who “think right” thrive in a leader-leader environment; people who don’t are dangerous.

Why?  Because after all the data has been collected, all of the models have been built and all of the (unbiased) input has been collected, decisions still need to be made.  More often than not there will be several options on the table.  Certainty will be elusive.  In the end there’s an individual making a choice using all of the analytic, intuitive, conscious and sub-conscious tools available to them.  Make consistently right decisions and you have a fair shot at success.  Make consistently wrong decisions and you’ll likely fail.

Some people are far better at making the right decisions.  These are the people you want in key roles.

The trick is how to best screen for these people.  At Bonial we use open-ended case studies and other “demonstrations of thought and work” during the recruiting process to get a glimpse of how people think.  We’ve found this to be very effective at screening out clear mismatches, but a short, artificial session can only go so far.  After that it’s a matter of observation during trial periods and, eventually, selection for fitness through promotions.

Closing Thoughts

“Doing the right things” is an expansive topic.  This article just scratches the surface; I could probably write a book on this topic alone.  Once you have the basics of SDLC execution in place – good people, agile processes, devops, architectural runway, etc. – the main lever you’ll have to drive real business value is in doing the right things.  Unfortunately this is much, much tougher than doing things right.  It very quickly gets into the messy realm of egos, politics, control, tribalism and the like.  But it can’t be avoided if you want to take your team to the next level. 

Good luck.

  • It’s not enough to “do things right” – you also have to “do the right things”, otherwise you may just be building the wrong things faster
  • Use data and ROI to guide your decisions
  • Put people who have context and “think right” in charge of key decisions
  • Engage the whole team and create checks and balances so bad ideas can’t be ramrodded through the process

Accelerated Velocity: Getting Uncomfortable

Note: this article is part 10 of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

“Confident.  Cocky.  Lazy.  Dead.”  This admonishment against complacency was the mantra of Johnny “Dread” Wulgaru, the villain in Tad Williams’ Otherland saga.  As true as it is for assassins like Dread, it’s also true (though perhaps not as literally lethal) for teams and companies that choose to rest on their laurels and stop challenging themselves.

Complacency is the enemy of innovation.  This has been proven over and over throughout history, in every domain, as once-successful or dominant players suddenly found themselves lagging behind.  It is also a leadership failure.  Good leaders strive to prevent comfort from becoming complacency.

Jeff Bezos baked this into Amazon’s DNA, as enshrined in his “Day 1” message to shareholders.  Of note is what Mr. Bezos says happens when companies get comfortable and settle into Day 2:

“Day 2 is stasis. Followed by irrelevance. Followed by excruciating, painful decline. Followed by death. And that is why it is always Day 1.”

Confident.  Cocky.  Lazy.  Dead.

It’s not easy, though.  Comfort is the reward for success after all.  “Don’t fix something that ain’t broke.”  Right?

Wrong.  The reward for success is being in position to hustle and build on that success.  Period. 

Getting Uncomfortable

It’s no different in software development teams. 

By 2016 we’d made significant strides in velocity and efficiency in Bonial product development.  Processes were in place.  An architecture roadmap existed.  Teams were healthy.  The monolith was (mostly) broken.  AWS was being adopted.  All signs pointed to very successful changes having taken place. 

At the same time I felt a certain sense of complacency settling in.  The dramatic improvements over the past couple of years had some thinking that it was now “good enough.”  Yet we still had projects that ran off the rails and took far longer than they should have.  We still couldn’t embrace the idea of an MVP.  We still had mindsets that change was dangerous and scary.  And, perhaps most important, many had a belief that we were as fast as we could or needed to be. 

Yes, we were better and faster, but I knew we had only begun to tap our potential.  It was time to get uncomfortable.

Engineering Change

Changing a deep-seated mindset in a large organization using a head-on approach is tough.  An easier and often more effective approach is to engineer and successfully demonstrate change in smaller sub-groups and spread out from there.  Once people see what is possible or, better yet, experience it themselves, they tend to be quite open to change. 

So we looked for opportunities to challenge individual teams to “think different”.  For example, on several occasions the company needed an important feature insanely fast.  Rather than say no, we asked teams to work in “hackathon mode” – essentially, to do whatever it took to get something to market in a few days even if the final solution was wrapped in duct tape and hooked to life support.  Not surprisingly we usually delivered and the business benefitted massively.  Yes, we then had to spend time refactoring and hardening to make the solution really stable, but the feature was in the market, business was reaping the benefits and the teams were proud of delivering fast.

On another occasion we had a team that struggled with velocity due, in part, to lack of test automation and an over-reliance on manual testing.  So I challenged the team to deploy the next big feature with zero manual testing; they had to go to production with only automated tests.  This made them very uncomfortable.  I told them I had their backs if it didn’t work out – they only needed to give it their best shot.  To their credit, the team signed up for the challenge and the release went out on time and had no production bugs.  This dramatic success made a strong statement to the rest of the organization.

Paradigm Shift

We also took advantage of our new app ecosystem.  Over the past few years the company had started several new “incubation” initiatives to explore new possibilities and expand our product portfolio.  We didn’t want to run these initiatives with our core product development teams because (a) we didn’t want to be continually wrestling with whether to focus on the new or old products, and (b) we feared that working the way the “core” teams did would be too slow.

So we spun out standalone teams with all of the resources needed to operate independently.  Not surprisingly, these “startup” teams moved much faster than any of our core teams.  In part this was because they were not burdened by legacy systems, technical debt, and the risk/exposure of making mistakes that affect millions of users.  But I think the bigger part was sheer necessity.  We ran our incubation projects like mini startups – they received funding, a target and a timeline, and they had to hit those targets (or at least show significant progress) in order to receive more funding.  As a result, the teams were intensely focused on delivering MVPs as quickly as possible, measuring the results in the market, and pivoting if needed.

Between 2015 and 2017 we ran three major incubation projects and each one was faster than the last.  The most recent, Fashional, went from funding to launch in less than 12 weeks, which included ramping up development teams in two countries, building web and native mobile apps and lining up initial partners and marketing launch events.  This proven ability to move fast made a strong statement to our other teams. 

We soon had “core” teams making adjustments and shifting their thinking.  Over the next few quarters, every team embraced a highly iterative, minimalistic approach to delivery that enabled us to try more things more quickly and, when needed, take more aggressive risks.  Now each team strives to deliver demonstrable, user-facing value every sprint.  Real value, not abstract progress.  Just like the agile book says.  This isn’t easy, but it’s fundamentally required to drive minimalistic, iterative thinking.  The result is a dramatic improvement in velocity while having more fun (success is fun).

For sure this hasn’t been perfect.  Even today we still have teams that struggle to plan and deliver iteratively and we still have projects that take way too long.  On the flip side we have a much deeper culture of challenging ourselves, getting uncomfortable and continually improving. 


Closing Thoughts

  • It’s easy to become complacent, especially after a period of success.  This is deadly.
  • Leaders must act to remove complacency and force themselves and their teams to “get uncomfortable” and push their own limits.
  • Break the problem into smaller chunks.  Work with entrepreneurial teams on initiatives that challenge the status quo.  Have them show the way.
  • Reward and celebrate success and make sure you have the team’s back.  Honor your commitments.

Accelerated Velocity: Optimizing the SDLC with DevOps

Note: this article is part 9 of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

Anyone who’s run a team knows there’s always more to do and pressure to do even more than that.  Most managers respond by asking for more workers and/or making the existing workers put in more hours.  I’ve been there myself, and I learned the hard way that there is a better, faster and cheaper way: focus on efficiency first.  Let’s look at why.

For clarity and simplicity, I’ll frame this in a simple mathematical representation:

[ total output ] = [ # workers ] x [ hours worked ] x [ output / worker-hour (i.e. efficiency) ]

In other words, output is the product of the number of people working, the amount of time they work, and their efficiency.  So if we want to increase total output, we can:

  1. Increase the number of workers.
  2. Increase hours worked.
  3. Increase efficiency.

Pretty straightforward so far, but when we start pulling the threads we quickly discover that the three options are not equal in their potential impact.  The problem with #1 is that it is expensive, has a long lead time and, unless the organization is very mature, generally results in a sub-linear increase in productivity (since it also adds additional management and communication load).  #2 may be ok for a short time, but it’s not sustainable and will quickly lead to a significant loss in efficiency as people burn out, start making expensive mistakes and eventually quit (which exacerbates #1). 

That leaves us with #3.  Efficiency improvements are linear, have an immediate impact and are relatively cheap when compared to the other options.  This is where the focus should be.

Really?  It can’t be that simple…

It is.  Let’s look at a real-world example.  You have a 20-person software development team.  To double the output you could hire an additional 22-25 FTEs (Full Time Equivalents) and start seeing increased velocity in maybe 3-6 months.  (Why not just 20?  Because you also need to hire more managers, supporting staff, etc.  You also have to account for the additional burden of communication.  That’s why this is non-linear.)

You could ask them to work twice as many hours, but very quickly you’ll find yourself processing the flood of resignation letters.  Let’s cross this off the list.

Or you could ask each person on the team to spend 10% of their time focusing on tools, techniques, frameworks and other boosts to efficiency.  If done right, you’ll start seeing results right away, and in short order you can cut the development cycle time by as much as half.  In effect you’ve doubled efficiency (and therefore output) for the equivalent of 2 FTE (20 people x 10%).  In economic terms, that’s 10:1 leverage.
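Putting the worked example above into code (the numbers come straight from the text; the “FTE-equivalent” framing of the gain is mine):

```python
team_size = 20
efficiency_time = 0.10                        # 10% of everyone's time on tooling
fte_invested = team_size * efficiency_time    # 2 FTE worth of effort

# Cutting cycle time in half doubles output: a gain worth 20 FTE of work.
output_gain_fte = team_size * (2.0 - 1.0)

leverage = output_gain_fte / fte_invested     # 20 / 2 = 10, i.e. 10:1
print(f"{leverage:.0f}:1 leverage")           # prints "10:1 leverage"
```

Compare that with hiring: 22-25 extra FTE for the same doubling, i.e. roughly 1:1 leverage at best, plus months of lead time.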

Not bad.  This is why I recommend always focusing on efficiency first.

(This isn’t to say that growing the team is the wrong thing to do – Google wouldn’t be doing what Google does with 1/10 the engineers – but this is an expensive and long-term strategy. Team growth is not a substitute for an intense focus on efficiency.  Adding a bunch of people to an inefficient setup is a good way to spend a lot of money with low ROI.)

So what would the team focus on to boost efficiency?  The list is long and includes both technical (reducing build times, Don’t Repeat Yourself (DRY), etc.) and non-technical (stop wasting time in stupid meetings) topics.  All are valid and should be addressed (especially stupid meetings and time management), but in this article I’m going to focus on leverage through automation and software driven infrastructure; in other words: devops.


Devops

Ask ten people and you’ll get ten different answers as to what devops is.  I’m less interested in dogmatic purity and more in the foundational elements of devops that drive the benefits, which to me are:

  1. Tools: an intense focus on increased efficiency through DRY / automation.
  2. Culture: increased efficiency through eliminating arbitrary and stupid organization boundaries.

Bonial understood this.  When I arrived in 2014, nearly all of the major build and deployment functions were fully scripted and supported by a fully integrated and aligned “devops” team (though it wasn’t called that at the time), which had even gone so far as to enable interactions with the scripts via a cool chat bot named Marvin.  Julius, Alexander and others on the devops team were wizards with automation and were definitely on the cutting edge of this evolving field.  For the most part we had a full CI capability in place.

Unfortunately, further gains were largely blocked by code and environmental constraints.  No amount of devops can solve problems created by code monoliths and limited hardware.  So, as described in other articles in this series, we invested heavily in breaking up the monolith.  It was painful, but it opened up many pathways for team independence, continuous delivery and moving to cloud infrastructure.

After we’d broken apart and modularized the code monolith, we moved into AWS, which created further challenges.  On one hand we wanted everybody to make full use of the cloud’s speed and flexibility.  On the other hand it was important to ensure governance processes for e.g. cost control and security.  We balanced those requirements with infrastructure as code (IaC): we standardized on a few automation platforms (Spinnaker, Terraform, etc.) but let teams customize their process to meet their needs.  At this point, the central “devops” team became both a center of excellence and a training and mentoring group.

Our foundation in automation enabled us to very rapidly embrace and adopt IaaS and explore serverless and container approaches.  It took some time to settle on which automation frameworks would best meet our needs, but once that was done we could spin up entire environments in minutes, starting with only some scripts and ending up with a fully running stack.  Given the sheer speed of changes, adopting a “You Own It, You Run It” (YOIYRI) approach and moving more responsibilities into the teams came naturally.  All those changes took us to a whole new level.


The natural next step was to address one of the most painful and persistent bottlenecks in our previous software development lifecycle (SDLC) – development and test environments (“stages”).  Previously, all of the development teams had had to share access to a handful of “stage” environments, only one of which was configured to look somewhat like production.  This created constant conflict between teams jockeying for stage access, and no-win choices for me on which projects to give priority access. 

Automation on AWS changed the game for the SDLC. Now instead of a single chokepoint throttling every team, each team could spin up its own environment and we could have dozens of initiatives being developed and tested in parallel.  (We also had to learn to manage costs more effectively, especially when people forgot to shut off their stage, but that’s another story…) 
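
The cost guardrail hinted at above can itself be automated.  A sketch of the idea, with invented stage names and an illustrative idle threshold: flag any stage that hasn't been used recently so it can be shut down automatically.

```python
from datetime import datetime, timedelta

# Illustrative cutoff: stages idle longer than this are shutdown candidates.
IDLE_CUTOFF = timedelta(hours=12)

def stages_to_stop(stages, now):
    """Return names of stages whose last activity is older than the cutoff."""
    return [name for name, last_used in stages.items()
            if now - last_used > IDLE_CUTOFF]

now = datetime(2018, 6, 1, 18, 0)
stages = {
    "team-web-feature-x": datetime(2018, 6, 1, 17, 30),  # recently active
    "team-api-forgotten": datetime(2018, 5, 30, 9, 0),   # idle for days
}

print(stages_to_stop(stages, now))  # ['team-api-forgotten']
```

A scheduled job running this kind of check turns "somebody forgot to shut off their stage" from a budget surprise into a non-event.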

We still had the challenge of dependencies between teams – e.g. the web team might need a specific version of an API owned by another team in order to test a new change.  One of our engineers, Laurent, solved this by creating a central repository and web portal (“Stager”) that any developer could use to spin up a stage with any needed subsystems running inside.  
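
The core of a “Stager”-style tool can be sketched as a transitive-closure computation: given the subsystem a developer wants to test, collect every subsystem it depends on so the whole set can be started inside one stage.  The dependency graph below is invented for illustration, not Bonial's actual topology.

```python
# Hypothetical subsystem dependency graph.
DEPENDENCIES = {
    "web": ["content_api", "user_api"],
    "content_api": ["database"],
    "user_api": ["database"],
    "database": [],
}

def subsystems_for_stage(target):
    """Return the target subsystem plus all of its transitive dependencies."""
    needed = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(DEPENDENCIES[name])
    return needed

print(sorted(subsystems_for_stage("web")))
# ['content_api', 'database', 'user_api', 'web']
```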

As it stands today, the “Stager” subsystem has become a critical piece of our enterprise; when it’s down or not working properly, people make noise pretty quickly.  As befits its criticality, we’ve assigned dedicated people to focus purely on ensuring that it’s up and running and continually evolving to meet our needs.  Per the math above, the leverage is unquestionable and investing in this area is a no-brainer.

Closing Thoughts

  • It’s simple: higher efficiency = more output
  • Automation leads to efficiency
  • Breaking down barriers between development and ops leads to efficiency
  • Invest in devops

Many thanks to Julius for both his contributions to this article and, more importantly, the underlying reality.

Accelerated Velocity: Creating an Architectural Runway

Note: this article is part 8 of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

Most startups are, by necessity and by design, minimalistic when it comes to feature development.  They build their delivery stack (web site or API), a few tools needed to manage delivery (control panel, CMS) and then race to market and scramble to meet customer requests.  Long term architecture thinking is often reduced to a few hasty sketches and technical debt mitigation is a luxury buried deep in the “someday” queue. 

At some point success catches up and the tech debt becomes really painful.  Engineers spend crazy amounts of time responding to production issues which they could have used to develop new capabilities.  New features take longer and longer to implement.  The system collapses under new load.  At this point tweaks won’t save the day.  An enterprise architecture strategy and runway is needed.

What is an architecture runway?  In short, it’s a foundational set of capabilities, aligned to the big-picture architecture strategy, that enables rapid development of new features.  (SAFe describes it well here.)  In plain English – it’s investing in foundational capabilities so features come faster.

The anchor of the architecture runway is, of course, the architecture itself.   I’m not going to wade into the dogmatic debate about “what is software architecture”; rather, I’ll simply state that a good architecture creates and maintains order and adaptability within a complex system.  The architecture itself should be guided by a strategy and long-term view on how the enterprise architecture will evolve to meet the needs of the business in a changing market and tech-space.   

In developing an architecture strategy and runway, architects should start with the current state. At the very least, create a simple diagram that gives context to everyone on the team as to what pieces and parts are in the system and how they play together.   Once the “as is” architecture is identified and documented, the architects can roll up their sleeves and develop the “to be” picture, identify the gaps between the two states, and then develop a strategy for moving towards the “to be”.  The strategy can be divided into discrete epics / projects, and construction of the runway can begin.
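
At its simplest, the gap-analysis step described above reduces to comparing two capability inventories.  A toy sketch (the domain names are invented, not from Bonial's actual diagrams):

```python
# Capability inventories extracted from the "as is" and "to be" pictures.
as_is = {"web delivery", "mobile api", "content publishing", "batch jobs"}
to_be = {"web delivery", "mobile api", "content publishing",
         "data pipeline", "experimentation platform"}

to_build = sorted(to_be - as_is)   # capabilities the runway must add
to_retire = sorted(as_is - to_be)  # pieces with no place in the target state

print(to_build)   # ['data pipeline', 'experimentation platform']
print(to_retire)  # ['batch jobs']
```

Real gap analysis is of course richer than set difference – capabilities overlap, migrate and change shape – but the discipline of naming both inventories explicitly is where the strategy work starts.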

Bonial’s Architecture Runway

Success had caught up with Bonial in 2014.  Given the alternative, I think everyone would agree that that’s the right problem to have, but it was a problem nonetheless.  The majority of the software was packaged into a single, huge executable called “Portal3,” which contained all of the business logic for the web sites, mobile APIs, content publishing system and a couple dozen batch jobs.  There were a few ancillary systems for online marketing and some assorted scripts, but they were largely “rogue” projects which didn’t add to the overall enterprise coherence.  While this satisfied the immediate needs and had fueled impressive early growth and business success, it wasn’t ready for the next phase.

One of my first hires at Bonial was Al Villegas, an experienced technologist whom I asked to focus on enterprise architecture.  He was a great fit, as he had the right mix of broad systems perspective and a roll-up-his-sleeves / lead-from-the-front mentality.  He and I collaborated on big-picture “as-is” and “to-be” diagrams that highlighted the full spectrum of enterprise domains and showed clearly where we needed to invest going forward.   Fortunately we versioned and saved the diagrams, so here are the originals:

Original 2014 “As Is” High Level Enterprise Architecture
Original “To Be” 2015 High Level Enterprise Architecture

These pictures served several purposes: (1) they gave us an anchor point for defining and prioritizing long-term platform initiatives, (2) they let us identify the domains that were misaligned, underserved or needed the most work, and (3) they gave every engineer additional context as they developed their solutions on a day-to-day basis.

Then the hard work started.  We would have loved to do everything at once, but given the realities of resource constraints and business imperatives we had to prioritize which runways to develop first.  As described in other articles of this series, we focused early on our monitoring frameworks and breaking up the monolith.  In parallel we also started a multi-phase, long-term initiative to overhaul our tracking architecture and data pipelines.  Later we moved our software and data platforms to AWS in phases and adopted relevant AWS IaaS and SaaS capabilities, often modifying or greatly simplifying elements of the architecture in the process.  Across the span of this period, we continually refined and improved our APIs, moving to a REST-based, event-driven micro-services model from the dedicated/custom approach previously used. We also invested in an SDLC runway, building tools on top of the already mature devops capabilities to further accelerate the development process. 

The end result is a massive acceleration effect.  For example, we recently implemented a first release of a complex new feature involving sophisticated machine-learning personalization algorithms, new APIs and major UI changes across iOS, Android and web.  The implementation phase was knocked out in a couple of sprints.  How?  In part because the cross-functional team had available a rich toolbox of capabilities that had been laid down as part of the architecture runway: REST APIs, a flexible new content publishing system, a massive data-lake with realtime streaming, a powerful SDLC / staging system that made spinning up new production systems easy, etc.  The absence of any of these capabilities would have added immensely to the timeline.

The architecture continues to evolve.  We’ve recently added realtime machine learning and AI capabilities as well as integrations with a number of external partners, both of which have extended the architecture and brought both new capabilities and new (and welcome) challenges.  We are continually updating the “as is” picture, adapting the architecture strategy to match the needs of the business, and investing into new runway.

And the cycle continues.

Closing Thoughts

  • Companies should start with a simple, single solution – that’s fine; it’s important to live to fight another day.  But eventually you’ll need a defined architecture and runway.
  • Start with a “big picture” to give everyone context and drill down from there.
  • Don’t forget the business systems: sales force automation, order management, CRM, billing, etc.  As much as everyone likes to focus on product delivery, it’s the enterprise systems that run the business.
  • Create a long-term architectural vision to help guide the big, long-term investments.

Accelerated Velocity: Clarifying Processes and Key Roles

Note: this article is part 7 of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

In a previous article I argued that great people are needed in order to get great results. To be clear, this theorem is asymmetric: great people don’t guarantee great results. Far from it – the history of sports, business and military is littered with the carcasses of “dream teams” that miserably underperformed.

No, there are several factors that need to be in place for teams to excel. The ability to take independent action, discussed in the previous article, is one of those factors. I’ll discuss others over the next few articles, starting here with clarity around processes and roles.

Even the best people have trouble reaching full potential if they don’t know what’s expected of them. True, some people are capable of jumping in and defining their own roles, but this is rare. Most will become increasingly frustrated, not knowing on any given day what’s expected of them and what they need to do to succeed.

People in teams also need to understand the conventions for how best to work with others. How they plan, collaborate, communicate status, and manage issues all play a part in defining how effective the team is. Too much, too little, or too wrong, and a high potential team will find itself hobbled.

The same applies to teams and teams-of-teams. Teams need clarity about their role within the larger organization. They also need common processes to facilitate working together in pursuit of common goals.

Popular software development methodologies provide the foundation for role and process clarity, with the “agile” family of methodologies being the de-facto norm. These frameworks typically come with default role definitions (e.g. scrum master, product owner) as well as best practices around processes and communications. When applied correctly they can be powerful force multipliers for teams, but adopting agile is not a trivial exercise.  In addition, these frameworks only cover a portion of the clarity that’s needed.

Bonial’s Evolution

Bonial in 2014 was maturing as an agile development shop, but there were gaps in role definitions, team processes, and inter-team collaboration that suppressed the team’s potential. Fortunately Bonial has always had an abundance of kaizen – restlessness and a desire to always improve – so people were hungry to change. No-one was particularly happy with the status quo and there was a high willingness to invest in making things better.

We rolled up our sleeves and got started…

We attacked this challenge along multiple vectors. First, we needed a process methodology that would not only guide teams but also provide tools for inter-team coordination and portfolio management. The product and engineering leadership teams chose the Scaled Agile Framework (SAFe) as an over-arching team, program and portfolio management methodology. It was not the perfect framework for Bonial but it was good enough to start with and addressed many of the most pressing challenges.

Second, we spent time more clearly defining the various agile roles and moving responsibilities to the right people. We started with the very basics as broken down in the following table:

Area of Responsibility | Role Name (Stream / Team) | Notes
What? | Product Manager, Product Owner | Ensures that the team is “Doing the right things”
Who and When? | Engineering Manager, Team Lead | Ensures that the team is healthy and “Doing things right” while minimizing time to market
How? | Architect, Lead Developer | Ensures that the team has architectural context and runway and is managing tech debt

We created general role definitions for each position, purposely leaving space and flexibility for the people and teams to adapt as appropriate.  (I know many agile purists will feel their blood pressure going up after reading the table above, but I’m not a purist and this simplicity was effective in getting things started.) 

A quick side note here. One of the unintended consequences of any role definition is that they tend to create boxes around people. They become contracts where responsibilities not explicitly included are forbidden and the listed responsibilities become precious territory to guard and protect. I hate this, so I emphasized strongly that (a) role definitions are guidelines, not hard rules, and (b) the responsibility for mission success lies with the entire team, so it’s ok to be flexible so long as everything gets done.

Third, we augmented the team. We hired an experienced SAFe practitioner to lead our core value streams, organize and conduct training at all levels, and consult on best practices from team level scrum to enterprise level portfolio management. This was crucial; the classroom is a great place to get started, but it’s the day-to-day practice and reinforcement that makes you a pro.

Finally, we placed a lot of emphasis on retrospectives and flexibility. We learned and continually improved. We tried things, keeping those that succeeded and dropping those that failed. Over time, we evolved a methodology and framework that fit our size, culture and mission, eventually driving the massive increases in velocity and productivity that we see today.

Team Leads

There was one more big role definition gap that was causing a lot of confusion and that we needed to close: who takes care of the teams? While agile methodologies do a good job of defining the roles needed to get agile projects done, they don’t define roles needed to grow and run a healthy organization. For example, scrum has little to say regarding who hires, nurtures, mentors, and otherwise manages teams. Those functions are critical and need a clear home.

In Bonial engineering, we put these responsibilities on the “team lead” role. This role remains one of the most challenging, important and rewarding roles in Bonial’s engineering organization and includes the following responsibilities:

  • People
    • Recruiting
    • Personal development
    • Compensation administration
    • Morale and welfare
    • General management (e.g. vacation approvals)
    • Mentoring, counseling and, if needed, firing
  • Process
    • Effective lean practices
    • Efficient horizontal and vertical communications
    • Close collaboration with product owner (PO)
  • Technology
    • Architectural fitness (with support from the architecture team)
    • Operational SLAs and support (e.g. “On call”)
    • “Leading by example” – rolling up sleeves and helping out when appropriate
  • Delivery
    • Accountable for meeting OKRs
    • Responsible for efficient spend and cost tracking

That’s an imposing list of responsibilities, especially for a first-time manager. We’d be fools to thrust someone into this role with no support, so we start with an apprenticeship program – where possible, first-time leads shadow a more experienced lead for several months, only taking on individual responsibilities when they’re ready. We also train new leads in the fundamentals of leadership, management and agile, and each lead has active and engaged support from their manager and HR. Finally, we give them room to succeed, fail and learn.

So far this model has worked well. People tend to be nervous when first stepping into the role, but over time become more comfortable and thrive in their new responsibilities. The teams also appreciate this model. In fact, one of the downsides has been that it’s difficult to recruit into this role since it contains elements of traditional scrum master, team manager and engineering expert – a combination that is rare in the market. As such, we almost always promote into the role.

Closing Thoughts

In the end we know that no one methodology (or even a mashup of methodologies) will satisfy every contingency. To that end there are two important principles underpinning how we operate: flexibility and ownership. If something needs to be done, do it. It’s great when the person assigned a given role does a full and perfect job, but in the end success is everyone’s responsibility – if that person can’t or won’t do something, that’s no excuse for the team to let it slip.

Some closing thoughts:
  • People need to understand their roles and the expectations put on them to be most effective.
  • Teams need to have a unifying process to facilitate collaboration and avoid chaos and waste.
  • The overarching goal is team success; all members of the team should have that as their core role description.
  • Flexibility is key. Methodologies are a means to an end, not the ends themselves.

Accelerated Velocity: Enabling Independent Action

Note: this article is part 6 of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

Inefficiency drives me crazy.  It’s like fingernails on a chalkboard.  When I’m the victim of an inefficient process, I can’t help but stew on the opportunity costs and become increasingly annoyed.  This sadly means I’m quite often annoyed, since inefficiency seems to be the natural rest state for most processes.

There are lots of reasons why inefficiency is the norm, but in general they fall into one of the following categories:

1) Poor process design

2) Poor process execution

3) Entropy and chance

4) External dependencies

The good news in software development is that Lean/agile best practices and reference implementations cover process design (#1).  Process execution (#2) can likewise be helped by hiring great people and following agile best practices.  Entropy (#3) can’t, by definition, be eliminated but the effects can be mitigated by addressing the others effectively.

Which leaves us with the bane of efficient processes and operations: dependencies (#4). 

Simply put, a dependency is anything that needs to happen outside of the process/project in question in order for the process/project to proceed or complete.  For example, a software project team may require an API from another team before it can finish its feature.  Likewise a release may require certification by an external QA team before going to production.  In both cases, the external dependency is the point where the process will likely get stuck or become a bottleneck, often with ripple effects extending throughout the system.  The more dependencies, the more chances for disruption and delay.
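
The claim that more dependencies mean more chances for disruption can be made concrete with a toy model: if each external dependency independently delivers on time with probability p, a project waiting on n of them finishes on time with probability p to the power n.  The figures below are illustrative only.

```python
def on_time_probability(p, n):
    """Probability that all n independent dependencies deliver on time."""
    return p ** n

# Even quite reliable dependencies (90% on time) compound quickly.
for n in (1, 3, 5, 10):
    print(n, round(on_time_probability(0.9, n), 2))
# 1 0.9
# 3 0.73
# 5 0.59
# 10 0.35
```

Ten "pretty reliable" dependencies leave you with roughly a one-in-three chance of an undisrupted project, which is why removing dependencies outright beats managing them.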

So how does one reduce the impact of dependencies?

The simplest way is to remove the dependencies altogether.  Start by forming teams that are self-contained, aligned behind the same mission, and ideally report to the same overall boss.  Take, for example, the age-old divisions between product, development, QA, and operations.  If these four groups report to different managers with different agendas, then the only reasonable outcome will be pain.  So make it go away!  Put them all on the same team. Get them focused on the same goals.  Give them all a stake in the overall success.

Second, distribute decision making and control.  Any central governance committee will be a chokepoint, and should only exist when (a) having a chokepoint is the goal, or (b) when the stakes are so high that there are literally no other options.  Otherwise push decision-making into the teams so that there is no wait time for decisions.  Senior management should provide overall strategic guidance and the teams should make tactical decisions.  (SAFe describes it well here.)

In 2014, Bonial carried a heavy burden of technical and organizational dependencies, and the result was near gridlock. 

At the time, engineering was divided into five teams (four development teams and one ops team), and each team had integrated QA and supporting ops.  So far, so good.  Unfortunately, the chokepoints in governance and the technical restrictions imposed by a shared, monolithic code-base effectively minimized independent action for most of the teams, resulting in one, large, inter-connected mega-team.

There was a mechanism known as “the roadmap committee” which was nominally responsible for product governance, but in practice it had little to do with the roadmap and more to do with selective project oversight.  One of the committee’s policies held that no piece of work larger than a couple of days could proceed without its blessing, so even relatively minor items languished in queues waiting for upcoming committee meetings.   

What little did make it through the committee ran directly into the buzzsaw of the monolith.  Nearly all Bonial software logic was embedded in a single large executable called “Portal3”.  Every change to the monolith had to be coordinated with every other team to ensure no breakage.  Every release required a full regression test of every enterprise system, even when the changes were small and isolated to a single component.   This resulted in a 3-4 day “release war-room” every two weeks that tied down both ops and the team unfortunate enough to be on duty. 

It was painful.  It was slow.  Everyone hated it.

We started where we had to – on the monolith.  Efforts had been underway for a year or more to gradually move functionality off of the beast, but it became increasingly clear with each passing quarter that the “slow and steady” approach was not going to bear fruit in a timeframe relevant to mere mortals. So our lead architect, Al, and I decided on a brute force approach: we assembled a crack team which took a chainsaw to the codebase, broke it up into reasonably sized components, and then put each component back together. Hats off to the team that executed this project – wading through a spaghetti of code dependencies with the added burden of Grails was no pleasant task.  But in a few months they were done and the benefits were felt immediately.

The breakup of the monolith enabled the different teams to release independently, so we dropped the “integrated release” process and each team tested and released on their own.  The first couple of rounds were rough but we quickly hit our stride.  Overall velocity immediately improved upon removing the massive waste of the dependent codebase and labor-intensive releases.

The breakup of the monolith also untethered the various team roadmaps, so around this time we aligned teams fully behind discrete areas of the business (“value streams” in SAFe parlance). We pushed decision making into the teams/streams, which became largely responsible for the execution of their roadmaps with guidance from the executive team.  The “roadmap committee” was disbanded and strategic planning was intensified around the quarterly planning cycle.   It was, and still is, during the planning days each quarter that we identify, review and try to mitigate the major dependencies between teams.  This visibility and awareness across all teams of the dependency risk is critical to managing the roadmap effectively.

Eventually we tried to take it to the next level – integrating online marketing and other go-to-market functions into vertically aligned product teams – but that didn’t go so well.  I’ll save that story for another day.

The breakup of the monolith and distribution of control probably had the biggest positive impact in unleashing the latent velocity of the teams.  The progress was visible.  As each quarter went by, I marveled at how much initiative the teams were showing and how this translated into increased motivation and velocity. 

To be sure, there were bumps and bruises along the way.  Some product and engineering leaders stepped up and some struggled.  Some teams adapted quickly and some resisted.  Several people left the team in part because this setup required far more initiative and ownership than they were comfortable with.  But in fairly short order this became the norm and our teams and leaders today would probably riot if I suggested going back to the old way of doing things.

Some closing thoughts:

  • Organize teams for self-sufficiency and minimal skill dependencies
  • Minimize or eliminate monoliths and shared ownership
  • Keep the interface as simple, generic and flexible as possible when implementing shared systems (e.g. APIs or backend business systems) 
  • Be transparent about dependencies and manage them closely

Accelerated Velocity: Growth Path

Note: this article is part 5 of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

I recently heard that the average tenure of engineers at tech companies is less than two years.  If true, it’s a mind-boggling critique of the tech industry.  What’s wrong with companies that can’t retain people for more than a year or two?  Seriously – who wants to work for a team where people aren’t around long enough to banter about the second season of Westworld?

I know there are many factors in play, especially in hot tech markets, but there’s one totally avoidable fault that is all too common: being stupid with growth opportunities. 

Software engineering is one of those fields where skills often increase exponentially with time, especially early in a career.  Unfortunately, businesses seem loath to account for this growth in terms of new opportunities or increased compensation.   For example, companies set salaries at the time of hire, and that is what the employee is stuck with for their tenure at the company – with the exception, perhaps, of an annual cost-of-living increase.  At the same time, the employee is gaining experience, adding to their skills portfolio, and generally compounding their market value.  Within a year or two the gap between their new market value and their actual compensation has grown quite large.  As most businesses shudder at the idea of giving large raises on a percentage basis, the gap continues to grow and the employee eventually makes the rational decision to move to another company that will recognize their new market value, leaving the original company with an expensive gap in their workforce and a massive loss in knowledge capital.

In addition, many companies take a highly individualist approach to compensation with a goal of getting maximum talent for the lowest price.  While this is textbook MBA, it fails in practice simply because it doesn’t take into account human psychology around relative inequality: when people feel they are not being treated fairly they get demotivated.  This purely free-market approach leads to a situation in which people doing the same work have massive disparities in compensation simply because some people are better negotiators than others.  The facts will eventually get out, leaving the person on the low end bitter and both people feeling like they can’t trust their own company.  This is a failing strategy in the long term.

This is what I’ve seen at most companies I’ve been in or around, and this was essentially the situation at Bonial in 2014.  There was very high variance in compensation – at the extreme end we had cases in which developers were being paid half the salary of other developers on the same team despite similar experience and skills.  Salaries were also static – the contracted salary didn’t change unless the employee mustered the courage to renegotiate.  The negotiation sessions themselves were no treat for either the employee or their manager – in the absence of any framework they were essentially contests of wills, generally leaving both parties unsatisfied.

So we set out to develop a system that would facilitate a career path and maintain relative fairness across the organization.  We modeled it on a framework I’d developed previously which can be visualized as follows:

Basically, as a person gains experience (heading from bottom-left to top-right) they earn the chance to be promoted, which comes with higher compensation but also higher expectations.  They can also explore both technical specialist and management tracks as they become more senior, and even move back and forth between them.

The hallmarks of this system are:

  1. Systematic: Compensation is guided by domain skills – actual contributions to the business and market value – not by negotiation skills. 
  2. Fair: People at the same career/skill level will be compensated similarly.
  3. Regular: Conversation about career level and compensation happens at least once per year, initiated by the company. 
  4. Motivational: People have an understanding of what they need to demonstrate to be promoted. 
  5. Flexible: People have three avenues for increased compensation:
    • Raises – modest boosts in compensation for growth within their current career level based on solid performance.  This happens in between promotions.
    • Promotions – increases to compensation based on an employee qualifying for the next career level (with increased expectations and responsibilities).  This is where the big increases are and what everyone should be striving for.
    • Market increases – increases due to adjustment of the entire salary band based on an evaluation of the general market.
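
The band-based model above can be sketched as a simple lookup: each career level maps to a salary range, and a concrete salary is checked against the band for that level.  The levels and figures below are purely illustrative, not Bonial's actual bands.

```python
# Hypothetical salary bands per career level (annual, in local currency).
SALARY_BANDS = {
    "engineer_1": (45_000, 55_000),
    "engineer_2": (55_000, 70_000),
    "senior_engineer": (70_000, 90_000),
}

def band_position(level, salary):
    """Classify a salary as below, within, or above its level's band."""
    low, high = SALARY_BANDS[level]
    if salary < low:
        return "below band"   # candidate for a fairness/market adjustment
    if salary > high:
        return "above band"   # review the level, or hold until the band catches up
    return "within band"

print(band_position("engineer_2", 52_000))  # below band
```

A report built on a check like this makes fairness reviews routine instead of dependent on who dares to renegotiate.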

From a management perspective, this system also has some additional upsides:

  • Easy to budget.  Instead of planning with names and specific salaries, one can build a budget based on headcount at certain skill levels. 
  • Easy to adjust.  If the team decides it needs a mobile developer or a test automation engineer instead of a backend developer, for example, it simply trades one of its authorized positions for one of similar value.  Likewise it can shift seniority around as needed to meet its goals.
  • Mechanism for feedback.  By reserving promotions and raises for deserving contributors, this system provides an implicit feedback mechanism.

So far the system seems to be working well at Bonial, measured as much by what isn’t happening as what is.  For example, people who have left the team seldom call out compensation as their primary motivator.  We’ve also had few complaints about people feeling they are not being paid fairly compared to their peers.  

As a side note, we conduct regular employee satisfaction surveys and ask how employees feel about their compensation.  Interestingly, their responses on their feeling about compensation vs market do not strongly correlate with their overall satisfaction.  What does correlate?  Their projects, the tech they work with, their growth opportunities, the competence of their team mates, and their leads.  So these are the areas we have and will continue to invest in.

Some closing thoughts:

  • Professionals want to know they are being compensated fairly, both within the company and within the market.  That way they can focus on what they’re creating rather than worrying about their pay.
  • Professionals want the opportunity to grow and to be recognized (and rewarded) for their growth.  Providing a growth path inside the company improves employee retention and reduces costs related to talent flight.
  • Compensation is an asymmetric demotivator.  Low or unfair compensation will demotivate, but overly high compensation isn’t generally a motivator.  So make sure you’re out of the “demotivating” range and then focus on key motivators, especially in the area of day-to-day satisfaction.

Accelerated Velocity: Situational Awareness

Note: this article is part 4 of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

“If a product or system chokes and it’s not being monitored, will anyone notice?”  Unlike the classic thought experiment, this tech version has a clear answer: yes.  Users will notice, customers will notice, and eventually your whole business will notice.

No-one wants their first sign of trouble to be customer complaints or a downturn in the business, so smart teams invest in developing “situational awareness.”  What’s that?  Simple – situational awareness is the result of having access to the tools, data and information needed to understand and act on all of the moving factors relating to the “situation.”  The term is often used in the context of crisis situations or other fast-paced, high-risk endeavors, but it applies to business and network operations as well.

Product development teams most definitely need situational awareness.  The product managers and development leads need to know what their users are doing and how their systems are performing in order to make wise decisions – for example, whether the next iteration should focus on features, scale or stability.  Sadly, these same product teams often see the tracking and monitoring needed for developing situational awareness as “nice-to-haves” or something to be added when the mythical “someday” arrives. 

The result?  Users having good or bad experiences and no-one knowing either way.  Product strategy decisions being made on individual bias, intuition and incomplete snippets of information.  Not good.

Sun Tzu put it succinctly:

“If you know neither the enemy nor yourself, you will succumb in every battle.”

Situational awareness is a huge topic, so in this series I’m going to limit my focus to data collection (tracking and monitoring) and insights (analytics and visualization) at the product team level.  For the purposes of this series I’ll define “tracking” as the data and tools that show what users/customers are doing and “monitoring” as the data and tools that focus on system stability and performance.  Likewise I’ll use “analytics” to refer to tools that facilitate the conversion of data into usable intelligence and “visualization” as the tools for making that intelligence available to the right people at the right time.  I’ll cover monitoring in this article and tracking in a later article.

At Bonial in 2014 there was a feeling that things were fine – the software systems seemed to be reasonably stable and the users appeared happy.  Revenue was strong and the few broad indicators viewed by management seemed healthy.  Why worry?   

From a system stability and product evolution perspective it turns out there was plenty of reason to worry.  While some system-level monitoring was in place, there was little visibility into application performance, product availability or user experience.  Likewise our behavioral tracking was essentially limited to billing events and aggregated results in Google Analytics.  Perhaps most concerning: one of the primary metrics we had for feature success or failure was app store ratings.  Hmmm.

I wasn’t comfortable with this state of affairs.  I decided to start improving situational awareness around system health so I worked with Julius, our head of operations, to lay out a plan of attack.  We already had Icinga running at the system level as well as DataDog and Site24x7 running on a few applications – but they didn’t consistently answer the most fundamental question: “are our users having a good experience?” 

So we took some simple steps like adding new data collectors at critical points in the application stack.  Since full situational awareness requires that the insights be available to the right people at the right time, we also installed large screens around the office that showed a real-time stream of the most important metrics.  And then we looked at them (a surprisingly challenging final step). 
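To give a flavor of what such a collector boils down to, here is a minimal sketch (not our actual tooling – the endpoint, thresholds and success ratio are invented for illustration): probe a critical API, record status and latency, and reduce the raw samples to the one question that matters – are our users having a good experience?

```python
# Illustrative probe + reduction, NOT Bonial's actual monitoring code.
import time
import urllib.request

LATENCY_SLO_MS = 500  # assumed latency target for a "good experience"

def probe(url, timeout=5.0):
    """Make one request; return (http_status, latency_ms), status 0 on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception:
        status = 0
    return status, (time.monotonic() - start) * 1000.0

def user_experience_ok(samples, slo_ms=LATENCY_SLO_MS, min_success=0.99):
    """Reduce (status, latency_ms) samples to a single healthy/unhealthy verdict."""
    if not samples:
        return False  # no data means no awareness - treat as unhealthy
    good = [1 for status, ms in samples
            if 200 <= status < 400 and ms <= slo_ms]
    return len(good) / len(samples) >= min_success

# e.g. feed a wall monitor: user_experience_ok([probe("https://api.example.com/health")])
```

The point of the reduction step is that a wall of raw metrics doesn’t answer the fundamental question; a single green/red verdict per critical API does.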

[Image: the Bonial NOC monitor wall – one of my “go to” overviews of critical APIs, showing two significant problems during the previous day.]

The initial results weren’t pretty.  With additional visibility we discovered that the system was experiencing frequent degradations and outages.  In addition, we were regularly killing our own systems by overloading them with massive online marketing campaigns (for which we coined the term: “Self Denial of Service” or SDoS).  Our users were definitely not having the experience we wanted to provide.

(A funny side note: with the advent of monitoring and transparency, people started to ask: “why has the system become so unstable?”)

We had no choice but to respond aggressively.  We set up more effective alerting schemes as well as processes for handling alerts and dealing with outages.  Over time, we essentially set up a network operations center (NOC) with the primary responsibility of monitoring the systems and responding immediately to issues.  Though exhausting for those in the NOC (thank you), it was incredibly effective.  Eventually we transferred responsibility for incident detection and response to the teams (“you build it, you run it”) who then carried the torch forward.
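The heart of any alerting scheme is deciding which breaches wake someone up and which can wait until morning.  A minimal sketch of that severity classification – the thresholds and routing labels here are invented, not our actual values:

```python
# Hypothetical severity classification for an alerting scheme.
# Thresholds are illustrative assumptions, not real operational values.
def classify(error_rate, latency_p95_ms):
    """Map measured error rate and p95 latency to an alert route."""
    if error_rate >= 0.05 or latency_p95_ms >= 2000:
        return "page"    # wake the on-call engineer now
    if error_rate >= 0.01 or latency_p95_ms >= 1000:
        return "ticket"  # handle during business hours
    return "ok"          # no action needed
```

Making these tiers explicit is what keeps an alerting scheme from drifting into either alarm fatigue (everything pages) or silence (nothing does).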

Over the better part of the next year we invested enormous effort into triaging the immediate issues and then making design and architecture changes to fix the underlying problems.  This was very expensive as we tapped our best engineers for this mission.  But over time daily issues became weekly became monthly.  Disruptions became less frequent and planning could be done with reasonable confidence as to the availability of engineers.  Monitoring shifted from being an early warning system to a tool for continuous improvement. 

As the year went on the stable system freed up our engineers to work on new capabilities instead of responding to outages.  This in turn became a massive contributor to our accelerated velocity.  Subsequent years were much the same – with continued investment in both awareness and tools for response, we confidently set and measured aggressive SLAs.  Our regular investment in this area massively reduced disruption.  We would never have been able to get as fast as we are today had we not made this investment.

We’ve made a lot of progress in situational awareness around our systems, but we still have a long way to go.  Despite the painful journey we’ve taken, it boggles my mind that some of our teams still push monitoring and tracking down the priority list in favor of “going fast”.  And we still have blind spots in our monitoring and alerting that allow edge-case issues – some very painful – to remain undetected.  But we learn and get better every time.

Some closing thoughts:

  • Ensuring sufficient situational awareness must be your top priority.  Teams can’t fix problems that they don’t know about.
  • Monitoring is not an afterthought.  SLAs and associated monitoring should be a required non-functional requirement (NFR) for every feature and project.
  • Don’t allow pain to persist – if there’s a big problem, invest aggressively in fixing it now.  If you don’t you’ll just compound the problem and demoralize your team.
  • Lead by example.  Know the system better than anyone else on the team.
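The “monitoring is not an afterthought” bullet in practice: every feature ships with a declared SLA, and a check along these lines runs against the measured data.  The targets below are invented for illustration:

```python
# Illustrative SLA check - targets are assumptions, not real values.
SLA = {"availability": 0.999, "latency_p95_ms": 500}

def meets_sla(measured, sla=SLA):
    """True if measured availability and p95 latency both satisfy the SLA."""
    return (measured["availability"] >= sla["availability"]
            and measured["latency_p95_ms"] <= sla["latency_p95_ms"])
```

Treating the SLA as a non-functional requirement means a feature isn’t “done” until a check like this exists and passes against production data.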

In case you’re interested, here are some of the workhorses of our monitoring suite:

Accelerated Velocity: Building Great Teams

Note: this article is part 3 of a series called Accelerated Velocity.  This part can be read stand-alone, but I recommend that you read the earlier parts so as to have the overall context.

People working in teams are at the heart of every company.  Great companies have great people working in high performing teams.  Companies without great people will find it very difficult to get exceptional results. 

The harsh reality is that there aren’t that many great people to go around.  This results in competition for top talent, which is especially true in tech.  Companies and organizations use diverse strategies in addressing this challenge.  Some use their considerable resources (e.g. cash) to buy top talent though with dubious results – think big corporations and Wall Street banks.  Some create environments that are very attractive to the type of people they’re looking for – think Google and Amazon.  Some purposely start with inexperienced but promising people and develop their own talent – a strategy used by the big consulting companies.  Many drop out of the race altogether and settle for average or worse (and then hire the consulting companies to try to solve their challenges with processes and technology – which is great for the consulting companies).

But attracting talent is only half the battle.  Companies that succeed in hiring solid performers then have to ensure their people are in a position to perform, and this brings us to their teams.  Teams have a massive amplifying effect on the quantity and quality of each individual’s output.  My gut tells me that the same person working on two different teams may be 2-3X as productive depending on the quality of the team. 

So no matter how good a company is at attracting top talent, it then needs to ensure that the talent operates in healthy teams. 

What is a healthy team?  From my experience it looks something like this:

  • Competent, motivated people who are…
  • Equipped to succeed and operate with…
  • High integrity and professionalism…
  • Aligned behind a mission / vision

That doesn’t seem too hard.  So why aren’t healthy teams the norm?  Simple: because they’re fragile.  If any of the above pieces are missing, the integrity of the team is at risk.  Throw in tolerance for low performers, arrogant assholes, and whiners, mix in some disrespect and fear, and the team is broken.

(Note that negative influences outweigh positive ones – as the proverb says: “One bad apple spoils the whole bushel.”  If you play sports you know this phenomenon well – a team full of solid players can easily be undone by a single weak link that disrupts the integrity of the team.)

This leads me to a few basic rules I follow when developing teams:

  1. Provide solid leadership
  2. Recruit selectively
  3. Invest in growth and development
  4. Break down barriers to getting and keeping good people
  5. Aggressively address low-performance and disruption

Bonial had a young team with a wide range of skill and experience in 2014.  Fortunately many of the team members had a bounty of raw talent and were motivated (or desired to be motivated).  Unfortunately there were also quite a few under-performers as well as some highly negative and disruptive personalities in the mix.  The combination of inexperience, underperformance and disruption had an amplifying downward effect on the teams.

To build confidence and start accelerating performance we needed to turn this situation around.  We started by counseling and, if behavior didn’t change, letting go the most egregious low performers and disruptive people – not an easy thing to do, and somewhat frowned upon both in the company and in German culture.  But the cost of keeping them on the team, thereby neutralizing and demoralizing the high performers, was far higher than the pain and cost of letting them go. 

(A quick side note: there were concerns among the management that letting low-performers go would demoralize the rest of the team.  Not surprisingly, quite the opposite happened – the teams were relieved to have the burdens lifted and were encouraged to know that their leads were committed to building high performing teams.)

We started doing a better job of mentoring people and setting clear performance goals.  Many thrived with guidance and coaching; some didn’t and we often mutually decided to part ways.  Over time the culture changed to where low performance and negativity were no longer tolerated.

At the same time we invested heavily in recruiting.  We hired dedicated internal recruiters specifically focussed on tech recruits.  We overhauled our recruiting and interview process to better screen for the talent, mentality and personality we needed.  We added rigor to our senior hiring practices, focussing more on assessing what the person can do vs what they say they can do.  And we added structure to the six month “probation” period, placing and enforcing gates throughout the process to ensure we’d hired the right people.  Finally, we learned the hard way that settling for mediocre candidates was not the path to success; it was far better to leave a position unfilled than to fill it with the wrong person.

How did we attract great candidates?  We focussed on our strengths and on attracting people who valued those attributes: opportunities for growth, freedom to make a substantial impact, competent team-mates, camaraderie, a culture of respect, and exposure to cutting-edge technologies.  Why these?  Because year over year, through employee satisfaction surveys and direct feedback, we find these elements correlate very strongly with employee satisfaction, even more so than compensation and other benefits.  In short, we’ve worked hard to create an environment where our team-mates are excited to come to work every day.

(This is not to say we ignored competitive compensation; as I’ll describe in a later post, we also worked to ensure we paid a fair market salary and then provided a path for increasing compensation over time with experience.)

Over time, as our people became more experienced, our processes matured and our technology set became more advanced, Bonial became a great place for tech professionals to sharpen their skills and hone their craft.  New team members brought fresh ideas and at the same time had an opportunity to learn both from what we already had and from what they helped create.  The result is what we have today: a team of teams full of capable professionals who are together performing at a level many times higher than in 2014.

Some closing thoughts:

  • You’re only as good as the people on the teams.
  • Nurture and grow talented people. Help under-performers to perform. Let people go when necessary.
  • Get really good at recruiting.  Focus on what the candidate will do for you vs what they claim to have done in the past.
  • Don’t fall into the trap of believing process and tools are a substitute for good people.

Footnote: If you haven’t yet, I suggest you read about Google’s insightful research on team performance and how “psychological safety” is critical to developing high performing teams.