Conversations with Amazon Alexa

(Warning: this article will delve into technical design and code topics – if you’re not in the mood to geek-out you might want to skip this one.)
 
I’m excited about Alexa and its siblings in the voice assistant space – the conversational hands-free model will facilitate “micro moment” interactions to a degree that even mobile apps couldn’t match.  These new apps and interactions can be quite powerful, but as the saying goes – “with great power comes great responsibility.”  In this case the responsibility is to build voice interfaces that don’t suck, and that’s not trivial.  We’ve all used bank or airline automated systems that infuriated us by being confusing, wasting our time, or leaving us stuck in “IVR hell,” unable to be understood or to get where we want to go.
 
Fortunately there are solutions.  First, there is a UX specialty known as Voice User Interface Design (VUI Design) whose practitioners are highly skilled in the art, science, psychology, sociology and linguistics required to craft quality speech interactions.  Unfortunately they are rare and will likely be in extremely high demand as voice assistant skills blossom.
 
Second, there are online frameworks for developing speech interactions that fill much the same role as bumpers at the bowling alley – they won’t make you a better bowler, but they’ll protect you from some of the most egregious mistakes.  Perhaps the best tool on the market today is API.AI, which is primarily a natural language interpretation engine that can be the brains behind a variety of conversation interfaces – chat bots like Facebook Messenger and Telegram, voice assistants like Google Home, etc.
 
The Alexa Skills Kit (ASK) also comes with an online tool for developing interactions, but it’s quite primitive and cumbersome to use for anything but the simplest of skills.  Probably the biggest gap in the ASK is the lack of support for “slot filling”.  Slot filling is what speech interfaces do when they don’t get all the info needed to complete a task.  For example, let’s say you’re developing a movie ticket purchase skill.  In a perfect world every user would properly say, “I’d like two adult tickets to the 5:00 PM showing of Star Wars today.”  Given that our users will be rude and not behave the way we want them to, it’s likely they’ll say something like, “I want two tickets to Star Wars.”  It’s our skill’s responsibility to discover the [ ticket type ], [ showtime ], and [ show date ].  Our skill would likely next ask the user: “How many tickets do you want to buy?” and so on.  That’s slot filling.
 
Alexa provides no native tools for managing slot filling, so it’s left to the developer to implement the functionality on their own service (which Alexa calls via “web hooks”).  Here’s an approach we use here at Bonial:
 
  • Create a Conversation object (AlexaConversation) that encapsulates the current state of the dialog and the business logic for determining next steps.  The constructor takes the request model from Alexa, which includes a “Session” context.   Conversations expose three methods:
    1. get_status() – whether the current dialog is complete or not
    2. get_next_slot() – if the dialog is not complete, which slot needs to be filled next
    3. get_session_context() – the new session context JSON to be sent back to Alexa (and then returned to the app on the next call) – basically the dialog state
from abc import ABC, abstractmethod


class Conversation(ABC):
    """Encapsulates the current state of a dialog and the logic for deciding next steps."""

    # pass in the underlying model or data needed to assess the current state of the dialog
    def __init__(self, model):
        self.model = model
        self.status = None
        self.type = None

    @abstractmethod
    def get_status(self):
        """Return whether the current dialog is complete or not."""

    @abstractmethod
    def get_next_slot(self):
        """Return the slot that still needs to be filled, if the dialog is not complete."""

    @abstractmethod
    def get_session_context(self):
        """Return the session context JSON to send back to Alexa (the dialog state)."""
  • When a request from Alexa arrives, we simply create an AlexaConversation with the request JSON and ask whether the current dialog is complete or not.  If it is complete, we then pass the dialog to the business logic layer for interpretation and processing (more on this later).  If not complete we respond to Alexa with a prompt to ask for the next slot, as sketched below.  Repeat.
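
To make that concrete, here’s a minimal sketch of how a concrete AlexaConversation and webhook handler might tie the pieces together.  The slot names, prompts, JSON paths and the process_order / build_alexa_response helpers are hypothetical placeholders for illustration, not our production code:

# Hypothetical example: a conversation for the movie ticket skill described above.
# Slot names, prompts and JSON paths are simplified for illustration.
SLOT_PROMPTS = {
    "ticket_type": "How many adult or child tickets would you like?",
    "showtime": "Which showtime would you like?",
    "show_date": "For which date?",
}


class AlexaConversation(Conversation):
    REQUIRED_SLOTS = ["ticket_type", "showtime", "show_date"]

    def _filled_slots(self):
        # slots gathered so far, echoed back by Alexa in the session attributes
        return self.model.get("session", {}).get("attributes", {}).get("slots", {})

    def get_next_slot(self):
        return next((s for s in self.REQUIRED_SLOTS if s not in self._filled_slots()), None)

    def get_status(self):
        return "COMPLETE" if self.get_next_slot() is None else "INCOMPLETE"

    def get_session_context(self):
        return {"slots": self._filled_slots()}


def handle_alexa_request(request_json):
    conversation = AlexaConversation(request_json)
    if conversation.get_status() == "COMPLETE":
        # hand the completed dialog to the business logic layer (hypothetical helper)
        return process_order(conversation)
    # otherwise prompt for the next missing slot and return the dialog state to Alexa
    return build_alexa_response(
        prompt=SLOT_PROMPTS[conversation.get_next_slot()],
        session_attributes=conversation.get_session_context(),
    )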
 
So far it’s working well and reduces the complexity of the processing code.  Unfortunately both the dialog rules (how many slots, which are required, in which order) and the slot prompts live in the code.  Our next step will be to move both of these into a declarative format so the VUI designers will have the flexibility to edit without involving the coders.
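
As a rough illustration, the declarative version might look something like this – a sketch with hypothetical field names, not our actual format:

# Hypothetical declarative dialog definition.  The idea is that VUI designers edit
# this data while a generic conversation class interprets it at runtime.
TICKET_DIALOG = {
    "intent": "BuyMovieTickets",
    "slots": [
        {"name": "ticket_type", "required": True,
         "prompt": "How many adult or child tickets would you like?"},
        {"name": "showtime", "required": True,
         "prompt": "Which showtime would you like?"},
        {"name": "show_date", "required": False,
         "prompt": "For which date?", "default": "today"},
    ],
}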
 
We assume this will be a stop-gap until the ASK and other resources have proper slot-filling capabilities.  We’d also love to hear how you’re approaching this challenge.

What a difference a decade makes…

I frequently fly transatlantic as part of my job.  Over the past few years I’ve been excited to see airlines (Delta, Lufthansa, Air Berlin) begin to offer two things: (1) in-seat AC power and (2) internet access throughout the flight.  Now I can run my laptop the entire flight and, on daytime flights, stay connected with my team back in Berlin.
 
Last week I was fortunate to be rerouted from a Delta codeshare KLM flight (no power, no internet) onto Lufthansa (power, internet).  On the daytime flight from Frankfurt to Chicago I spent nine hours of blissful time catching up on a ton of work that required online access.  I was able to Slack with my team the whole time, send emails, and work on shared documents.  At one point, I was working on a prototype of a voice assistant project – the IDE was running on my laptop and deploying code to Heroku, I was using API.AI to develop the natural language interface, and used the Alexa Skills Kit to generate sample Alexa calls.  Traffic was constantly flowing between all of the nodes.  All from my seat on the plane.
 
Ten years ago we didn’t have smart phones.  We were just a few years past modems.  Streaming media was mostly a dream. There certainly wasn’t wifi on planes.
 
The jury is still out on whether I’ll miss the eight hours of uninterrupted quiet time on planes binge-watching movies that I probably didn’t want to see – there’s certainly something to be said for being unplugged.  But I sure as heck like the option to stay connected.
 
What a difference a decade makes.  It makes me wonder what the next decade will bring.  Can’t wait – should be a wild ride.

Extreme Ownership, Tech Style

I recently read “Extreme Ownership,” a popular read on leadership by former Navy SEAL officers.  The core premise of the book is that a leader must fully own the results (good or bad) of everything their team does if they are to create and lead a successful team.  Accountability is key, even in cases (or especially in cases) in which events are out of one’s direct control – the leader is responsible for ensuring that everyone in their organization has the context and competence to succeed, even to the point of getting rid of underperforming team members when necessary.  There are no excuses.

How I wish I could find more of this in the tech domain.

I have limited experience in non-tech industries so I can’t say whether it’s better or worse elsewhere, but techies love their excuses.  When I was consulting I called it the “Any and All Excuses Accepted Here” phenomenon.  I’ve lost track of the number of status reports (standups, etc.) in which someone reports that their task or project is late and everyone (leadership included) just nods at the excuses and moves on.  Perhaps a developer got sick or another team didn’t deliver on time.  Maybe there was a massive network outage that blocked access to servers or critical services.  In truth the challenges are legitimate, but so what?  I rarely see the person own the outcome and explain what they’re going to do to make things right.

Why is this attitude important?  Simple: as one of my mentors used to say, the market doesn’t give a damn about your excuses.  Either you deliver and win or you don’t.  

There are certainly companies who are much less tolerant of excuses in their relentless pursuit of market leadership, however that doesn’t mean their leaders actually embrace the concept of Extreme Ownership.  Many of these companies have cultures in which blame replaces excuses and leaders throw their subordinates or peers under the bus.  Shit rolls downhill.  The culture quickly becomes toxic and, while the short-term business results may be impressive and the investors are happy, the people responsible for delivering the success work in fear under weak leaders.

So how do we fix this?

It starts with you.  If you’re a leader in your organization you must embrace Extreme Ownership yourself if you want the rest of the organization to follow suit.  Once you do, you’ll find that it becomes contagious and spreads quickly throughout your teams.

Getting into the Extreme Ownership mindset takes work.  Start here: the next time your team fails, resist the urge to make any excuses or to pounce on the person who screwed up.  First ask yourself the question: “What could I have done to get a different result?”  Then make it right if at all possible.  Own up fully and personally to the failed result and set about doing what you can to make sure it doesn’t happen again.  Sometimes it’ll involve better communication; sometimes more training.  Usually it will require hard thinking about how to do things better.  Often it will need hard conversations about individual performance and, in extreme cases, the removal of people who simply can’t fulfill their team duties.  The latter is tough and is a last resort, but is necessary to ensure the health of the team.

Creating a culture of ownership is not enough – training is also needed.  Let’s face it – most people are not natural-born leaders. But I believe, and my experience has shown me, that most people can learn to be solid leaders.  As leadership has strong components of science and psychology it lends itself well to training.  We do a great disservice to our industry by thrusting new leaders into roles without any training or support (or worse, sending them to bland corporate boilerplate training).  More on this in a later blog. 

In closing – read the book (preferably with your team) or listen to the interview on Tim Ferriss’ podcast and start adopting the principles.  You won’t regret it.  

Own it!

Here to Stay

As we slide into 2017, speech recognition is all the rage – it was the darling of CES and you can’t pick up a business or tech journal without reading about the phenomenon.  The sudden explosion of interest and coverage could lead one to assume this is yet another hype bubble waiting to burst.  I don’t think so, and I’ll explain why below.  But first let’s roll back the calendar and look at the evolution of the technology… 2016… 2015… 2014… 2010… 2005…

In the late 1990s and early 2000s, speech recognition was a hot topic.  Powered by early versions of machine learning algorithms and improvements in processing power, and boosted by the advent of VoiceXML, which brought web methods to voice, the technology had pundits preaching of the day when a “voice web” would rise to rival the web of the dot-com boom.

It never happened.

Implementors quickly found an Achilles’ heel in speech interfaces: the single-dimensional information stream provided by audio was no match for the two-dimensional visual presentation of the web and apps – it was simply too cumbersome to consume large quantities of information via audio.  This relegated speech to “desperation” scenarios where visual simply wasn’t an option, or to human automation scenarios (e.g. call centers).

Fast forward a decade and a half.  Siri, for all that it’s been maligned, was a watershed moment for speech.  It came with reasonable accuracy and with a set of natural use cases (hands-free driving, message dictation, cold-weather hands-free operation, etc.).  It took speech mainstream.

What Siri started, Amazon Echo took to the next level.  Rather than requiring the user to interrupt the natural flow of their lives to fiddle with their phone, Alexa is always on and ready to go (so long as you’re near it, of course).  This means Alexa enables Micro-Moments and integrates into one’s normal flow of life.  The importance of that can’t be overstated.

Over the last six months the other tech giants have started falling over themselves to respond to the market success of Echo and the surprising stats coming in from the market: 20% of mobile searches via speech, 42% of people using voice assistants, etc.  Google recently released “Home” and is plugging Assistant into its Pixel phone and other devices.  Facebook and others are trailing close behind.  And Apple is working to regain its early lead by freeing Siri from the confines of the phone and laptop.

So where’s it all going?

To speculate on that we should probably look at why consumer speech recognition is succeeding this time.  First, improvements in processing power and neural network / “deep learning” algorithms dropped the cost and radically improved the accuracy of speech recognition.  This has allowed speech + AI to subtly creep into more and more user-facing apps (e.g. Google Translate), which both conditioned users and helped train the speech engines.  The technology is still limited to single-dimensional streams, but the enormous popularity of chat and, more recently, bots shows that there is plenty of attraction to this single dimension.

But speech is still limited – for example, the best recognition engines need a network connection to use cloud resources, and the noisy environments common to cityscapes continue to confound recognition engines.  This is why the Echo approach is genius – an always-on device with a great microphone in a (somewhat) quiet environment.  But will speech move beyond the use case of playing music or getting the weather?

Yes.  Advanced headphones like the Apple AirPods will expand “always on” beyond the home.  Improved algorithms will handle noisy environments.  But perhaps most important – multi-modal interfaces are now eminently possible.

What’s multi-modal?  Basically an interaction that spans multiple interfaces simultaneously.  For example, you may start an interaction via voice but complete it via a mobile device – like asking your voice assistant to read your email headers but then forwarding an email of interest to the nearest screen to be read and responded to.  Fifteen years ago there simply weren’t many options for bouncing between speech and graphical interfaces.  Today, the ubiquity of connected smartphones changes the equation.

Will this be the last big wave of speech?  No.  Until the speech interface is backed by full AI it can’t reach its full potential.  Likewise there’s still a lot of runway in terms of interface devices.  But this time it’s here to stay.

Simplify Simplify Simplify

Near the end of my first year in my first development job at Andersen Consulting (now Accenture) I was handed an exciting new project: to develop a Monitoring And Reporting Server for a major bank’s financial systems.  And to do it in a few weeks.

This was the mid-1990s – in the very early days of the web and long before the ELK stack and other tools that would have made this pretty straightforward.  In addition, we were developing this in C/C++ and had to write it all from scratch, as open source was still in its infancy and Java and C# were still to come.

But what software engineer doesn’t like these kinds of challenges?  I was feeling pretty good about myself – I’d had a couple of solid wins up to this point, so in some ways this project was a nod to my success so far and a challenge to step up from pure coding to design and architecture.

Under the tutelage of my supervisor at the time I sank my teeth into it.  Given the challenges – no open source, unforgiving programming environments, limited compute and memory, limited time, etc. – you’d think I’d try to keep it simple.  Right.  On paper (yes, we still used paper) I created an incredibly sophisticated design with input channels (“Translators” in design pattern parlance, but even design patterns were in their infancy), normalization layers, storage mechanisms, and “bots” to look for anomalies.  It had self-monitoring and auditing and adapters for new pluggable streams.  For good measure I designed it all to interoperate with other systems and technologies using a (complex and slow) CORBA ORB.  I abstracted everything just in case some future unknown requirement might require extensions or adaptations.  I was very proud of it.

It was never completed.  

Thank goodness.  While I was disappointed at the time, I realize now that this creation was destined to be a total monstrosity and likely a failure.  Soon into the coding I was already bogged down with massive issues in keeping track of the (hand-written) thread pools as well as challenges managing the complexity of the modular system builds.  Had it been completed it would have taken a team of rocket scientists to understand and maintain what was, in the end, a pretty simple concept.  Fortunately another project came along with higher priority and I was put on that.  (Plus, though it was never said, I think my supervisor knew deep down that my approach was leading us to a bad place.)

What went wrong?  Easy – I’d violated the first rule of software architecture: SIMPLIFY SIMPLIFY SIMPLIFY. 

Duh!  Right?  Anyone in the field for more than a few seasons knows that more complexity leads to higher risk, more bugs, longer development times and greater total cost of ownership (TCO).  But we do it anyway.  The wisdom gets lost in the excitement of creating something big and beautiful.

It happens all the time.  Even with all of the “bumpers” to keep us honest these days – solid design patterns built into many of the open source frameworks, Lean software practices that naturally select for simplicity – I still see this rule violated on a regular basis.  I spent several years consulting for large corporations, largely fixing violations of this rule.  Even today some of the biggest challenges my team faces at Bonial come from systems and projects that were simply too complex for the work they did.

So, how do you implement this principle in practice?  There’s no magic – you just challenge yourself or, better yet, have the team challenge you.  When you’ve come up with your approach or design, ask yourself, “OK, now how can I make this 30% simpler?”  Look for over-design, especially abstractions and facades that aren’t really needed or could be added later.  Challenge the need for every concurrency implementation, especially when app servers are in play.  Look hard at the sheer number of moving pieces and ask whether one object / module / server will suffice.  And always look for software, best practices and frameworks that already do what you’re trying to do – the simplest thing is not having to do it at all.

Hello World…

Do you remember the point at which you knew you’d crossed over from being a kid to a grown up?  When you realized that you were now one of those older / wiser / more successful professionals that used to be a league apart?  
 
I have to admit it was a bit of a shock when I slowly woke up to the fact that I was now “that” person.  I didn’t really feel any different, but everyone else seemed to think I was.  And the truth is I was different.  I knew things that others hadn’t learned yet, my instincts were honed by successes and failures, I had a wealth of experiences to draw on and share, and I could connect the dots better simply because I had more dots to connect.
 
With this in mind I have a lot of mentors to thank for sharing their experiences and knowledge with me, and I owe it to the next generation to share what I’ve learned.  Hence this blog.
 
So what have I learned?  Well, for starters, a bit about technology.  After studying biochem and serving as a submariner I moved into the software space, where I’ve run the gamut from coder to CTO.  I know software, hardware, and networks as well as data storage, transfer and analysis.  I’ve been fortunate to be heavily involved in business activities, including one IPO, so I have a working knowledge of decidedly non-technical topics like sales and marketing, financial modeling, business operations and product management.  Finally, as I’ve been in leadership roles for my entire adult life, I’ve developed a deep appreciation for the fact that businesses, even tech businesses, are about the people first (and the people that lead them).
 
This will be an “agile” blog, adapting as needed given the events in tech and markets.  There will be some war stories, opinions, lessons learned and current events.  I’ll break them down into core themes – tech, leadership, business, general – so you can follow whichever threads interest you the most.  
 
So please subscribe to the blog, comment often with questions or your own experiences, and enjoy!