What I learned from writing an AI voice assistant and chat bot

I have a confession: despite being in management I still love to code.  Since I don’t get to program as much as I’d like or stay up on the latest trends and technologies, I set a goal for myself to learn at least one new technology every year (and more than one on a good year).  This learning hobby is how I made the leap from back-end to full-stack developer, how I learned iOS and Android, and how I stepped into the hallowed halls of Data Science.

This year I decided to explore chat bots and voice assistants.  As I learn best by doing, I generally think up a fun or useful project and then learn through building it.  For this project I decided to tackle an unending source of stress in my household: bickering and arguing over screen time for our kids.  

Enter ChronosBot

The idea behind ChronosBot is simple.  Parents set up screen time accounts for each child as well as an automatic allowance that puts time in the accounts.  After linking their account to Alexa, Google Assistant, Facebook Messenger, etc., they can say or write things like, “Alexa, ask ChronosBot to withdraw 30 minutes from Axel’s account” or “… what’s everyone’s balance?”

With the idea in place, I had to choose my tech stack.  Google has a robust platform built on API.AI.  API.AI supports a dozen or so chat integrations (Allo, Messenger, Telegram, Kik, etc.) as well as a voice interface for Google Home, allowing developers to (theoretically) write one interface for both voice and chat.  At the time I started, Amazon Alexa had a rudimentary platform for speech dialog development using structured text.  On both platforms the interface designer creates “intents” that match what the user says to something the bot can do and then provides appropriate responses, and both platforms hand off the business logic to a backend app using web hooks.
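As a rough, platform-neutral illustration (not either platform’s actual schema – the names, slot types, and utterances below are made up), an intent boils down to a name, some sample utterances with slots, and a bit of backend logic that the web hook reaches:

# Illustrative only: a simplified, platform-neutral picture of an intent.
# The real Alexa and API.AI schemas differ; slot types and names here are made up.
WITHDRAW_TIME_INTENT = {
    "name": "WithdrawTime",
    "sample_utterances": [
        "withdraw {minutes} minutes from {child}'s account",
        "take {minutes} minutes away from {child}",
    ],
    "slots": {"minutes": "NUMBER", "child": "FIRST_NAME"},
}

def handle_withdraw_time(slots):
    # This is the piece the platform hands off to the backend via a web hook.
    minutes, child = int(slots["minutes"]), slots["child"]
    # ... debit the child's screen time account here ...
    return "OK, I took {0} minutes from {1}'s account.".format(minutes, child)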

For the backend, I decided to sharpen my python skills and implement in Django on top of Postgres.  For deployment I decided to give Heroku a try.  

Development of the basic use cases took me a couple of weeks of late evenings and weekends.  I submitted to both Amazon and Google and waited a week or so in each case for the review.  Both rejected my app, but for reasons I hadn’t expected.  Amazon told me that my app violated the Alexa terms of service because it “targeted children” (huh?) and told me never to resubmit it (they seem to have since relented).  Google gave me the boot because my invocation name couldn’t be recognized properly, but a very helpful person from Google worked with me to resolve the issue and now it’s live.

I’ve since continued development and added new features like “rewards” and “penalties” (requested by my wife) and “mystery bonus” (requested by the kids).  I’ve enabled Telegram and Messenger and have adapted the platform to support both visual and audio surfaces.  And the Alexa version was finally approved earlier this week.

Lessons Learned

So, what have I learned while navigating the ins and outs of the Google and Alexa development platforms and publication processes?

1)  Amazon and Google have very different approaches.  Google has boldly chosen to enable all community-developed actions and use an intent-matching algorithm to route users to the correct action.  Amazon requires users to enable specific skills via a Skill Store.  In both cases, discovery remains a largely unsolved challenge.

2)  It’s too early to tell who will be king.  Amazon Alexa has a crazy head start, but Google seems to have a more robust speech development platform.  With a zillion Android devices already on the market, one certainly can’t count Google out.  On the other hand, not a month seems to go by without a new Alexa form factor hitting the market.

3)  It’s early days.  Both platforms are being developed at a lightning fast pace.  Google had a big head start with API.AI.  The original Alexa interface was frustratingly primitive, but they’ve since upgraded to a new UI (which suspiciously bears a strong resemblance to API.AI) that has great promise.  

I have to take my hat off to both companies for creating a paradigm and ecosystem that makes voice assistant and natural language development accessible to the broader development community.  It’s so straightforward that even my kids gave it a try – my daughter (10) developed “The Oracle”, which answers deeply profound questions like “Who’s awesome?” (she’s awesome).  My son (12) wrote a math quiz game and is happy to challenge anyone to beat his top score.

4)  Conversational UX is easy; good conversational UX is really hard.  I’ve known this since I was involved with Nuance and the voice web in the late 1990s (and I also happen to be married to an expert in the space).  Making it easy to build a conversational UX is a very different thing from helping developers build a high-quality conversational UX (especially a voice UX).  Both Amazon and Google have tried to address this with volumes of best-practice documentation, but I expect most developers will ignore it.

5)  Conversational UX is limited.  There are some use cases that work for serial interactions (voice or chat) and some that work better in parallel interactions (visual).  Trying to force one into the other typically doesn’t make sense or only applies to “desperate users”.  You see the effect of this to some degree already in the Alexa Skill Store – there are some clear clusters evolving (home automation, information retrieval, quiz games).

6)  Multi-modal UX is the next natural step.  I’m very excited about the Amazon Echo Show as I expect that will unleash a wave of interesting multi-modal interaction paradigms.  

7)  It’s fun.  There’s just something about the natural language element of voice assistants that allows for a richer, more human interaction than what GUIs can provide.  

All in all I’m really excited about the potential of this space, and I’m not alone – just look at the growth of the Alexa Skills Store.  The tech press is also taking a critical look at these capabilities (e.g. a recent article featuring yours truly), and I expect most companies are at least thinking about how these capabilities will play in their business.  My company, Bonial, is investing in several actions/skills to explore the potential of voice and chat interfaces.  We’ve already launched a bot that allows users to search for local deals and will shortly launch a voice assistant interface to our shopping list app, Out of Milk.  We’ve learned a lot, and we’ll share more on those projects in other posts.

Conversations with Amazon Alexa

(Warning: this article will delve into technical design and code topics – if you’re not in the mood to geek-out you might want to skip this one.)
 
I’m excited about Alexa and its siblings in the voice assistant space – the conversational hands-free model will facilitate “micro moment” interactions to a degree that even mobile apps couldn’t.  These new apps and interactions can be quite powerful, but as the saying goes, “with great power comes great responsibility.”  In this case the responsibility is to build voice interfaces that don’t suck, and that’s not trivial.  We’ve all used bank or airline automated systems that infuriated us, either by being confusing, wasting our time, or leaving us stuck in “IVR hell”, unable to make ourselves understood or get where we want to go.
 
Fortunately there are solutions.  First, there is a UX specialty known as Voice User Interface Design (VUI Design), whose practitioners are highly skilled in the art, science, psychology, sociology and linguistics required to craft quality speech interactions.  Unfortunately they are rare and will likely be in extremely high demand as voice assistant skills blossom.
 
Second, there are online frameworks for developing speech interactions that fill much the same role as bumpers at the bowling alley – they won’t make you a better bowler, but they’ll protect you from some of the most egregious mistakes.  Perhaps the best tool on the market today is API.AI, which is primarily a natural language interpretation engine that can be the brains behind a variety of conversational interfaces – chat bots like Facebook Messenger and Telegram, voice assistants like Google Home, etc.
 
The Alexa Skills Kit (ASK) also comes with an online tool for developing interactions, but it’s quite primitive and cumbersome to use for anything but the simplest of skills.  Probably the biggest gap in the ASK is the lack of support for “slot filling”.  Slot filling is what speech interfaces do when they don’t get all the info needed to complete a task.  For example, let’s say you’re developing a movie ticket purchase skill.  In a perfect world every user would properly say, “I’d like two adult tickets to the 5:00 PM showing of Star Wars today.”  Given that our users will be rude and not behave the way we want them to, it’s likely they’ll say something like, “I want two tickets to Star Wars.”  It’s our skill’s responsibility to discover the [ ticket type ], [ showtime ], and [ show date ].  Our skill would likely next ask the user: “How many tickets do you want to buy?” and so on.  That’s slot filling.
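To make that concrete, here’s a minimal, platform-independent sketch of the bookkeeping behind slot filling, using the movie ticket example; the slot names and prompts are illustrative only:

REQUIRED_SLOTS = ["ticket_type", "showtime", "show_date"]

SLOT_PROMPTS = {
    "ticket_type": "How many tickets do you want to buy, and what type?",
    "showtime": "Which showtime would you like?",
    "show_date": "For which day?",
}

def next_prompt(filled_slots):
    # Return the prompt for the first missing required slot, or None if the
    # dialog is complete and can be handed to the business logic.
    for slot in REQUIRED_SLOTS:
        if slot not in filled_slots:
            return SLOT_PROMPTS[slot]
    return None

# "I want two tickets to Star Wars" fills none of the required slots yet:
print(next_prompt({"movie": "Star Wars"}))   # -> asks for ticket type
# Once everything is filled, there's nothing left to ask:
print(next_prompt({"ticket_type": "2 adult", "showtime": "5:00 PM", "show_date": "today"}))   # -> None

In practice the filled slots come partly from the current request and partly from the session context that the assistant echoes back on each turn – which is exactly what the approach below encapsulates.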
 
Alexa provides no native tools for managing slot filling, so it’s left to the developer to implement the functionality in their own service (which Alexa calls via “web hooks”).  Here’s the approach we use at Bonial:
 
  • Create a Conversation object (AlexaConversation) that encapsulates the current state of the dialog and the business logic for determining next steps.  The constructor takes the request model from Alexa, which includes a “Session” context.   Conversations expose three methods:
    1. get_status() – whether the current dialog is complete or not
    2. get_next_slot() – if the dialog is not complete, which slot needs to be filled next
    3. get_session_context() – the new session context JSON to be sent back to Alexa (and then returned to the app on the next call) – basically the dialog state
from abc import ABCMeta, abstractmethod


class Conversation:
    __metaclass__ = ABCMeta

    model = None
    status = None
    type = None

    # pass in the underlying model or data needed to assess the current state of the dialog
    def __init__(self, model):
        self.model = model

    # whether the current dialog is complete or not
    @abstractmethod
    def get_status(self):
        pass

    # if the dialog is not complete, which slot needs to be filled next
    @abstractmethod
    def get_next_slot(self):
        pass

    # the new session context (dialog state) to send back to Alexa
    @abstractmethod
    def get_session_context(self):
        pass
  • When a request from Alexa arrives, we simply create an AlexaConversation with the request JSON and ask whether the current dialog is complete or not.  If it is complete, we pass the dialog to the business logic layer for interpretation and processing (more on this later).  If not, we respond to Alexa with a prompt asking for the next slot.  Repeat.  A rough sketch of this loop follows.
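As a sketch only (not our production code): AlexaConversation is assumed to be a concrete subclass of the Conversation class above, process_completed_dialog stands in for the business logic layer, SLOT_PROMPTS is a prompt per slot as in the earlier sketch, and the response shape is abbreviated.

COMPLETE = "COMPLETE"   # hypothetical status value returned by get_status()

def handle_alexa_request(request_json):
    conversation = AlexaConversation(request_json)

    if conversation.get_status() == COMPLETE:
        # All slots filled: hand the dialog to the business logic layer.
        speech = process_completed_dialog(conversation)
    else:
        # Otherwise prompt for the next missing slot; Alexa will call back
        # with the user's answer and the session context we return below.
        speech = SLOT_PROMPTS[conversation.get_next_slot()]

    return {
        "sessionAttributes": conversation.get_session_context(),
        "response": {"outputSpeech": {"type": "PlainText", "text": speech}},
    }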
 
So far it’s working well and reduces the complexity of the processing code.  Unfortunately both the dialog rules (how many slots, which are required, in which order) and the slot prompts are hard-coded.  Our next step will be to move both of these into a declarative format so the VUI designers have the flexibility to edit them without involving the coders.
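Purely as an illustration, such a declarative spec might look something like the structure below (which could just as well live in JSON or YAML); the slot names and prompts are placeholders:

# Illustrative only: one possible declarative dialog spec that VUI designers
# could edit without touching the code.
TICKET_DIALOG = {
    "slots": [
        {"name": "ticket_type", "required": True,
         "prompt": "How many tickets do you want to buy, and what type?"},
        {"name": "showtime", "required": True,
         "prompt": "Which showtime would you like?"},
        {"name": "show_date", "required": True,
         "prompt": "For which day?"},
    ],
}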
 
We assume this will be a stop-gap until the ASK and other resources have proper slot-filling capabilities.  We’d also love to hear how you’re approaching this challenge.