I have a confession: despite being in management I still love to code. Since I don’t get to program as much as I’d like or stay up on the latest trends and technologies, I set a goal for myself to learn at least one new technology every year (and more than one on a good year). This learning hobby is how I made the leap from back-end to full-stack developer, how I learned iOS and Android, and how I stepped into the hallowed halls of Data Science.
This year I decided to explore chat bots and voice assistants. As I learn best by doing, I generally think up a fun or useful project and then learn through building it. For this project I decided to tackle an unending source of stress in my household: bickering and arguing over screen time for our kids.
Enter ChronosBot
The idea behind ChronosBot is simple. Parents set up screen-time accounts for each child as well as an automatic allowance that puts time in the accounts. After linking their account to Alexa, Google Assistant, Facebook Messenger, etc., they can say or write things like, “Alexa, ask ChronosBot to withdraw 30 minutes from Axel’s account” or “… what’s everyone’s balance?”
With the idea in place, I had to choose my tech stack. Google has a robust platform built on API.AI, which supports a dozen or so chat integrations (Allo, Messenger, Telegram, Kik, etc.) as well as a voice interface for Google Home, allowing developers to (theoretically) write one interface for both voice and chat. At the time I started, Amazon Alexa had a rudimentary platform for speech dialog development using structured text. On both platforms the interface designer creates “intents” that match what the user says to something the bot can do and then provides appropriate responses, and both platforms hand off the business logic to a backend app using webhooks.
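To make that concrete, here’s a toy sketch of what an intent definition boils down to. I’m using plain Python purely for illustration – the real platforms use their own consoles and JSON schemas – and the intent names, sample utterances, and slot types below are hypothetical, not ChronosBot’s actual configuration.

```python
# Toy illustration of the "intent" concept; neither platform's actual schema.
# The platform matches what the user says against the sample utterances,
# extracts the slot values, and then POSTs the matched intent plus its slots
# to your webhook, where the business logic lives.
INTENTS = {
    "WithdrawTime": {
        "utterances": [
            "withdraw {minutes} minutes from {child}'s account",
            "take {minutes} minutes away from {child}",
        ],
        "slots": {"minutes": "NUMBER", "child": "FIRST_NAME"},
    },
    "GetBalance": {
        "utterances": [
            "what's everyone's balance",
            "how much time does {child} have left",
        ],
        "slots": {"child": "FIRST_NAME"},
    },
}
```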
For the backend, I decided to sharpen my Python skills and implement the service in Django on top of Postgres. For deployment I decided to give Heroku a try.
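For the curious, the backend for something like this can be pretty small. Below is a hypothetical sketch of what the Django side might look like – models plus the webhook view – with illustrative model, field, and payload names rather than the real ChronosBot schema (the actual request and response formats also differ between Alexa and API.AI).

```python
# Hypothetical Django sketch: screen-time accounts plus the webhook endpoint.
# Model, field, and payload names are illustrative, not ChronosBot's real schema.
import json

from django.db import models
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt


class Child(models.Model):
    name = models.CharField(max_length=50)
    weekly_allowance_minutes = models.PositiveIntegerField(default=120)


class Transaction(models.Model):
    """A deposit (allowance, reward, mystery bonus) or withdrawal (screen time, penalty)."""
    child = models.ForeignKey(Child, on_delete=models.CASCADE, related_name="transactions")
    minutes = models.IntegerField()  # positive = deposit, negative = withdrawal
    reason = models.CharField(max_length=100, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)


def balance(child):
    """The current balance is simply the sum of all transactions."""
    return sum(t.minutes for t in child.transactions.all())


@csrf_exempt
def webhook(request):
    """Entry point the voice/chat platform calls once it has matched an intent."""
    payload = json.loads(request.body)
    intent, slots = payload.get("intent"), payload.get("slots", {})
    if intent == "WithdrawTime":
        child = Child.objects.get(name=slots["child"])
        Transaction.objects.create(child=child, minutes=-int(slots["minutes"]), reason="screen time")
        reply = f"{slots['minutes']} minutes withdrawn. {child.name} has {balance(child)} minutes left."
    else:
        reply = "Sorry, I didn't understand that."
    return JsonResponse({"speech": reply})
```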
Development of the basic use cases took me a couple of weeks of late evenings and weekends. I submitted to both Amazon and Google and waited a week or so in each case for the review. Both rejected my app, but for reasons I hadn’t expected. Amazon told me that my app violated the Alexa terms of service because it “targeted children” (huh?) and told me never to resubmit the app (it seems they have since relented). Google gave me the boot because my invocation name couldn’t be recognized properly, but a very helpful person from Google worked with me to resolve the issue and now it’s live.
I’ve since continued development and added new features like “rewards” and “penalties” (requested by my wife) and “mystery bonus” (requested by the kids). I’ve enabled Telegram and Messenger and have adapted the platform to support both visual and audio surfaces. And the Alexa version was finally approved earlier this week.
Lessons Learned
So, what have I learned while navigating the ins and outs of the Google and Alexa development platforms and publication processes?
1) Amazon and Google have very different approaches. Google has taken the bold approach of enabling all community-developed actions and using an intent-matching algorithm to route users to the correct action. Amazon requires users to enable specific skills via its Skills Store. In both cases, discovery is a largely unsolved challenge.
2) Too early to tell who will be king. Amazon Alexa has a crazy head start, but Google’s seems to be the more robust speech development platform. With a zillion Android devices already on the market, one certainly can’t count Google out. On the other hand, not a month seems to go by without a new Alexa form factor hitting the market.
3) It’s early days. Both platforms are being developed at a lightning-fast pace. Google had a big head start with API.AI. The original Alexa interface was frustratingly primitive, but it has since been upgraded to a new UI (which bears a suspiciously strong resemblance to API.AI) that holds great promise.
I have to take my hat off to both companies for creating a paradigm and ecosystem that make voice assistant and natural-language development accessible to the broader development community. It’s so straightforward that even my kids gave it a try – my daughter (10) developed “The Oracle,” which answers deeply profound questions like “Who’s awesome?” (she’s awesome). My son (12) wrote a math quiz game and is happy to challenge anyone to beat his top score.
4) Conversational UX is easy; good conversational UX is really hard. I’ve known this since I was involved with Nuance and the voice web in the late 1990s (and I also happen to be married to an expert in the space). Making it easy to build a conversational UX is a very different thing from helping developers build a high-quality conversational UX (especially a voice UX). Both Amazon and Google have tried to address this with volumes of best-practice documentation, but I expect most developers will ignore it.
5) Conversational UX is limited. Some use cases work for serial interactions (voice or chat), and some work better as parallel interactions (visual). Trying to force one into the other typically doesn’t make sense, or only applies to “desperate users.” You can see the effect of this to some degree already in the Alexa Skills Store – some clear clusters are evolving (home automation, information retrieval, quiz games).
6) Multi-modal UX is the next natural step. I’m very excited about the Amazon Echo Show as I expect that will unleash a wave of interesting multi-modal interaction paradigms.
7) It’s fun. There’s just something about the natural language element of voice assistants that allows for a richer, more human interaction than what GUIs can provide.
All in all I’m really excited about the potential of this space, and I’m not alone – just look at the growth of the Alexa Skills Store. The tech press is also taking a critical look at these capabilities (e.g. a recent article featuring yours truly) and I expect most companies are at least thinking about how these capabilities will play in their business. My company, Bonial, is investing in several actions/skills to explore the potential of voice and chat interfaces. To date we’ve already launched a bot that allows users to search for local deals and will shortly launch a voice assistant interface to our shopping list app, Out of Milk. We’ve learned a lot and we’ll share more on those projects in other posts.