As we slide into 2017, speech recognition is all the rage – it was the darling of CES and you can’t pick up a business or tech journal without reading about the phenomenon. The sudden explosion of interest and coverage could lead one to assume this is yet another hype bubble waiting to burst. I don’t think so, and I’ll explain why below. But first let’s roll back the calendar and look at the evolution of the technology… 2016… 2015… 2014… 2010… 2005…
In the late 1990s and early 2000s, speech recognition was a hot topic. Powered by early versions of machine-learning algorithms and improvements in processing power, and boosted by the advent of VoiceXML, which brought web methods to voice, pundits preached of the day when a “voice web” would rise to rival the dot-com bubble.
It never happened.
Implementers quickly found an Achilles’ heel in speech interfaces: the single-dimensional information stream of audio was no match for the two-dimensional visual presentation of the web and apps – it was simply too cumbersome to consume large quantities of information by ear. This relegated speech to “desperation” scenarios where a visual interface simply wasn’t an option, or to human-automation scenarios (e.g. call centers).
Fast forward a decade and a half to Siri which, for all that it’s been maligned, was a watershed moment for speech. It arrived with reasonable accuracy and a set of natural use cases (hands-free driving, message dictation, cold-weather operation, etc.). It took speech mainstream.
What Siri started, Amazon Echo took to the next level. Rather than requiring users to interrupt the natural flow of their lives to fiddle with a phone, Alexa is always on and ready to go (so long as you’re near it, of course). This means Alexa enables Micro-Moments and integrates into one’s normal flow of life. The importance of that can’t be overstated.
Over the last six months the other tech giants have been falling over themselves to respond to Echo’s market success and the surprising stats coming in from the field: 20% of mobile searches via speech, 42% of people using voice assistants, etc. Google recently released “Home” and is plugging Assistant into its Pixel phone and other devices. Facebook and others are trailing close behind. And Apple is working to regain its early lead by freeing Siri from the confines of the phone and laptop.
So where’s it all going?
To speculate on that, we should probably look at why consumer speech recognition is succeeding this time. First, improvements in processing power and in neural-network / “deep learning” algorithms dropped the cost and radically improved the accuracy of speech recognition. This has allowed speech + AI to creep subtly into more and more user-facing apps (e.g. Google Translate), which both conditioned users and helped train the speech engines. The technology is still limited to single-dimensional streams, but the enormous popularity of chat and, more recently, bots shows that there is plenty of attraction to this single dimension.
But speech is still limited – for example, the best recognition engines need a network connection to reach cloud resources, and the noisy environments common to cityscapes continue to confound them. This is why the Echo approach is genius – an always-on device with a great microphone in a (somewhat) quiet environment. But will speech move beyond the use case of playing music or getting the weather?
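Before answering that, it’s worth making the Echo pattern concrete. The sketch below shows the two-stage shape most far-field devices follow: a tiny, always-on wake-word detector runs locally, and only once it fires does audio get shipped to a cloud recognizer. This is a rough sketch of the general architecture, not any vendor’s actual API – `detectWakeWord`, `recognizeInCloud`, and the 50-frame end-of-utterance cutoff are all hypothetical placeholders.

```typescript
// Hypothetical always-on pipeline: local wake-word gate + cloud recognition.
// Real devices (Echo, Home) differ in detail but follow this general shape.

type AudioFrame = Float32Array;

// Stage 1: tiny on-device model, runs continuously, needs no network.
function detectWakeWord(frame: AudioFrame): boolean {
  // Placeholder: a real detector scores each frame against a keyword model.
  return frame.length > 0 && frame[0] > 0.9;
}

// Stage 2: only invoked after the wake word fires; needs connectivity.
async function recognizeInCloud(utterance: AudioFrame[]): Promise<string> {
  // Placeholder for a network call to a hosted speech-to-text service.
  return "play some music";
}

async function listenLoop(micStream: AsyncIterable<AudioFrame>): Promise<void> {
  const buffer: AudioFrame[] = [];
  let awake = false;
  for await (const frame of micStream) {
    if (!awake) {
      awake = detectWakeWord(frame); // cheap, local, always on
    } else {
      buffer.push(frame); // capture the command...
      if (buffer.length >= 50) { // ...until some end-of-utterance heuristic
        console.log(`Heard: "${await recognizeInCloud(buffer)}"`);
        buffer.length = 0;
        awake = false;
      }
    }
  }
}
```

The split is what makes the economics work: the expensive, accurate model lives in the cloud, while the device itself only has to be good at one tiny task.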
Yes. Advanced headphones like the Apple AirPods will expand “always on” beyond the home. Improved algorithms will handle noisy environments. But perhaps most important – multi-modal interfaces are now eminently possible.
What’s multi-modal? Basically, an interaction that spans multiple interfaces simultaneously. For example, you might start an interaction via voice but complete it on a mobile device – like asking your voice assistant to read your email headers, then forwarding an email of interest to the nearest screen to be read and responded to. Fifteen years ago there simply weren’t many options for bouncing between speech and graphical interfaces. Today, the ubiquity of connected smartphones changes the equation.
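To make that handoff concrete, here’s a minimal sketch of the email example above. None of these interfaces correspond to a real assistant SDK – `EmailHeader`, `Screen`, and `findNearestScreen` are invented purely to show the shape of a voice-to-screen handoff.

```typescript
// Hypothetical multi-modal handoff: a voice interaction that completes on a screen.

interface EmailHeader {
  id: string;
  from: string;
  subject: string;
}

interface Screen {
  name: string;
  show(content: EmailHeader): void;
}

// In a real system this would query nearby devices (e.g. via a device registry);
// here we just pick the first screen we know about.
function findNearestScreen(screens: Screen[]): Screen | undefined {
  return screens[0];
}

// Voice phase: triage the inbox aloud (stubbed out as console output).
function readHeadersAloud(headers: EmailHeader[]): void {
  for (const h of headers) {
    console.log(`Speaking: email from ${h.from} about "${h.subject}"`);
  }
}

// Handoff phase: the user says "send that one to my screen".
function handoffToScreen(header: EmailHeader, screens: Screen[]): void {
  const screen = findNearestScreen(screens);
  if (screen) {
    screen.show(header); // the interaction continues visually
  } else {
    console.log("No screen nearby; staying in voice mode.");
  }
}

// Usage
const inbox: EmailHeader[] = [
  { id: "1", from: "alice@example.com", subject: "Q1 report" },
  { id: "2", from: "bob@example.com", subject: "Lunch?" },
];
const phone: Screen = {
  name: "phone",
  show: (c) => console.log(`Displaying "${c.subject}" on phone for reply`),
};

readHeadersAloud(inbox);
handoffToScreen(inbox[1], [phone]);
```

The design point is the division of labor: the voice channel handles low-bandwidth triage, and the screen handles high-bandwidth reading and replying – exactly the fix the single-dimensional-stream problem calls for.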
Will this be the last big wave of speech? No. Until the speech interface is backed by full AI it can’t reach its full potential. Likewise, there’s still a lot of runway in terms of interface devices. But this time it’s here to stay.