In what it calls a “major breakthrough in speech recognition,” Microsoft has built technology that can decipher conversation as well as people can.
A group of researchers and engineers with Microsoft’s Artificial Intelligence and Research team published a paper Monday on a computer system that makes about the same number of errors as professional transcriptionists, or fewer.
That doesn’t mean the system is perfect, just that it made no more mistakes than humans do. Professional transcriptionists have a word error rate of 5.9 percent, and the research team came close to matching that about a year ago. Microsoft reached parity with human transcriptionists by employing “neural language models in which words are represented as continuous vectors in space, and words like ‘fast’ and ‘quick’ are close together,” according to the blog post.
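For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the system’s output, divided by the number of words in the reference. The sketch below is a minimal illustrative implementation of that standard definition, not Microsoft’s actual scoring code; the example sentences are invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word ("his" for "this") out of ten reference words
# gives a 10 percent word error rate.
print(wer("call the office as soon as you get this message",
          "call the office as soon as you get his message"))  # 0.1
```

By this measure, a system at 5.9 percent misrecognizes roughly one word in seventeen, which is why matching that figure puts the software on par with professional transcriptionists.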
Microsoft said this breakthrough has wide-ranging applications across its products, from Xbox to an instant voice-to-text service to a much smarter version of its digital assistant, Cortana.
Microsoft has been researching speech technology for a long time. Before this, its biggest recent breakthrough was Skype Translator, which lets people who speak different languages converse over Skype. The next step, Microsoft says, is to make speech recognition systems work well in real-life settings, such as at parties or while a user is driving on the highway.
Microsoft’s next long-term goal involves moving from recognizing speech to understanding it. That would mean computers could answer questions and react to speakers. While the prevalence of digital assistants like Cortana and Amazon’s Alexa may make it seem like that kind of technology isn’t far off, Microsoft says fully reactive artificial intelligence is a big lift.
“It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown,” Microsoft Executive Vice President Harry Shum said in the blog post.