Microsoft says its speech recognition technology has achieved a new industry milestone, reducing its error rate to 5.1 percent, matching the error rate of multiple human transcribers in a widely recognized accuracy test.
The new result, announced this evening by the company’s Artificial Intelligence & Research Group, beats both Microsoft’s previous low of 5.9 percent, reported last year, and the 5.5 percent error rate announced by IBM earlier this year.
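The figures above are word error rates: the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. As a rough illustration only, here is a minimal Python sketch of that calculation. The function name and the whitespace tokenization are assumptions made for this example; real benchmark scoring also normalizes the text, which this sketch does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative helper, not an official scoring tool: words are split on
    whitespace and compared exactly, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

By this measure, a 5.1 percent error rate corresponds to roughly one wrong word in every 20.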
The Microsoft Research group’s speech recognition work provides underlying technology used in products including its Cortana virtual assistant, Presentation Translator, and Microsoft Cognitive Services.
In the latest tests, Microsoft reduced its error rate through “a series of improvements to our neural net-based acoustic and language models,” Microsoft technical fellow Xuedong Huang says in a post explaining the achievement.
This is part of a broader effort by Microsoft to advance the state of the art in artificial intelligence, and bring those new approaches to market. Under CEO Satya Nadella, Microsoft last year formed a new 5,000-person Artificial Intelligence & Research Group as a fourth engineering division inside the company, along with the Office, Windows and cloud groups.
Microsoft competes against Amazon, Apple, IBM, Google and other major technology players in artificial intelligence and the cloud. The Redmond company’s new vision statement specifically adds a reference to artificial intelligence, saying its strategy is to build “best-in-class platforms and productivity services for an intelligent cloud and an intelligent edge infused with artificial intelligence.”
Here’s how Huang explains what the team did to reach this latest milestone:
We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modeling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels.
Moreover, we strengthened the recognizer’s language model by using the entire history of a dialog session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation.
Our team also has benefited greatly from using the most scalable deep learning software available, Microsoft Cognitive Toolkit 2.1 (CNTK), for exploring model architectures and optimizing the hyper-parameters of our models. Additionally, Microsoft’s investment in cloud compute infrastructure, specifically Azure GPUs, helped to improve the effectiveness and speed by which we could train our models and test new ideas.
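Huang's post does not spell out how the predictions from multiple acoustic models are merged. One common way to combine models at the frame level, shown here purely as an assumption and not as Microsoft's method, is to take a weighted average of each model's per-frame probabilities over senones (the context-dependent sound units the acoustic model predicts). The function name, the weights and the toy numbers below are all hypothetical.

```python
def combine_frame_posteriors(model_outputs, weights=None):
    """Weighted average of per-frame senone probabilities across models.

    model_outputs: one entry per model; each entry is a list of frames, and
    each frame is a list of senone probabilities. All models are assumed to
    cover the same frames and the same senone set.
    weights: optional per-model weights; defaults to equal weighting.
    """
    n_models = len(model_outputs)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")

    combined = []
    for t in range(len(model_outputs[0])):
        n_senones = len(model_outputs[0][t])
        frame = [
            sum(w * model[t][s] for w, model in zip(weights, model_outputs))
            for s in range(n_senones)
        ]
        total = sum(frame)
        # Renormalize so each combined frame is still a probability distribution.
        combined.append([p / total for p in frame])
    return combined


if __name__ == "__main__":
    # Two toy models, one frame, two senones.
    model_a = [[0.8, 0.2]]
    model_b = [[0.4, 0.6]]
    print(combine_frame_posteriors([model_a, model_b]))
```

Word-level combination, the second stage Huang mentions, would operate on the competing transcripts rather than on these per-frame scores and is not shown here.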
The Microsoft researchers further document their improved speech recognition system in a technical report.