What happens in the video above is a potential breakthrough in speech recognition technology that could help break down language barriers between countries and cultures.

Microsoft Chief Research Officer Rick Rashid is speaking to an audience of Chinese students. He says a sentence in English, and seconds later, a computer speaks his words in Mandarin AND in his own voice.

Rick Rashid demonstrates Microsoft’s new speech recognition technology.

As Rashid explains in this blog post and in his talk, the translation remains “far from perfect.” However, he says Microsoft and university researchers have been able to achieve “the most dramatic change in accuracy” since the late 1970s.

“We have been able to reduce the word error rate for speech by over 30% compared to previous methods,” Rashid writes. “This means that rather than having one word in 4 or 5 incorrect, now the error rate is one word in 7 or 8.”
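Word error rate is the standard accuracy metric Rashid is citing: the number of word-level edits (substitutions, insertions, deletions) needed to turn the system's transcript into the reference transcript, divided by the reference length. As a rough sanity check on his figures, here is a minimal sketch; the midpoint values of 1-in-4.5 and 1-in-7.5 are my own reading of his "4 or 5" and "7 or 8" ranges, not numbers from the post.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word in six ("a" for "the"):
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))

# Checking the quoted improvement (assumed midpoints of Rashid's ranges):
old_wer = 1 / 4.5   # roughly one word in 4 or 5 wrong
new_wer = 1 / 7.5   # roughly one word in 7 or 8 wrong
reduction = 1 - new_wer / old_wer
print(f"relative reduction: {reduction:.0%}")
```

Going from one error in ~4.5 words to one in ~7.5 works out to a 40% relative reduction, comfortably matching the "over 30%" claim.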

Rashid explains that Microsoft researchers teamed with the University of Toronto to use a technique called Deep Neural Networks, which is patterned after human brain behavior, to improve speech recognition. In the video, Rashid’s words are converted quickly to Mandarin on the adjacent screen.

But what got the big cheers from the audience of Chinese students was the computer's ability to speak the Mandarin translation back in Rashid's own voice. That was based on a system built by Microsoft researchers using speech from a native Chinese speaker and an hour of recordings of Rashid's past speeches.

“We hope in a few years that we’ll be able to break down the language barriers between people,” said Rashid during the speech, as the computer continued to speak his words in Mandarin. “Personally I believe this is going to lead to a better world.”

At the end of the talk, he says, “thank you,” and when the system speaks the words, you can hear the audience start to roar as the video ends.

Google is pursuing something similar with its Google Translate mobile app. However, that app does not speak the translation in the speaker's own voice.

The applications for this type of technology are endless. Say Japanese executives from Nintendo are meeting at Microsoft's Redmond headquarters and need to say something during a meeting. Use the translator. Or say you're ordering food in France and the waiter doesn't understand English. Use the translator.

I covered the Seattle Mariners last summer, and Ichiro rarely did interviews. When he did, he always insisted on speaking Japanese with his translator beside him despite the fact that he’s lived in Seattle for 11 years and speaks perfect English. This new technology could have allowed Ichiro to easily conduct interviews with the local media and possibly put his translator out of work.



  • Guest

    Why are they touting a breakthrough from two years ago? MS has spent more time and money investing in speech rec than anyone I know. Yet if you asked the average consumer who is the leader in this space they’d say Apple or Google.

    • guest

      Shareholder meeting coming up in a few weeks. Have to take their mind off the fact that the stock again performed worse than the market despite all the product announcements.

  • http://synergytranscriptionservices.com/Audio-Transcription.aspx Audio Transcription

    Automatic voice recognition technology is improving day by day, but it still isn't good enough to produce transcription as accurate as manual transcription services.

  • Rick

    Further proof of how amazing our brains are and how very far away we are from making something that is truly equal to the task – but maybe someday…
