Given an audio clip of former President Obama speaking about health care at a campaign event and an existing video of a weekly address in the Oval Office, for instance, a new UW system can synthesize a realistic, lip-synced video of Obama delivering the health care speech. (UW screen grab)

Researchers at the University of Washington are putting former President Barack Obama’s words into his own mouth to demonstrate breakthrough technology in the field of computer vision. The system turns audio clips into realistic-looking lip-synced video, suggesting that a moving face could one day be applied to historic audio recordings or used to improve video conferencing.

The results are detailed in a paper being presented Aug. 2 at SIGGRAPH 2017, a leading conference in computer graphics and interactive techniques. Obama was chosen as a subject because the machine-learning technique needs a hefty supply of available video to train itself on.

In the demonstration, video of Obama from a range of appearances is used to deliver audio he spoke on separate occasions. A recurrent neural network, trained on many hours of the president’s weekly-address footage, learns the mapping from raw audio features to mouth shapes. Researchers Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman summarize it like so:

Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.
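
As a rough illustration of the recurrent-network step described above, here is a minimal PyTorch sketch. This is not the authors’ implementation: the feature sizes, the MFCC-style audio input, and the PCA mouth-shape coefficients are assumptions made for clarity, and the paper’s actual architecture and training setup differ.

```python
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """Sketch of an RNN mapping per-frame audio features to mouth-shape
    coefficients. All dimensions are illustrative, not from the paper."""

    def __init__(self, n_audio_feats=28, n_mouth_coeffs=20, hidden=128):
        super().__init__()
        # A unidirectional LSTM consumes one audio feature vector per
        # video frame (e.g. MFCC-style features) in temporal order.
        self.lstm = nn.LSTM(n_audio_feats, hidden, batch_first=True)
        # A linear head regresses a compact mouth-shape code per frame
        # (e.g. PCA coefficients of tracked lip landmark positions).
        self.head = nn.Linear(hidden, n_mouth_coeffs)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, n_audio_feats)
        hidden_states, _ = self.lstm(audio_feats)
        return self.head(hidden_states)  # (batch, time, n_mouth_coeffs)

# Toy usage: 2 clips, 100 frames each, trained with a simple L2 loss
# against mouth shapes tracked in the training footage.
model = AudioToMouth()
audio = torch.randn(2, 100, 28)
target = torch.randn(2, 100, 20)
loss = nn.functional.mse_loss(model(audio), target)
loss.backward()
```

In the full pipeline, the network’s per-frame mouth shapes would then feed the texture-synthesis and compositing stages described in the quote above.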

“These types of results have never been shown before,” Kemelmacher-Shlizerman, an assistant professor at the UW’s Paul G. Allen School of Computer Science & Engineering, said in a UW news release. “Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. This is the kind of breakthrough that will help enable those next steps.”

The findings could also lead to improved video chat performance, since streaming audio over the internet takes up far less bandwidth than video.

“When you watch Skype or Google Hangouts, often the connection is stuttery and low-resolution and really unpleasant, but often the audio is pretty good,” said Seitz, co-author of the paper and an Allen School professor. “So if you could use the audio to produce much higher-quality video, that would be terrific.”
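
To put rough numbers on that gap (the bitrates below are common codec defaults, not figures from the researchers), a quick back-of-the-envelope comparison:

```python
# Rough, illustrative bitrates: common defaults, not values cited
# by the UW team.
voice_audio_kbps = 32      # e.g. Opus at voice quality
video_720p_kbps = 1500     # e.g. an H.264 720p video-call stream

print(f"video needs ~{video_720p_kbps / voice_audio_kbps:.0f}x "
      f"the bandwidth of audio")
# video needs ~47x the bandwidth of audio
```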

The new method improves on previous audio-to-video conversion processes, which involved filming multiple people in a studio saying the same sentences over and over to capture how a particular sound correlates with different mouth shapes, the UW says. Suwajanakorn developed algorithms that can instead learn from videos that already exist “in the wild” on the internet or elsewhere.

“There are millions of hours of video that already exist from interviews, video chats, movies, television programs and other sources. And these deep learning algorithms are very data hungry, so it’s a good match to do it this way,” Suwajanakorn said.

The technique also builds on the team’s previous research, which generated a good deal of attention near the end of 2015 and won Innovation of the Year at the 2016 GeekWire Awards.

A neural network first converts the sounds from an audio file into basic mouth shapes. Then the system grafts and blends those mouth shapes onto an existing target video and adjusts the timing to create a realistic, lip-synced video of the person delivering the new speech. (UW Graphic)
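
The “grafts and blends” step the caption describes amounts to compositing the synthesized mouth region into each target frame. Below is a minimal sketch using OpenCV’s Poisson blending; the file names, the all-white mask, and the fixed mouth position are placeholders, and the real system also warps the texture to match the target head’s 3D pose frame by frame.

```python
import cv2
import numpy as np

# Placeholder inputs: a synthesized mouth texture for one time step and
# the target video frame it should be grafted onto.
mouth_texture = cv2.imread("synthesized_mouth.png")   # hypothetical file
target_frame = cv2.imread("target_frame.png")         # hypothetical file

# Binary mask marking which pixels of the texture to keep (white = keep).
mask = np.full(mouth_texture.shape[:2], 255, dtype=np.uint8)

# Where the mouth should land in the target frame. In the real system
# this comes from 3D pose tracking of the target head; here it is fixed.
center = (target_frame.shape[1] // 2, int(target_frame.shape[0] * 0.7))

# Poisson (seamless) blending grafts the texture onto the frame so the
# seam between synthesized and original pixels is not visible.
composite = cv2.seamlessClone(mouth_texture, target_frame, mask,
                              center, cv2.NORMAL_CLONE)
cv2.imwrite("composited_frame.png", composite)
```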

It’s fascinating to watch an original clip of Obama delivering the audio, then shift your focus to the unrelated target video: you hear the same words without being distracted by the mouth movements the UW algorithm has created. That’s especially true when audio from a much younger Obama is synced to his more recent face.

UW says that, for now, the neural network is designed to learn from one individual at a time, meaning that Obama’s voice, speaking his actual words, is the only information used to “drive” the synthesized video.

“You can’t just take anyone’s voice and turn it into an Obama video,” Seitz said. “We very consciously decided against going down the path of putting other people’s words into someone’s mouth. We’re simply taking real words that someone spoke and turning them into realistic video of that individual.”
