Researchers at Alibaba’s Institute for Intelligent Computing have developed a new artificial intelligence system called “EMO,” short for Emote Portrait Alive, that can animate a single portrait photo and generate videos of the person talking or singing in a remarkably lifelike fashion.

The system, described in a research paper published on arXiv, creates fluid and expressive facial movements and head poses that closely match the nuances of a provided audio track. This represents a major advance in audio-driven talking head video generation, an area that has challenged AI researchers for years.

“Traditional techniques often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles,” said lead author Linrui Tian in the paper. “To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks.”

Directly converts audio to video

The EMO system employs an AI technique known as a diffusion model, which has proven remarkably effective at generating realistic synthetic imagery. The researchers trained the model on a dataset of more than 250 hours of talking head video curated from speeches, films, TV shows, and singing performances.
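
For readers less familiar with the technique, the sketch below shows a standard DDPM-style sampling loop (Ho et al., 2020) in PyTorch. The audio and identity conditioning arguments are hypothetical stand-ins for the kinds of signals EMO is described as using; this illustrates the general diffusion approach, not the paper’s actual architecture.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, audio_feats, ref_frame, steps=1000):
    """Denoise Gaussian noise into an image, conditioned on audio features
    and a reference portrait. `model`, `audio_feats`, and `ref_frame` are
    hypothetical stand-ins, not EMO's actual interface."""
    betas = torch.linspace(1e-4, 0.02, steps)      # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(ref_frame)                # start from pure noise
    for t in reversed(range(steps)):
        # The network predicts the noise present in x at timestep t.
        eps = model(x, torch.tensor([t]), audio_feats, ref_frame)
        # Standard DDPM reverse update toward the posterior mean.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x
```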

Unlike previous methods that rely on 3D face models or blend shapes to approximate facial movements, EMO directly converts the audio waveform into video frames. This allows it to capture subtle motions and identity-specific quirks associated with natural speech.
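
To make “direct audio-to-video” concrete, here is a minimal sketch of how raw audio might be turned into per-frame conditioning features using a pretrained speech encoder, with no 3D face model or landmark stage in between. The choice of wav2vec 2.0 via torchaudio and the interpolation to one feature vector per video frame are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torchaudio

def audio_to_frame_features(waveform, sample_rate, fps=25):
    """Map a raw waveform (batch, samples) to one feature vector per
    video frame. The encoder choice is an assumption for illustration."""
    bundle = torchaudio.pipelines.WAV2VEC2_BASE
    encoder = bundle.get_model().eval()
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, bundle.sample_rate)
    with torch.no_grad():
        features, _ = encoder.extract_features(waveform)
    feats = features[-1]                      # (batch, time, dim), last layer
    n_frames = int(waveform.shape[-1] / bundle.sample_rate * fps)
    # Resample the feature sequence so each video frame gets one vector.
    feats = torch.nn.functional.interpolate(
        feats.transpose(1, 2), size=n_frames,
        mode="linear", align_corners=False).transpose(1, 2)
    return feats                              # (batch, n_frames, dim)
```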

According to experiments described in the paper, EMO significantly outperforms existing state-of-the-art methods on metrics measuring video quality, identity preservation, and expressiveness. The researchers also conducted a user study that found the videos generated by EMO to be more natural and emotive than those produced by other systems.

Generates realistic singing videos

Beyond conversational videos, EMO can also animate singing portraits with appropriate mouth shapes and evocative facial expressions synchronized to the vocals. The system can generate videos of arbitrary duration, determined by the length of the input audio.
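
One simple way to support arbitrary duration is to generate the video clip by clip, carrying the last few frames forward as context so consecutive clips join smoothly. The sketch below illustrates that idea; the carry-over mechanism and the `generate_clip` function are assumptions for illustration, not the paper’s exact design.

```python
def generate_long_video(generate_clip, audio_chunks, ref_frame, context_len=4):
    """Generate an arbitrarily long video from chunked audio.
    `generate_clip(chunk, ref_frame, context)` is a hypothetical function
    returning a list of frames for one audio chunk."""
    frames, context = [], [ref_frame] * context_len
    for chunk in audio_chunks:
        clip = generate_clip(chunk, ref_frame, context)
        frames.extend(clip)
        context = clip[-context_len:]   # seed the next clip for continuity
    return frames
```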

“Experimental results demonstrate that EMO is able to produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism,” the paper states.

The EMO research hints at a future where personalized video content can be synthesized from just a photo and an audio clip. However, ethical concerns remain about potential misuse of such technology to impersonate people without consent or spread misinformation. The researchers say they plan to explore methods to detect synthetic video.
