As enterprises continue to double down on generative AI, technology providers are racing to build more capable offerings for them. Case in point: Lumiere, a space-time diffusion model proposed by researchers from Google, the Weizmann Institute of Science and Tel Aviv University for realistic video generation.
The paper detailing the technology has just been published, although the models remain unavailable to test. If that changes, Google could introduce a very strong contender in the AI video space, which is currently dominated by players like Runway, Pika and Stability AI.
The researchers claim the model takes a different approach from existing players and synthesizes videos that portray realistic, diverse and coherent motion – a pivotal challenge in video synthesis.
What can Lumiere do?
At its core, Lumiere, whose name is French for light, is a video diffusion model that lets users generate realistic and stylized videos and edit them on command.
Users can provide text prompts describing what they want in natural language, and the model generates a video portraying it. They can also upload an existing still image and add a prompt to turn it into a dynamic video. The model supports additional features as well, such as video inpainting, which fills in or replaces masked regions of a video according to text prompts; cinemagraphs, which add motion to a specific part of a scene while the rest stays still; and stylized generation, which takes the style of a reference image and generates videos in that style.
“We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation,” the researchers noted in the paper.
While these capabilities are not new in the industry and have been offered by players like Runway and Pika, the authors claim that most existing models tackle the added temporal dimension of video generation (the axis representing change over time) with a cascaded approach: a base model first generates distant keyframes, and temporal super-resolution (TSR) models then fill in the missing frames between them in non-overlapping segments. This works, but it makes temporal consistency difficult to achieve, often restricting the video duration, overall visual quality and degree of realistic motion the models can generate.
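To make that contrast concrete, here is a minimal, purely illustrative Python sketch of such a cascaded pipeline. The `base_model` and `tsr_model` objects and their methods are hypothetical placeholders, not any vendor's real API.

```python
# Conceptual sketch (not Lumiere's code) of the cascaded approach the paper
# contrasts with: a base model produces sparse keyframes, then temporal
# super-resolution (TSR) models fill in frames between neighboring keyframes,
# one non-overlapping window at a time. All names are illustrative placeholders.

def cascaded_generation(prompt, base_model, tsr_model,
                        num_keyframes=16, frames_per_gap=4):
    # 1. The base model generates temporally distant keyframes for the prompt.
    keyframes = base_model.generate(prompt, num_frames=num_keyframes)

    # 2. Each pair of adjacent keyframes is expanded independently, so the
    #    TSR model never sees the whole clip at once. This is where temporal
    #    inconsistencies can creep in across window boundaries.
    video = []
    for start, end in zip(keyframes[:-1], keyframes[1:]):
        segment = tsr_model.interpolate(start, end, num_frames=frames_per_gap)
        video.extend([start, *segment])
    video.append(keyframes[-1])
    return video
```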
Lumiere, for its part, addresses this gap by using a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, in a single pass through the model, leading to more realistic and coherent motion.
“By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales,” the researchers noted in the paper.
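As a rough illustration of what down- and up-sampling along both space and time means in practice, here is a toy PyTorch sketch, not Google's implementation, with arbitrary layer choices assumed for demonstration: a 3D convolution halves the clip along its frame, height and width axes, and a transposed convolution restores them, so the whole clip is processed in one pass.

```python
# Toy sketch of space-time down/up-sampling, the core idea behind a
# Space-Time U-Net. This is an illustrative module, not Lumiere's code.
import torch
import torch.nn as nn

class TinySpaceTimeUNet(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Downsample in space AND time: stride 2 over (frames, height, width).
        self.down = nn.Conv3d(3, channels, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # Upsample back to the full frame rate and resolution.
        self.up = nn.ConvTranspose3d(channels, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, video):
        # video: (batch, 3, frames, height, width) — the entire clip at once.
        h = torch.relu(self.down(video))
        h = torch.relu(self.mid(h))
        return self.up(h)

# The full 80-frame clip flows through the network in a single pass, rather
# than being stitched together from separately generated segments.
x = torch.randn(1, 3, 80, 64, 64)
out = TinySpaceTimeUNet()(x)
print(out.shape)  # torch.Size([1, 3, 80, 64, 64])
```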
The video model was trained on a dataset of 30 million videos, along with their text captions, and is capable of generating 80 frames at 16 frames per second, or roughly five seconds of video. The source of this data, however, remains unclear at this stage.
Performance against known AI video models
When comparing the model with offerings from Pika, Runway, and Stability AI, the researchers noted that while these models produced high per-frame visual quality, their four-second-long outputs had very limited motion, leading to near-static clips at times. ImagenVideo, another player in the category, produced reasonable motion but lagged in terms of quality.
“In contrast, our method produces 5-second videos that have higher motion magnitude while maintaining temporal consistency and overall quality,” the researchers wrote. They added that users surveyed on the quality of these models also preferred Lumiere over the competition for both text-to-video and image-to-video generation.
While this could be the beginning of something new in the rapidly moving AI video market, it is important to note that Lumiere is not available to test yet. The company also notes that the model has certain limitations: it cannot generate videos consisting of multiple shots or involving transitions between scenes, something that remains an open challenge for future research.