AI companies are racing to master the art of video generation. Over the last few months, several players in the space, including Stability AI and Pika Labs, have released models that produce different types of videos from text and image prompts. Building on that work, Microsoft AI has dropped a model that aims to deliver more granular control over the production of a video.
Dubbed DragNUWA, the project supplements the familiar text- and image-based prompting with trajectory-based generation, letting users manipulate objects or entire video frames along specific paths. The result is highly controllable video generation across semantic, spatial and temporal aspects, while maintaining high-quality output.
Microsoft has open-sourced the model weights and demo for the project, allowing the community to try it out. However, it is important to note that this is still a research effort and remains far from perfect.
What makes Microsoft DragNUWA unique?
Historically, AI-driven video generation has revolved around text, image or trajectory-based inputs. These methods produce impressive results, but each on its own has struggled to deliver fine-grained control over the desired output.
The combination of text and images alone, for instance, fails to convey the intricate motion details present in a video. Meanwhile, images and trajectories may not adequately represent future objects, and language can be ambiguous when expressing abstract concepts; an example would be failing to differentiate between a real-world fish and a painting of a fish.
To work around this, in August 2023, Microsoft’s AI team proposed DragNUWA, an open-domain diffusion-based video generation model that brought together all three factors – images, text and trajectory – to facilitate highly controllable video generation from semantic, spatial and temporal aspects. This allows the user to strictly define the desired text, image and trajectory in the input to control aspects like camera movements, including zoom-in or zoom-out effects, or object motion in the output video.
For instance, one could upload the image of a boat in a body of water, add the text prompt "a boat sailing in the lake" and mark the boat's trajectory with directional strokes. This would produce a video of the boat sailing in the marked direction: the trajectory supplies the motion details, the language describes future objects and the image distinguishes between objects.
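Conceptually, every generation request packages these three signals together. The sketch below shows one way such an input could be represented in code; the class and field names are illustrative assumptions made for this article, not DragNUWA's actual interface.

```python
# A minimal sketch of the three conditioning signals DragNUWA combines.
# The class and field names are illustrative assumptions, not the model's API.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VideoCondition:
    image_path: str          # starting frame, e.g. a boat on a lake
    text: str                # semantic description of the scene
    trajectories: List[List[Tuple[int, int]]] = field(default_factory=list)  # one (x, y) path per dragged object

# The boat example from above: one object dragged along a rightward path.
condition = VideoCondition(
    image_path="boat.png",
    text="a boat sailing in the lake",
    trajectories=[[(120, 300), (200, 290), (320, 282), (440, 275)]],
)
print(condition)
```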
Released on Hugging Face
In version 1.5 of DragNUWA, which has just been released on Hugging Face, Microsoft has tapped Stability AI's Stable Video Diffusion model to animate an image, or an object within it, along a specified path. Once mature, this technology could make video generation and editing a piece of cake: imagine being able to transform backgrounds, animate images and direct motion paths just by drawing a line here or there.
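For readers who want to experiment, the released weights can in principle be fetched with the standard Hugging Face client, as sketched below; the repository ID is a placeholder assumption, so check the official project page for the real name before running it.

```python
# Sketch: downloading the open-sourced DragNUWA 1.5 weights from Hugging Face.
# The repo_id is a placeholder assumption; look up the official repository
# name on the project page before running this.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<dragnuwa-repo-id>",   # placeholder, not a verified repository name
    local_dir="dragnuwa-1.5",
)
print(f"Model files downloaded to {local_dir}")
```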
AI enthusiasts are excited about the development, with many calling it a big leap in creative AI. However, it remains to be seen how the research model performs in the real world. In its tests, Microsoft claimed that the model was able to achieve accurate camera movements and object motions with different drag trajectories.
“Firstly, DragNUWA supports complex curved trajectories, enabling the generation of objects moving along the specific intricate trajectory. Secondly, DragNUWA allows for variable trajectory lengths, with longer trajectories resulting in larger motion amplitudes. Lastly, DragNUWA has the capability to simultaneously control the trajectories of multiple objects. To the best of our knowledge, no existing video generation model has effectively achieved such trajectory controllability, highlighting DragNUWA’s substantial potential to advance controllable video generation in future applications,” the company researchers noted in the paper.
The work adds to the growing mountain of research in the AI video space. Just recently, Pika Labs made headlines by opening access to its text-to-video interface that works just like ChatGPT and produces high-quality short videos with a range of customizations on offer.