OpenAI on Thursday announced Sora, a brand new model that generates high-definition videos up to one minute in length from text prompts. Sora, which means “sky” in Japanese, won’t be available to the general public any time soon. Instead, OpenAI is making it available to a small group of academics and researchers who will assess its potential for harm and misuse.
“Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background,” the company said. “The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.”
One of the videos generated by Sora that OpenAI shared on its website shows a couple walking through a snowy Tokyo city as cherry blossom petals and snowflakes blow around them.
Another shows realistic-looking wooly mammoths walking through a snowy meadow against a backdrop of snow-clad mountain ranges.
Prompt: “Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance… pic.twitter.com/Um5CWI18nS
— OpenAI (@OpenAI) February 15, 2024
OpenAI says that the model works as a result of a “deep understanding of language,” which lets it interpret text prompts accurately. Still, like virtually every AI image and video generator we’ve seen, Sora isn’t perfect. In one of the examples, the prompt asks for a video of a Dalmatian looking through a window and people “walking and cycling along the canal streets,” but the generated video omits the people and the streets entirely. OpenAI also warns that the model can struggle to understand cause and effect: it can generate a video of a person eating a cookie, for instance, but the cookie may not have bite marks.
Sora isn’t the first text-to-video model around. Several other companies have either teased text-to-video tools or made them available to the public. Still, no other tool is currently able to generate videos as long as 60 seconds. Sora also generates entire videos at once, instead of putting them together frame by frame like other models, which ensures that subjects in the video stay the same even when they temporarily go out of view.
The rise of text-to-video tools has sparked concerns that they could make it easier to create realistic-looking fake footage. “I am absolutely terrified that this kind of thing will sway a narrowly contested election,” Oren Etzioni, a professor at the University of Washington who specializes in artificial intelligence, and the founder of True Media, an organization that works to identify disinformation in political campaigns, told The New York Times. Generative AI more broadly has also sparked backlash from artists and creative professionals concerned about the technology being used to replace jobs.
OpenAI said that it was working with experts in areas like misinformation, hateful content and bias to test the tool before making it available to the public. The company is also building tools capable of detecting videos generated by Sora and is including metadata in the generated videos for easier detection. The company declined to tell the Times how Sora had been trained, except to say that it used both “publicly available videos” and videos licensed from copyright holders.