OpenAI’s new — and first! — video-generating model, Sora, can pull off some genuinely impressive cinematographic feats. But the model’s even more capable than OpenAI initially made it out to be, at least judging by a technical paper published this evening.
The paper, titled “Video generation models as world simulators,” co-authored by a host of OpenAI researchers, peels back the curtains on key aspects of Sora’s architecture — for instance revealing that Sora can generate videos of an arbitrary resolution and aspect ratio (up to 1080p). Per the paper, Sora’s able to perform a range of image and video editing tasks, from creating looping videos to extending videos forwards or backwards in time to changing the background in an existing video.
But most intriguing to this writer is Sora’s ability to “simulate digital worlds,” as the OpenAI co-authors put it. In an experiment, OpenAI set Sora loose on Minecraft and had it render the world — and its dynamics, including physics — while simultaneously controlling the player.
“These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them,” the coauthors write.
Now, Sora’s usual limitations apply in the video game domain. The model can’t accurately approximate the physics of basic interactions like glass shattering. And even with interactions it can model, Sora’s often inconsistent — for example rendering a person eating a burger but failing to render bite marks.
Still, if I’m reading the paper correctly, it seems Sora could pave the way for more realistic — perhaps even photorealistic — procedurally generated games. That’s in equal parts exciting and terrifying (consider the deepfake implications, for one) — which is probably why OpenAI’s choosing to gate Sora behind a very limited access program for now.
Here’s hoping we learn more sooner rather than later.