Sora follows a trajectory similar to that of the large language models (LLMs) powering text-based products such as ChatGPT. Where LLMs break text into tokens, small chunks of words or sub-words that serve as the basic units for training and processing, Sora relies on patches. “At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches,” OpenAI explains.
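To make that concrete, here is a minimal sketch of how a compressed video latent could be cut into spacetime patches. The tensor shapes, channel counts, and patch sizes are assumptions for illustration; OpenAI has not published Sora's actual dimensions.

```python
import numpy as np

# Hypothetical latent video: 16 latent frames, a 32x32 spatial grid, 8 channels.
# The real shapes used by Sora are not public.
latent = np.random.randn(16, 32, 32, 8)   # (time, height, width, channels)

# Assumed patch size: 2 latent frames x 4x4 latent pixels.
pt, ph, pw = 2, 4, 4
T, H, W, C = latent.shape

# Cut the latent volume into non-overlapping spacetime patches,
# then flatten each patch into a single vector, the video analogue of a token.
patches = (
    latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
          .transpose(0, 2, 4, 1, 3, 5, 6)
          .reshape(-1, pt * ph * pw * C)
)

print(patches.shape)  # (512, 256): 8*8*8 patches, each a 256-dim vector
```

Each flattened patch then plays the same role for Sora that a token embedding plays for an LLM: a fixed-size unit the transformer can attend over.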

At its heart, Sora is a diffusion model: it is trained on noisy input data (patches, in this case) and learns to recover the clean patches that form the final video. The underlying architecture is a transformer rather than the GAN-based designs used by earlier text-to-video models. In a nutshell, Sora is a hybrid, or as OpenAI likes to call it, a diffusion transformer.
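The sketch below shows what a single denoising step of such a diffusion transformer might look like. This is a toy model built from standard PyTorch layers; the layer sizes, timestep conditioning, and absence of text conditioning are all simplifications, not Sora's actual design.

```python
import torch
import torch.nn as nn

class PatchDenoiser(nn.Module):
    """Toy diffusion transformer: predicts the noise added to a sequence of
    spacetime patches. All sizes here are illustrative assumptions."""

    def __init__(self, patch_dim=256, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.t_embed = nn.Linear(1, d_model)   # diffusion timestep conditioning
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, patch_dim)

    def forward(self, noisy_patches, t):
        # noisy_patches: (batch, num_patches, patch_dim), t: (batch, 1)
        x = self.embed(noisy_patches) + self.t_embed(t).unsqueeze(1)
        return self.head(self.blocks(x))

model = PatchDenoiser()
noisy = torch.randn(2, 512, 256)   # a batch of noisy patch sequences
t = torch.rand(2, 1)               # random diffusion timesteps
predicted_noise = model(noisy, t)  # same shape as the input patches
# Training would minimize the error between predicted_noise and the noise
# actually added, then generation runs this denoising step repeatedly.
```

At generation time, the model starts from pure noise and applies many such denoising steps until clean patches emerge, which are then decoded back into video frames.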

Sora also tackles some of the hardest problems in AI video generation, particularly context-aware frame generation in 3D space from both static and moving viewpoints. The model keeps people, animals, and objects consistent as they move through a three-dimensional scene, even when they are temporarily occluded or leave the frame. It can also render the same character from multiple angles within a single video while preserving their appearance throughout. The camera can pan and orbit smoothly, with people and elements in the scene moving coherently through the three-dimensional environment.
