New AI Video Goes Hard At Open AI!

Theoretically Media
29 Apr 2024 · 11:15

TLDR: The video discusses a new AI video generator called Vidu, which is being compared to the anticipated Sora model. Vidu, developed by Shengshu Technology and Tsinghua University, can produce high-quality 16-second clips at 1080p. The video showcases a sizzle reel and longer examples of Vidu's output, highlighting its temporal coherence and its Universal Video Transformer (UViT) architecture. While not as detailed as Sora, Vidu delivers impressive results, especially in its handling of camera movement and background consistency. The video also touches on the challenges of creating realistic AI-generated video and the need for post-production work. There is a sign-up link for Vidu, but it appears to be temporarily broken due to high demand.

Takeaways

  • 🎬 The new AI video generator, possibly named 'Vu', is being compared to the yet-to-be-released Sora model.
  • πŸ“Ί 'Vu' can generate video clips up to 16 seconds at 1080p resolution, as showcased in their Sizzle reel.
  • πŸ€– Developed by Shinu technology and Singua University, 'Vu' aims to compete with Sora in terms of video generation quality.
  • 🧠 The architecture of 'Vu' is based on the Universal Video Transformer (UvIT), which combines Vision Transformers and U-Net for improved image generation.
  • πŸ”— UvIT treats all elements as tokens and uses long skip connections, allowing it to chart a path between the first and last frames of a video.
  • πŸ“Ή Examples of 'Vu' outputs show temporal coherence and detailed background elements, although not as detailed as Sora's outputs.
  • πŸ†š In a side-by-side comparison, Sora's videos tend to have more action and clearer definition, but 'Vu' also presents a realistic environment.
  • 🌟 Both 'Vu' and Sora are capable of creating compelling imagery, though Sora may have a slight edge in terms of realism.
  • πŸ“½ A production process involving human effort is still necessary to achieve semi-consistency in AI-generated videos, as demonstrated in the short film 'Airhead'.
  • πŸ”— For those interested, there is a signup link for 'Vu' on their website, although it might be temporarily unavailable due to high demand.
  • πŸ“Œ The integration of Sora into Adobe Premiere and future plans for After Effects are discussed in an exclusive interview with Adobe.

Q & A

  • What is the name of the new AI video generator discussed in the transcript?

    -The new AI video generator is called Vidu, developed by Shengshu Technology and Tsinghua University.

  • What is the maximum duration and resolution that the AI video generator can produce?

    -The AI video generator can produce clips up to 16 seconds at 1080p resolution.

  • What is the architecture of the AI video generator based on?

    -The architecture is based on UViT, or Universal Video Transformer, which builds on two separate papers: DPM-Solver and 'All Are Worth Words'.

  • How does the Universal Video Transformer (UViT) treat different elements in video generation?

    -UViT treats everything, from time to specific conditions, as tokens and uses long skip connections, which the presenter says allow it to chart a path between the first and last frames of the video.

  • What is the difference between the video generation approaches of Sora and UViT?

    -According to the presenter, Sora creates videos by generating temporal spaces, whereas UViT has a defined in point and out point and works out the transitions between them, which helps avoid the warping, hallucinatory artifacts seen in earlier AI video generators.

  • What is the significance of the longer-runtime examples of Vidu's output?

    -The longer-runtime examples demonstrate Vidu's ability to maintain temporal coherence and generate detailed visuals, showcasing its potential as a competitive AI video generator.

  • How does the video output of Vidu compare to Sora in terms of realism and aesthetics?

    -While Vidu's output looks very good and maintains temporal coherence, it may not be as detailed or cinematically realistic as Sora's output. However, Vidu's aesthetic, particularly its Midjourney V4 look, is appreciated for its surreal quality.

  • What are some of the challenges faced by AI video generators like Sora and Vidu?

    -Challenges include maintaining temporal coherence, generating detailed and realistic visuals, and avoiding hallucinatory effects. Additionally, post-production work is often required to clean up and refine the generated footage for a polished final product.

  • How was Sora used in the short film 'Air Head'?

    -Sora was used to generate the initial video footage for 'Air Head', which then required significant post-production work, including cleaning up the footage, script writing, editing, voice-over, music, sound design, color correction, and other typical post-production processes.

  • What is the current status of the sign-up link for Vidu on their website?

    -As of the time of the transcript recording, the sign-up link on Vidu's website appears to be broken, possibly due to high traffic. It is suggested to try again after a day or two if it does not work.

  • What is the significance of the 'Tokyo walk' sequence in the comparison between Vidu and Sora?

    -The 'Tokyo walk' sequence is used to illustrate the comparative quality of the video outputs from Vidu and Sora. Despite the short clip length, it shows that both models can produce fairly comparable results, although there are inherent challenges in the gait and realism of the generated footage.

  • What does the transcript suggest about the future of AI video generation technology?

    -The transcript suggests that AI video generation technology, even in its current state, can be used to create compelling imagery. It also highlights the potential for integration into professional tools like Adobe Premiere and future plans for enhancements in post-production software like After Effects.

Outlines

00:00

🚀 Introduction to a Potential Sora Rival: Vidu

The video introduces a new AI video generator named Vidu, which is being positioned as a potential competitor to Sora even though Sora has not yet been released. The presenter notes the irony and then walks through Vidu's features, including its ability to generate 16-second clips at 1080p resolution. The video showcases Vidu's sizzle reel, highlighting its direct references to Sora's initial video release. Vidu's architecture is based on the Universal Video Transformer (UViT), which combines Vision Transformers for image analysis with a U-Net model for image generation. According to the presenter, this gives Vidu a clear start and end frame, potentially avoiding the hallucinatory effects seen in some AI video generators. The presenter also shares a few examples of longer video outputs from Vidu, noting the quality and temporal coherence of the clips.

05:02

🎥 Analysis of Longer Vidu Outputs and Comparison with Sora

The presenter provides an in-depth analysis of longer video outputs from the Vidu model, comparing them with Sora. Vidu's outputs are described as high-quality, with temporal coherence and detailed visuals, although not as detailed as Sora's. The presenter appreciates the Midjourney V4 look of the Vidu outputs. In the side-by-side comparison, Sora's videos are more action-packed and detailed, but Vidu's outputs are still impressive and create the sense of a real place. The presenter acknowledges that both models have their strengths and that the examples shown are cherry-picked. They also discuss the effort required to clean up Sora footage for a finished film, highlighting the post-production process involved in using AI-generated video.

10:05

📚 Utilizing AI in Filmmaking and Future of Vidu

The video concludes with a discussion of the practical application of AI in filmmaking, referencing Paul Trillo's VFX breakdown and his use of AI imagery in his short film 'Notes to My Future Self'. The presenter describes the process of integrating AI-generated images with live-action footage and the various tools used to enhance the scenes. The presenter also provides a sign-up link for Vidu, noting that there may be temporary issues due to high demand. The video ends with a teaser for an upcoming interview with Adobe about Sora's integration into Premiere and future plans for After Effects.

Keywords

Sora

Sora is an AI video generation model that is referenced as a benchmark for comparison in the video. It is used to gauge the capabilities of the new AI video generator discussed, which is trying to reach or surpass the quality of Sora's outputs. The video mentions Sora's ability to create temporal spaces for its videos, a feature that distinguishes it from the new model being discussed.

Vidu

Vidu, transcribed as 'Vu' in places, is a new AI video generator and the main subject of the video. It is capable of generating video clips of up to 16 seconds at 1080p resolution. The video discusses its potential to compete with Sora and provides examples of its output quality.

Universal Video Transformer (UViT)

UViT stands for Universal Video Transformer, the architecture on which Vidu is based. It builds on two separate papers, DPM-Solver and 'All Are Worth Words', and it treats all elements of a video as tokens, using long skip connections to maintain coherence throughout the video generation process.
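
The token-and-skip-connection idea described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Vidu's or the paper's implementation: the helper names (`patchify`, `build_tokens`, `toy_block`, `fuse`, `toy_uvit_forward`) are hypothetical, the "embedding" of the timestep and condition is just a repeated number, and `toy_block` is a stand-in for a real transformer block. The averaging in `fuse` plays the role of the learned projection a real model would apply to the concatenated shallow and deep features.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            tokens.append([image[r + i][c + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def build_tokens(image, timestep, condition, patch=2):
    """Put time, condition, and image patches into one token sequence."""
    dim = patch * patch
    time_token = [float(timestep)] * dim   # toy embedding (assumption)
    cond_token = [float(condition)] * dim  # toy embedding (assumption)
    return [time_token, cond_token] + patchify(image, patch)


def toy_block(tokens, scale=0.5):
    """Stand-in for a transformer block: mix each token with the sequence mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[scale * (t[d] + mean[d]) for d in range(dim)] for t in tokens]


def fuse(deep, shallow):
    """Long skip connection: merge deep features with saved shallow features."""
    return [[0.5 * (a + b) for a, b in zip(td, ts)]
            for td, ts in zip(deep, shallow)]


def toy_uvit_forward(tokens, depth=2):
    """Shallow blocks save their outputs; deep blocks consume them in reverse."""
    skips = []
    x = tokens
    for _ in range(depth):          # shallow half
        x = toy_block(x)
        skips.append(x)
    x = toy_block(x)                # middle block
    for _ in range(depth):          # deep half
        x = fuse(x, skips.pop())    # last saved output is used first
        x = toy_block(x)
    return x


if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    tokens = build_tokens(image, timestep=10, condition=3, patch=2)
    print(len(tokens))  # 2 extra tokens + 4 image patches = 6
    print(len(toy_uvit_forward(tokens)))
```

The point of the sketch is only the data flow: every input becomes a token in the same sequence, and each deep layer receives the output of its mirror-image shallow layer.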

DPM-Solver

DPM-Solver is one of the two papers that contribute to the UViT architecture. In the video, it is said to help diffusion models make better predictions about what to generate next. The presenter describes it as a math-intensive paper, reflecting the technical nature of its contribution to video generation.
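
For readers curious about the math, DPM-Solver was published as a fast solver for sampling from diffusion models in a small number of steps. The sketch below shows a single first-order update as it is commonly written, assuming a noise schedule where a noisy sample is `alpha * x0 + sigma * noise`. The function name `dpm_solver_1_step` and its parameter names are illustrative choices, and nothing in the source states how Vidu actually configures its sampler.

```python
import math


def dpm_solver_1_step(x_s, eps_pred, alpha_s, sigma_s, alpha_t, sigma_t):
    """One first-order DPM-Solver update from noise level s to noise level t.

    x_s       current noisy sample (a scalar here, for simplicity)
    eps_pred  the model's noise prediction at level s
    alpha_*   signal scale at each level, sigma_* noise scale at each level
    """
    lam_s = math.log(alpha_s / sigma_s)  # half log-SNR at the current level
    lam_t = math.log(alpha_t / sigma_t)  # half log-SNR at the target level
    h = lam_t - lam_s                    # step size in log-SNR space
    return (alpha_t / alpha_s) * x_s - sigma_t * math.expm1(h) * eps_pred


if __name__ == "__main__":
    # With a perfect noise prediction, one step lands exactly on the
    # sample that the schedule defines at the target level.
    x0, noise = 2.0, 0.5
    x_s = 0.6 * x0 + 0.8 * noise
    x_t = dpm_solver_1_step(x_s, noise, 0.6, 0.8, 0.8, 0.6)
    print(abs(x_t - (0.8 * x0 + 0.6 * noise)) < 1e-9)
```

In practice the noise prediction comes from a neural network and is imperfect, which is why higher-order variants and multiple steps are used.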

All Are Worth Words

This is the second paper that contributes to the UViT architecture. While still complex, it is less math-intensive than DPM-Solver and more accessible to the narrator. It is part of the foundation that allows UViT to combine Vision Transformers with a U-Net model for improved image generation.

Vision Transformers

Vision Transformers are a type of model that excels at analyzing and understanding images. In the context of the video, they are integrated into the UViT architecture to help with the recognition and processing of visual information within the video generation process.

U-Net

U-Net is an older model architecture that is good at generating images. In the UViT architecture, it is combined with Vision Transformers to create a powerful system for video generation. The U-Net contributes the image generation capabilities of the new AI video generator.

Temporal Coherence

Temporal coherence refers to the consistency and smooth transition of visuals over time within a video. The video suggests that the UViT architecture maintains temporal coherence by knowing the first and last frames of the video, allowing it to chart a path between them, unlike traditional AI video generators.
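
As a rough illustration of what "consistency over time" can mean numerically, the sketch below computes the average pixel change between consecutive frames. This is a crude proxy chosen for illustration only; it is not a metric used in the video or by Vidu, and the function name `mean_abs_frame_diff` is hypothetical. Frames are represented as flat lists of pixel intensities.

```python
def mean_abs_frame_diff(frames):
    """Average absolute pixel change between consecutive frames.

    frames: list of frames, each a flat list of pixel intensities
    of the same length. Lower values mean less frame-to-frame change.
    """
    total = 0.0
    count = 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    steady = [[0.0, 0.5, 1.0]] * 3                    # nothing changes
    drift = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]      # slow, smooth change
    flicker = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]    # abrupt flashing
    print(mean_abs_frame_diff(steady))
    print(mean_abs_frame_diff(drift) < mean_abs_frame_diff(flicker))
```

A low score alone does not prove good video, since a frozen image scores perfectly; the measure only flags abrupt flicker or warping between neighboring frames.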

Sizzle Reel

A sizzle reel is a short promotional video that showcases the best moments or highlights of a project. In the video, it is used to introduce the capabilities of the new AI video generator, Vidu, with clips that reference the initial Sora video release.

Free Bird

'Free Bird' is a song by the rock band Lynyrd Skynyrd. In the video, it is humorously mentioned as the song that a panda is playing on a guitar in one of the AI-generated clips, showcasing the creative and imaginative side of the AI's output.

Tokyo Walk Sequence

The Tokyo Walk Sequence is a specific example of an AI-generated video clip shown in the video. It is used to compare the capabilities of the new AI video generator, Vidu, with the Sora model, noting the differences in the quality and realism of the generated environment.

Highlights

A new AI video generator, potentially rivaling Sora, has been released.

The AI can generate clips up to 16 seconds at 1080p resolution.

The model is developed by Shengshu Technology and Tsinghua University.

Vidu's architecture is based on the Universal Video Transformer (UViT).

UViT combines Vision Transformers with a U-Net model for image generation.

UViT treats all aspects from time to specific conditions as tokens.

Long skip connections allow UViT to maintain coherence between the first and last frames of a video.

Vidu's output is compared to Sora, with a focus on temporal coherence and detail.

A full 16-second clip showcases Vidu's ability to maintain temporal coherence in visuals.

Vidu's generated panda playing guitar demonstrates impressive background coherence and shadow reactivity.

A beach vacation villa clip from Vidu shows an interesting dissolve effect between shots.

Vidu's imaginative output includes a ship in a bedroom, reacting correctly to water movements.

A side-by-side comparison with Sora reveals differences in action clarity and environment realism.

Despite the comparison, both Vidu and Sora are noted as very good video generators.

The Tokyo walk sequence from Vidu shows the model's capability in generating realistic movements.

Sora's video generation process requires significant post-production work for consistency.

AI tools are being used to create compelling imagery, as demonstrated by Paul Trillo's VFX breakdown.

Vidu has a signup link on their website, but the submit button may be temporarily broken due to high traffic.

The integration of Sora into Adobe Premiere and future plans for After Effects are discussed in an exclusive interview.