Google's LUMIERE AI Video Generation Has Everyone Stunned | Better than RunWay ML?

AI Unleashed - The Coming Artificial Intelligence Revolution and Race to AGI
24 Jan 2024 · 21:06

Summary

TLDR: The script introduces Lumiere, Google's latest AI tool, capable of generating video from text. Lumiere not only turns text into video, it also lets you animate existing images, create videos in the style of a reference image or painting, and animate specific regions within images. The technology behind Lumiere, based on a space-time diffusion model, ensures temporal coherence in the generated videos. The video also explores how AI models generate video and whether they learn anything beyond surface statistics, which could reshape the future of video production and how AI interacts with the physical world.

Takeaways

  • 🌟 Google released Lumiere, an AI tool that turns text into video.
  • 🎨 Lumiere can animate existing images and create videos in the style of a specific image or painting.
  • 🤖 Lumiere generates videos with improved temporal consistency, meaning scenes stay coherent across frames.
  • 📜 Google published a paper explaining the improvements in Lumiere, including a space-time diffusion model for realistic video generation.
  • 🔮 Lumiere uses a Space-Time U-Net architecture that plans the entire video up front rather than generating it frame by frame.
  • 🎭 Beyond text-to-video, Lumiere offers features such as animating specific regions within images and video inpainting.
  • 📹 Lumiere can also perform video stylization, restyling visual elements as needed.
  • 🔍 The video discusses how AI models generate images and video, and the debate over whether they learn only surface statistics or develop a deeper understanding.
  • 🏆 In user studies, Lumiere beats other state-of-the-art models in user preference for both text-to-video and image-to-video generation.
  • 🚀 AI video production is progressing quickly and could significantly impact the film and TV industry in the coming decades.
  • 🌐 Runway ML, another leading AI video generation model, is working on general world models to improve coherence and realism in generated content.

Q & A

  • What is Google's Lumiere and how does it work?

    -Lumiere is an AI tool released by Google focused on generating video from text. It works through a neural-network model that translates text into video, and it can also animate existing images and create videos in the style of a specific image or painting, among other capabilities.

  • What is the space-time diffusion model and how does it relate to Lumiere?

    -The space-time diffusion model is the realistic video generation approach used in Lumiere. It generates the entire temporal duration of a video at once, which helps maintain global temporal consistency, unlike other models that work frame by frame.

  • How does Lumiere improve temporal consistency in generated videos?

    -Lumiere achieves better temporal consistency through its Space-Time U-Net architecture, which generates the entire duration of the video from the start instead of creating distant keyframes and then filling in the temporal sequence, an approach that often leads to drastic changes and a lack of coherence in the video.

  • What is video inpainting and how does Lumiere implement it?

    -Video inpainting is a technique in which a missing portion of an image is filled in by the AI to complete the scene. Lumiere implements it by having the AI infer and complete the scene from cues in the visible portion of the image, enabling a more complete visual narrative.

  • How does Lumiere handle animating specific regions within images?

    -Lumiere can animate specific regions within an image, a feature known as cinemagraphs, where only certain elements of the image move, creating the illusion of life in an otherwise static scene.

  • What is a 'world model', and how does it relate to the future of AI according to Runway ML?

    -A world model is an AI system that builds an internal representation of an environment and uses it to simulate future events within that environment. Runway ML argues that the next major advance in AI will come from systems that understand the visual world and its dynamics, and it is pushing toward models that simulate entire worlds and capture reality in greater depth.

  • How does Lumiere compare with other video generation models in terms of user preference?

    -According to the studies mentioned in the script, videos generated by Lumiere are preferred by users over other current video generation models, such as Pika and Gen-2, in both text-to-video and image-to-video generation.

  • What is 'stylized generation' and how does Lumiere implement it?

    -Stylized generation is an AI model's ability to create videos or images in a specific style, such as that of a painting or reference image. Lumiere implements this by using a reference image to influence the look and style of the generated video.

  • How does Lumiere help simplify video production for ordinary people?

    -Lumiere helps simplify video production by letting people create high-quality content without the traditional financial barriers. AI-generated video and voices, together with assistance in story writing, allow anyone with creative talent to build visual narratives without large investments.

  • According to the script, how has the quality of AI video generation changed over the past year?

    -According to the script, the past year has seen major advances in the quality of AI video generation. A year or a year and a half ago, AI-generated videos showed blocky shapes and lacked coherence between scenes, whereas videos generated by Lumiere and other current models are far more consistent and realistic.

  • What role do generative models play in AI understanding, and how does this relate to the research discussed?

    -Generative models are AI models capable of creating original content from input data. Google's research, together with other studies such as the Harvard "Beyond Surface Statistics" paper, seeks to understand whether these models learn more than surface statistics, that is, whether some deeper form of understanding is developing in the AI beyond simple correlations between pixels and words.

Outlines

00:00

🚀 Launch of Lumiere, Google's new AI tool

Google has launched Lumiere, an artificial intelligence tool that turns text into video. The AI model not only translates text into video, but can also animate existing images, create videos in the style of a reference image or painting, and animate specific regions within images. Lumiere also uses a space-time diffusion model to improve temporal consistency in the video. Videos generated from prompts like "US flag waving" or "a bear walking in New York" show strong consistency and quality. This section also covers the improvements described in Google's research and the strangeness of how AI neural networks generate these videos.

05:00

🎨 Transforming video production with Lumiere

The script explores how Lumiere could change video production for films and TV series by letting people create high-quality content at home. It mentions a Harvard study titled "Beyond Surface Statistics", which asks whether AI models learn more than surface statistics when creating images and videos. The study suggests these models may be developing a deeper understanding of scene geometry despite being trained only on 2D images, which could be a new dimension of what deep learning systems actually learn.

10:01

🤖 Advances in video generation and world models

This section covers advances in video generation with tools like Runway ML and how these platforms are introducing world models to improve the coherence and realism of AI-generated video. Runway ML is running a long-term research effort to develop "General World Models" that understand the visual world and its dynamics, which could be the next major advance in AI. Lumiere's results are also compared with other leading models in the industry, showing its edge in consistency and quality.

15:03

📊 Comparing video generation models

The script presents a detailed comparison of the Lumiere, Pika, and Gen-2 video generation models, showing how Lumiere beats its competitors on several fronts, such as temporal consistency and image quality. Example prompts are included along with how each model interprets them, highlighting Lumiere's ability to stay coherent and accurately represent the elements of the prompt in the generated video.

20:05

🌟 The coming revolution in AI content creation

The script concludes by highlighting the importance of AI tools like Lumiere for future content creation. It suggests the next generation of content creators could use these tools to produce high-quality art and narratives without traditional financial barriers. It also mentions the possibility of using simulations to create stories where characters and settings unfold organically, letting creators pick out the most compelling moments and narratives.

Keywords

💡Lumiere

Lumiere is Google's latest artificial intelligence (AI) tool, focused on generating video from text. It is central to the video's topic, since it represents the state of the art in creating visual content from text input. The script shows how Lumiere can turn text prompts into video, such as "US flag waving" or "a pug dog feeling good listening to music", demonstrating its ability to generate coherent, realistic video content.

💡Space-time diffusion model

This technical concept refers to Lumiere's approach to video generation. The space-time diffusion model lets the AI consider the full temporal span of a video from the start, in contrast with models that work frame by frame. The goal is greater temporal coherence in the generated videos, as the script notes when discussing the improved "temporal consistency" Google reports.

💡Temporal consistency

Temporal consistency refers to the coherence and continuity of objects and actions across a video. It is a key requirement for realistic AI video generation, since it keeps scenes understandable and in line with the narrative. The script highlights how Lumiere achieves this consistency, unlike other models that can show abrupt changes or incoherence over time.

💡Image to video

Lumiere's ability to turn images into animations, illustrated with examples like "a bear walking through New York" or "Bigfoot walking through the woods", is an important aspect of the model. This feature lets creators build visual narratives out of a static image, expanding the creative possibilities in content generation.

💡Video stylization

Video stylization refers to Lumiere's ability to apply a specific style to a video based on a reference image. It lets content creators give their videos a distinctive look, as shown in the script with the example of a bear "twirling with delight" in the style of a provided image.

💡Cinemagraph

A cinemagraph is a type of image that combines static elements with animated regions to create the illusion of subtle motion. The script notes that Lumiere can perform this technique, animating only certain parts of an image, such as the smoke coming out of a train, which adds a distinctive touch of realism and artistry to static images.

💡Video inpainting

Video inpainting is a technique that lets the AI complete videos or images with missing portions, producing a plausible rendering of what should be in the absent area. In the script, this is illustrated with a scene where the subject's hand moves in and out of frame and the AI model infers and fills the missing area with green leaves, demonstrating its ability to "guess" and complete the scene.

💡Deep learning

Deep learning is a subfield of artificial intelligence focused on building complex neural networks capable of learning and improving from experience. In the context of the video, it refers to how AI models like Lumiere may be learning more than surface statistics, possibly developing an underlying understanding of objects and their spatial relationships.

💡General world models

General world models are an advanced AI concept mentioned in the script, where the next major advance in machine learning is expected to come from systems that understand the visual world and its dynamics. These models aim to simulate the world as a whole, which would let an AI generate more realistic and coherent content, illustrated by the idea that an AI model could simulate physics and motion in order to create high-quality video.

💡Content generation

Content generation is the process by which AI creates original material, such as text, images, video, or sound, from basic inputs or instructions. This concept is central to the script, which explores how tools like Lumiere are revolutionizing content creation by generating it automatically, with implications for the entertainment and film industry.

Highlights

Google launches its new artificial intelligence tool Lumiere, a text-to-video AI model.

Lumiere can animate existing images and create videos in the style of an image or painting.

Google published a paper on the improvements in Lumiere, including a space-time diffusion model for realistic video generation.

Lumiere delivers temporal consistency across frames, a significant improvement over other models.

Lumiere's image-to-video capability is shown through examples like a bear walking in New York.

Lumiere can generate stylized output from a reference image, creating videos with stylistic consistency.

A new Space-Time U-Net architecture is introduced that plans the generation of the entire video at once rather than frame by frame.

Lumiere also supports video stylization, transforming source videos into different styles.

Lumiere's cinemagraph feature can animate only certain parts of an image, such as the smoke from a train.

Lumiere's video inpainting uses the AI to complete missing parts of an image, such as green leaves.

Lumiere can change a character's clothing across multiple shots based on a text prompt.

The video questions how AI models turn concepts into images, suggesting they may be learning something deeper than surface statistics.

A study, "Beyond Surface Statistics", explores whether AI models learn a deeper understanding of objects and position.

AI models appear able to form internal representations related to scene geometry despite being trained only on 2D images.

Runway ML, another leading text-to-video AI model, enables the creation of fully AI-generated films.

Runway ML introduces general world models to improve coherence and realism in video generation.

Lumiere is compared with other state-of-the-art models, showing a significant user preference for the quality of Lumiere's videos.

Lumiere offers a major improvement in the consistency and realism of AI-generated video compared with earlier technology.

The pace of progress in AI suggests a future where people can easily create Hollywood-style films with AI tools.

Transcripts

[00:00] And just like that, out of the blue, Google drops its latest AI tool, Lumiere. Lumiere is at its core a text-to-video AI model: you type in text and the AI neural nets translate that into video. But as you'll see, Lumiere is a lot more than just text to video. It allows you to animate existing images, create video in the style of that image or painting, and do things like video inpainting and animating specific sections within images. So let's look at what it can do and the science behind it; Google published a paper talking about what they improved, and I'll also show you why the artificial AI brains that generate these videos are much weirder than you can imagine.

[00:52] So this is Lumiere from Google Research, "A Space-Time Diffusion Model for Realistic Video Generation." We'll cover the space-time diffusion model a bit later, but right now this is what they're unveiling. First of all, there's text to video. These are the videos produced by various prompts like "US flag waving on massive sunrise clouds," "funny cute pug dog feeling good listening to music with big headphones and swinging head," "snowboarding Jack Russell Terrier," and so on. I've got to say these are looking pretty good; if these are good representations of the sort of style we can get from this model, this would be very interesting. For example, take a look at this one: "astronaut on the planet Mars making a detour around his base." This is looking very consistent. This one looks like a tablet, a medicine tablet of some sort floating in space, but everything is looking very consistent, which is what they're promising in their research. It looks like they found a way to create a more consistent shot across different frames: temporal consistency, as they call it.

[01:50] Here's image to video. As you can see, this one is nightmarish, but that's the scary-looking one; other than that, everything else is looking really good. They're taking images and turning them into little animations: a bear walking in New York, for example, or Bigfoot walking through the woods. These started with an image that then gets animated, and they're looking pretty good. Here are the Pillars of Creation animated right there, which is pretty neat, kind of a 3D structure.

[02:20] They're also showing stylized generation, using a target image to make something colorful or animated. Take a look at this elephant right here. One thing that jumps out at me is that it is very consistent; there's no weirdness going on. In a second we'll take a look at other leading AI models that generate video, and I've got to say this one is probably the smoothest-looking one. Here's another one: here's the style reference image, so they want this style, and then they say "a bear twirling with delight," for example, and it creates a bear twirling with delight, or a dolphin leaping out of the water in the style of this image. Here are the same or similar prompts with this as the style reference, and I've got to say it captures the style pretty well, that neon phosphorescent glowing look. They also introduce a Space-Time U-Net architecture, and we'll look at that towards the end of the video, but basically it sounds like it forms an idea of the entire video at once: while other models seem to go frame by frame, this one has a sense of what the whole thing is going to look like at the very beginning.

[03:24] There's also video stylization. Here's a lady running as the source video, and the various crazy things you can turn her into; the same thing with a dog, a car, and a bear. Cinemagraphs are the ability to animate only certain portions of the image, like the smoke coming out of this train. This is something that Runway ML, I believe, recently released, and it looks like Google is hot on their heels, creating basically the same ability.

[03:48] Then we have video inpainting. If a portion of an image is missing, you're able to use AI to guess at what it would look like. Here, where the hand comes in, that is very interesting, because it seems fairly advanced: notice in the beginning he throws the green leaf into the missing portion of the image, and then you see him coming back into the visible part of the image, throwing a green leaf or two, so the model assumes that, hey, the things over there will also be green leaves. Interestingly enough, though, I do feel like I can spot a mistake here: the leaves that are already there look fresh, as opposed to the cooked ones on this side. So it knows to put in fresh green leaves as the guy is throwing them, because that matches the fresh leaves here, but it misses the point that these are cooked leaves and these are fresh. Still, it's very impressive that it's able to guess at what's happening in that moment.
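
Both the cinemagraph and inpainting features come down to the same conditioning idea: a mask marks the region the model is allowed to change, and everything outside it is copied straight from the source. The NumPy sketch below shows just that compositing step; the `generated` frames stand in for the output of whatever video model actually fills the masked region, so treat this as an illustration of the masking idea rather than anything from the Lumiere paper.

```python
import numpy as np

def composite_cinemagraph(still, generated, mask):
    """Blend a generated animation into a still image.

    still:     (H, W, 3)    float array, the static source image
    generated: (T, H, W, 3) float array, frames produced by some video model
    mask:      (H, W)       float array in [0, 1], 1 = region allowed to move
    Returns a (T, H, W, 3) clip where only the masked region is animated.
    """
    mask = mask[None, ..., None]      # broadcast to (1, H, W, 1)
    still = still[None, ...]          # broadcast to (1, H, W, 3)
    return mask * generated + (1.0 - mask) * still

# Toy usage with random data standing in for a real model's output.
H, W, T = 64, 64, 16
still = np.random.rand(H, W, 3)
generated = np.random.rand(T, H, W, 3)              # hypothetical model output
mask = np.zeros((H, W))
mask[20:40, 20:40] = 1.0                             # only this box may animate
clip = composite_cinemagraph(still, generated, mask)
assert clip.shape == (T, H, W, 3)
```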

[04:37] And this is where, if you've been following some of the latest AI research, these neural nets get a little bit weird. We'll come back to that at the end, but consider how they are able to predict certain things, like what happens here, for example: no one codes it to know that this is probably a cake of some sort. Nobody tells it what this thing is. It guesses from clues that it sees on screen, and how it does that is really, really strange. Let's just say this is pretty impressive.

[05:02] So here we're able to change the clothes the person is wearing throughout these shots. Notice the hat and the face remain fairly consistent across all the shots, whereas the dress is changed based on a text prompt. As you watch this, think about where video production for movies and serial TV shows is going to be in 5 to 10 years. Will something like this allow everyday people sitting at home to create stunning Hollywood-style movies with whatever characters and settings they want, using AI-generated video and AI voices? We could create a movie starring Hugh Hefner as a chicken, for example.

[05:38] Really fast: this is another study, called "Beyond Surface Statistics," out of Harvard, so it has nothing to do with the Google project we're looking at, but this paper tries to answer the question of how these models create images and videos. As it says, these models are capable of synthesizing high-quality images, but it remains a mystery how these networks transform, say, the phrase "car in the street" into a picture of a car in a street. In other words, when a human says "draw a picture of a car in a street," or asks for a video of one, how does the model do it? How does it translate that into a picture? Do these models simply memorize superficial correlations between pixel values and words, or are they learning something deeper, such as an underlying model of objects like cars and roads and how they are typically positioned? There's a bit of an argument going on in the scientific community about this. Some AI scientists say it's all just surface-level statistics: the models are memorizing where little pixels go and are able to reproduce certain images. Others say, no, there's something deeper going on here, something new and surprising that these AI models are doing.

[06:46] So what the researchers did is create a model that was fed nothing but 2D images, images of cars and people and ships and so on, but the model wasn't taught anything about depth: not depth of field, not where the foreground or background of an image is, not what the focus of the image is, not what a car is. And here's what they found. Here's the decoded image, showing how it goes from step one to finally step fifteen, where, as you can see, this is a car. A human being would be able to point at this and say, "that's a car." What in the image is closest to you, the person taking the picture? You'd say, well, probably this wheel is the closest; this is the foreground, this is the main object, and that's the background, far away, while this is close. But the reason you can look at this image and know that is because you've seen these objects in the real, 3D world; you can probably imagine how this scene would look if you were standing off to the side, looking at it from another direction. The AI model that made this has no idea about any of that. All it has ever seen is a bunch of 2D images, just pixels arranged on a screen. And yet, when we dive in to understand how it builds these images from scratch, this is what we start to notice.

[07:59] Early on, when it's building this image, this is what its estimate of the image's depth looks like. Very early on it knows that this thing is in the foreground, closer to us, and that this region, the blue, is the background, far from us. Now, looking at this intermediate image, you can't possibly tell what it's going to be until much, much later. Maybe here we can begin to see some of the lines, and you can make out the wheels and guess at what it is, but in the beginning you have no idea. And yet the model knows that something right here is in the foreground and something else is in the background, and towards the end it knows that this is close and this is far. There is also the salient object, meaning the focus, the main object. It knows the main object is here. It doesn't know what a car is, it doesn't know what an object is, it just knows this is the focus of the image. Again, only much later do we realize that, yes, in fact this is the car.

[08:57] And so this is the conclusion of the paper: their experiments provide evidence that the Stable Diffusion model (an image-generating AI), although solely trained on two-dimensional images, contains an internal linear representation related to scene geometry. In other words, after seeing thousands or millions of 2D images, inside its neural network it seems like (and again, a lot of people dispute this, but some of this research makes it seem like) the model is developing something that allows it to build a 3D representation of an image, even though it has never been taught what 3D means. It uncovers a salient object, the main central object it needs to focus on, versus the background of the image, as well as information related to relative depth, and these representations emerge early. So before it starts painting the colors or the little shapes, the wheels and the shadows, it first starts working out the 3D space onto which it's going to paint that image. And here they say these results add nuance to the ongoing debates, and there are a lot of ongoing debates about this, about whether generative models can learn more than just surface statistics. In other words, is there some sort of understanding going on, maybe not human-like understanding, but is it just statistics, or is there something deeper happening?
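
The probing technique behind a result like this is worth seeing concretely: freeze the generative model, collect intermediate activations at an early denoising step, and fit a purely linear classifier that tries to predict per-pixel depth or foreground/background labels from them. If a plain linear map can read the geometry out of the activations, the information must already be encoded there. The sketch below uses random placeholder arrays in place of real activations and masks, and the shapes are assumptions rather than the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for real model internals:
# 'acts' would be intermediate U-Net activations at an early denoising step,
# flattened to one feature vector per pixel; 'labels' would be 1 for pixels
# marked as foreground/salient (by a human or an off-the-shelf estimator).
n_pixels, n_features = 5000, 256
rng = np.random.default_rng(0)
acts = rng.normal(size=(n_pixels, n_features))   # hypothetical activations
labels = rng.integers(0, 2, size=n_pixels)       # hypothetical masks

# Split, fit a purely linear classifier, and measure held-out accuracy.
split = int(0.8 * n_pixels)
probe = LogisticRegression(max_iter=1000)
probe.fit(acts[:split], labels[:split])
accuracy = probe.score(acts[split:], labels[split:])

# With random placeholders this hovers near 0.5; the paper's claim is that on
# real activations a linear probe does far better than chance, which is the
# evidence for an internal representation of scene geometry.
print(f"held-out probe accuracy: {accuracy:.2f}")
```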

[10:21] And this is Runway ML, one of the other leading text-to-video AI models; you might have seen the results. As you can see, this is what they're offering. People have made full movies, maybe not hour-long, but 10- or 20-minute movies that are entirely generated by AI. It's similar to what Google is offering, although I've got to say, after looking at Google's work and then this one, Google's does seem just a little bit more consistent: there seems to be a bit less shifting of shapes going on, and it holds together a little better across time. They offer a lot of the same features, like this stylization from a reference video using an image as the style reference.

[11:06] But the interesting thing here is that in the last few months, December 2023 it looks like, Runway ML introduced something they call General World Models. They're saying: we believe the next major advancement in AI will come from systems that understand the visual world and its dynamics, and they're starting a long-term research effort around what they call General World Models. Their whole idea is that instead of video AI models creating little clips here and there with isolated subjects and movements, a better approach would be to actually use the neural networks, and the world model they build to understand the images they're making, to create something like a little world. For example, if you're creating a clip with multiple characters talking, the AI model would almost simulate that entire world, with the rooms and the people; the people would talk to each other, and the model would just take that clip out of it, but it would have created much more than a clip. If a bird is flying across the sky, it would be simulating the wind and the physics and all the rest to try to capture the movement of that bird and create realistic images and video.

[12:14] So they're saying a world model is an AI system that builds an internal representation of an environment and uses it to simulate future events within that environment. For example, for Gen-2, their video model, to generate realistic short video, it has developed some understanding of physics and motion, but it's still very limited, struggling with complex camera controls or object motions, among other things. They believe, as do a lot of other researchers, that this is the next step for getting better at creating video and at teaching robots how to behave in the physical world, like Nvidia's foundation agent, for example: we need to create bigger models that simulate entire worlds, and then from those worlds we pull out what we need, whether that's an image, text, or a robot's ability to open doors and pick up objects.
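
As a rough mental model of what such a system looks like, the usual world-model pattern is: encode an observation into a latent state, roll that state forward with a learned dynamics function, and decode back to pixels only when needed. The toy skeleton below fixes only those interfaces; every function in it is a random placeholder for a learned network, and nothing here reflects Runway's actual, unpublished architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, OBS_DIM, ACTION_DIM = 32, 64 * 64 * 3, 4

def encode(obs):                    # placeholder for a learned encoder
    return rng.normal(size=LATENT_DIM)

def dynamics(state, action):        # placeholder for learned latent dynamics
    return state + 0.1 * rng.normal(size=LATENT_DIM)

def decode(state):                  # placeholder for a learned decoder
    return rng.normal(size=OBS_DIM)

def imagine_rollout(first_frame, actions):
    """Simulate future frames entirely inside the latent space."""
    state = encode(first_frame)
    frames = []
    for a in actions:
        state = dynamics(state, a)   # step the internal world forward
        frames.append(decode(state)) # render only when pixels are needed
    return np.stack(frames)

rollout = imagine_rollout(np.zeros(OBS_DIM), [np.zeros(ACTION_DIM)] * 8)
print(rollout.shape)                 # (8, 12288): eight imagined frames
```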

[13:04] All right, but now back to Lumiere, "A Space-Time Diffusion Model for Video Generation." Here they have a number of examples of text to video, image to video, stylized generation, and so on. In Lumiere they're trying to build a text-to-video diffusion model that can create videos portraying realistic, diverse and coherent motion, a pivotal challenge in video synthesis. The new thing they introduce is the Space-Time U-Net architecture, which generates the entire temporal duration of the video at once. In other words, it thinks through what the whole video is going to look like from the start, as opposed to existing video models, which synthesize distant keyframes followed by temporal super-resolution, basically meaning they do it one piece at a time: they start with some frames and then fill in the others. They argue that approach makes global temporal consistency difficult, meaning that as you watch a video of an object, it looks one way in the first second, but by second five it's completely different.
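
The architectural difference is easiest to see at the tensor level: a space-time U-Net treats the clip as one 5-D tensor and downsamples (then upsamples) along the time axis as well as the spatial axes, so every layer works on a compressed view of the entire duration, whereas a keyframe-plus-temporal-super-resolution pipeline only ever looks at a few frames at a time. The PyTorch snippet below is a toy illustration of that down/up pattern, not Lumiere's actual network.

```python
import torch
import torch.nn as nn

class TinySpaceTimeBlock(nn.Module):
    """One down/up stage that compresses space *and* time, U-Net style."""
    def __init__(self, channels=16):
        super().__init__()
        # stride (2, 2, 2) halves the temporal and both spatial dimensions
        self.down = nn.Conv3d(channels, channels, kernel_size=3,
                              stride=(2, 2, 2), padding=1)
        self.up = nn.ConvTranspose3d(channels, channels, kernel_size=4,
                                     stride=(2, 2, 2), padding=1)

    def forward(self, video):  # video: (batch, channels, time, height, width)
        coarse = torch.relu(self.down(video))  # whole clip, coarser in T, H, W
        return self.up(coarse)                 # back to the original resolution

clip = torch.randn(1, 16, 16, 64, 64)          # 16 frames of 64x64 features
out = TinySpaceTimeBlock()(clip)
print(out.shape)                               # torch.Size([1, 16, 16, 64, 64])
```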

[14:01] So here they're basically comparing two sets of videos, Imagen's and theirs. For the Lumiere model, as you can see, they sample a few clips and look at the X-T slice. You can think of the X-T slice a bit like a stock chart, the price of a stock over time: X is the spatial dimension, where things sit in the image, for example along its width, and T is the temporal dimension, how consistent that line of the image is across time.
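
In code, an X-T slice is just the video volume cut at one fixed image row: hold the row constant and stack that row from every frame, so one axis of the resulting picture is image width (X) and the other is time (T). Smooth streaks mean content drifts coherently from frame to frame; jagged breaks mean things pop in and out. A minimal NumPy version, with toy data in place of real frames:

```python
import numpy as np

def xt_slice(video, row):
    """video: (T, H, W, C) array; returns the (T, W, C) slice at image row `row`."""
    return video[:, row, :, :]

video = np.random.rand(48, 128, 256, 3)  # 48 frames of 128x256 RGB (toy data)
slice_img = xt_slice(video, row=64)      # one horizontal line tracked over time
print(slice_img.shape)                   # (48, 256, 3): width on one axis, time on the other
```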

[14:37] Following this green line, we're just looking at that one line across the entire image over time, and this is what it looks like. As you can see, this one goes pretty well, then it kind of messes up and gets crazy here, and then goes back to doing okay, whereas in Lumiere it's pretty good; maybe a little funkiness right there in one frame, but it's pretty good. Same thing here: this one is pretty good, maybe a little funkiness, but overall very good, whereas in this Imagen video, as you can see, there's a lot of nonsense happening, and you can't tell how many legs it has or whether it's missing a leg. In the Lumiere one, I feel like you can see each of the legs pretty distinctly, and their positions remain consistent across time, or at least consistently easy to follow. I've got to say, I can't wait to get my hands on it. As of right now I don't see a way to access it; this is just a preview, but hopefully they will open it up for testing soon and we'll be able to get our hands on it and check it out.

[15:45] And here, interestingly enough, they actually compare how well theirs performs against the other state-of-the-art models in the industry. The two that I'm familiar with are Pika and Gen-2; those are the two that I've used. They're saying that their video is preferred by users in both text-to-video and image-to-video generation. Blue is theirs and the baseline is the orange one, and it seems like there are pretty big differences in every single case. This one is video quality, and it beats out every single other model. This is text alignment, which here probably means how true the video is to the prompt, so if you type in a prompt, how accurately it gets represented; it looks like Imagen is the closest one there, but Lumiere beats out most of the others by quite a bit. And on video quality for image to video it seems like it beats them out as well, with Gen-2 probably being the next best.

[16:39] Here they provide a side-by-side comparison. For example, the first prompt is "a sheep to the right of a wine glass." This is Pika, which isn't great because there's no wine glass. Here's Gen-2, consistently putting it on the left. AnimateDiff just has two glasses and maybe a reflection of a sheep. ImagenVideo does the same thing, with the glass on the left. ZeroScope has no glasses that I can see, although it has sheep. And of course Lumiere, the Google one, seems to nail it every single time: the glass is on the right. Although I've got to say Gen-2 is great apart from confusing left and right, and on image to video I actually feel like Gen-2's quality is better for the sheep, because that's a good-looking sheep; I should probably rephrase that, that's a well-rendered sheep. Versus Imagen, which produces a weird-looking thing that could almost be a horse or a cow if you just look at the face, while Google's is again excellent. Here's "teddy bear skating in Times Square": this is Google, this is Imagen, again with some weirdness happening, and that's Gen-2, again pretty good, though the bear is facing away. I also just noticed they took "skating" to mean ice skates, whereas here it looks like roller skates or a skateboard. And it looks like in the study they just showed people two results and asked whether they liked the left or the right more, based on motion and quality.

[18:03] Well, I've got to say, if you're an aspiring AI cinematographer, this is really good news: consistent, coherent images that can create near-lifelike scenes at this point. I'm sure there are people who'll complain about things, but you've got to realize how quickly this stuff is progressing. Just to give you an idea, this is from about a year ago or so; this is what AI-generated video looked like. Can you tell it has improved just a little bit? That's about a year; I'm not sure exactly when this was made, but I'm going to say a year or a year and a half ago, and this thing gets nightmarish. So when I talk about weird blocky shapes and things not being consistent across scenes: what are we even looking at here, is this a mouth, is this a building? And here's something from about four months ago from Pika Labs. As you can see, it's much better, much more consistent. The humans maybe still look a little weird, but it's better; it can put you in the moment.

[19:13] If you're telling a story that isn't necessarily about everything looking realistic, something like this can be created pretty easily, and since it's new and novel, this might be a whole new movement, a new genre of filmmaking that's exciting and never before seen. Most importantly, it's easy to create at home with a few AI tools, and anybody out there with creative abilities, with the creative talent to tell the stories they have in their mind, without being limited financially by capital, is going to be able to create AI voices, create AI footage, and maybe even have ChatGPT help with some of the story writing. And beyond that, the next generation of things people are working on includes simulations where you create the characters and let them loose in a world; the stories play out in that simulated world, and then you pick and choose what to focus on, which scenes and which characters to bring to the front. You basically act as the world builder: you build the worlds, the characters, the narratives, and AI assists you in creating the visuals, the voices, and the rest. You can be 100% in control, or control only the things you want and let the AI generate everything else.

[20:39] To me, if you're interested in moviemaking and you like these sorts of styles, which by the way will quickly become much more realistic, I would be looking at this right now, because right now is the time it's emerging into the world and getting really good, and it's going to keep improving; by next year it's going to be a lot better. Well, my name is Wes Roth, and thank you for watching.


Related tags
Lumière · Google · AI · Text-to-Video · Realism · Coherence · Video Generation · Neural Nets · Cinemagraphs · World Simulation