AI art, explained

Vox
1 Jun 2022 · 13:32

Summary

TL;DR: The video script explores the evolution of AI-generated images, from the early experiments in 2015 to the present capabilities of models like DALL-E. It discusses the concept of 'prompt engineering' and the creative potential unlocked by these technologies, while also highlighting the ethical and societal implications, including biases in training data and the impact on professional artists.

Takeaways

  • 🧠 The major development in AI research in 2015 was automated image captioning, where machine learning algorithms could label objects and describe them in natural language.
  • 🔄 Researchers became curious about the reverse process, text-to-image generation, aiming to create novel scenes that didn't exist in reality.
  • 🚀 Early attempts resulted in rudimentary images, but the 2016 paper showcased the potential for future advancements in this field.
  • 🌐 The technology has advanced dramatically in a short time, with AI now capable of generating images from text prompts that were previously unimaginable.
  • 🎨 AI-generated art, like the portrait sold for over $400,000 in 2018, required specific datasets and models to mimic styles, unlike the newer, more versatile models.
  • 📈 The newer models are so large that they can't be trained on individual computers, but once trained, they can generate a wide range of images from a simple text prompt.
  • 🤖 The process of communicating with deep learning models to generate images is known as 'prompt engineering', which involves refining the text prompts to get desired results.
  • 🖼️ The models use a 'latent space' to generate images, a mathematical space with meaningful clusters representing different concepts, rather than copying from the training data.
  • 🔮 The generative process involves starting with noise and arranging pixels into a coherent image through a process called diffusion, which adds an element of randomness.
  • 🌐 The technology raises copyright and ethical questions, as it can replicate styles and generate images from biased datasets found on the internet.
  • 🔑 This technology has the potential to change the way humans imagine, communicate, and work within their culture, with both positive and negative long-term consequences.

Q & A

  • What was the major development in AI research in 2015 that led to the concept of text-to-image generation?

    -The major development was automated image captioning, where machine learning algorithms could label objects in images and put those labels into natural language descriptions.

  • What was the initial challenge faced by researchers when they attempted to create images from text descriptions?

    -The initial challenge was to generate entirely novel scenes that didn't exist in the real world, rather than retrieving existing images like a search engine does.

  • Can you describe the first attempt to generate an image from the text prompt 'the red or green school bus'?

    -The first attempt resulted in a tiny 32-by-32-pixel image that was barely recognizable: just a blob of something on top of something else.

  • What is 'prompt engineering' and why is it significant in the context of text-to-image AI?

    -Prompt engineering is the craft of communicating with deep learning models by providing the right text prompts to generate desired images. It's significant because it allows users to refine their interaction with the machine, creating a dialog that guides the AI to produce specific outputs.

  • What is the significance of the 'latent space' in the context of deep learning models used for image generation?

    -The latent space is a multidimensional mathematical space where the deep learning model organizes variables that represent different aspects of images. It allows the model to generate new images that have not been seen before, based on the navigation within this space using text prompts.

  • How does the generative process called 'diffusion' work in creating an image from a point in the latent space?

    -Diffusion starts with noise and, over a series of iterations, arranges pixels into a composition that makes sense to humans. Due to randomness in the process, the same prompt will not always result in the exact same image.

  • What is the role of a large, diverse training dataset in training image-generating AI models?

    -A large, diverse training dataset provides the AI model with a wide range of images and their text descriptions, which helps the model learn to associate concepts with visual patterns and generate new images from text prompts.

  • What ethical and copyright concerns arise with the use of AI-generated images, especially when mimicking the style of known artists?

    -Ethical and copyright concerns include the use of artists' work in datasets without their consent and the potential for AI to mimic their style, which may infringe on their intellectual property rights and raise questions about originality and attribution.

  • How does the AI's latent space reflect societal biases and cultural representation?

    -The latent space of AI models contains biases and cultural representations based on the data they were trained on, often reflecting stereotypes and underrepresentation of certain groups or cultures, as it mirrors the content and biases found on the internet.

  • What potential long-term impacts does the advancement in text-to-image AI have on creators and the way humans imagine and communicate?

    -The advancement in text-to-image AI has the potential to revolutionize the way humans create, communicate, and interact with visual content. It may lead to new forms of artistic expression, challenges in copyright and originality, and changes in the job market for creators.

  • What is the significance of the name 'DALL-E' given to the AI model announced by OpenAI, and what does DALL-E 2 promise?

    -DALL-E's name is a portmanteau of the surrealist artist Salvador Dalí and Pixar's WALL-E. DALL-E 2, its successor, promises more realistic results and seamless editing capabilities, though at the time of the video neither version had been released to the public.

Outlines

00:00

🤖 Evolution of AI Image Generation

This paragraph discusses the emergence of automated image captioning in AI research and the subsequent challenge of generating images from text descriptions. It details the initial experiments with generating novel scenes that had never been seen before, such as a red or green school bus, and the progression of this technology from rudimentary 32x32-pixel images in 2016 to highly advanced and realistic outputs, with the most dramatic gains coming in just the past year. The rapid development is highlighted through examples of AI-generated art sold at auction and the capabilities of modern models like DALL-E and Midjourney, which can create images from text prompts without the need for traditional artistic tools.

05:01

🎨 The Art of Prompt Engineering in AI Imagery

The second paragraph delves into the intricacies of 'prompt engineering,' where the input text prompts are carefully crafted to guide AI models in generating specific images. It explores the creative aspect of this process, comparing it to having a dialog with a machine to refine the desired output. The paragraph also touches on the technical side, explaining how AI models learn from vast datasets to create images not by copying existing ones but by navigating a 'latent space'—a multidimensional mathematical space where points represent potential images. The generative process of 'diffusion' is introduced, which transforms noise into coherent images, and the uniqueness of each AI-generated image due to the randomness in the process is emphasized.

10:07

🔮 Ethical and Creative Implications of AI Image Generation

The final paragraph addresses the broader implications of AI-generated imagery, including the ethical considerations surrounding the use of artists' styles without their explicit consent and the unresolved copyright issues. It raises concerns about the biases present in the training data and the potential for these biases to be reflected in the generated images. The paragraph also contemplates the impact of this technology on human imagination and communication, suggesting that it represents a significant shift in how we interact with our own culture. Finally, it invites viewers to consider the future of professional image creators in the face of advancing AI technologies and encourages further discussion on the topic.

Keywords

💡Automated Image Captioning

Automated image captioning refers to the AI technology that generates descriptive text for images. It is a part of computer vision and natural language processing. In the context of the video, it is the precursor to the text-to-image generation technology discussed, where machine learning algorithms were initially used to label objects in images and then evolved to describe those objects in natural language.

💡Text-to-Image

Text-to-image generation is a concept where a computer model generates visual content based on textual descriptions. It is central to the video's theme, illustrating the evolution from image captioning to creating novel images from text prompts. The script mentions researchers' curiosity about generating images from text, leading to the creation of AI models like DALL-E.

💡Deep Learning Models

Deep learning models are a subset of machine learning algorithms that are composed of multiple layers, allowing them to learn complex patterns in data. In the video, these models are used to generate images from text prompts, with the script highlighting their ability to produce novel scenes that have never been seen before.

💡DALL-E

DALL-E is an AI model developed by OpenAI, named after the surrealist artist Salvador Dalí and Pixar's WALL-E. It is capable of creating images from text descriptions. The video discusses the evolution of this technology, starting with DALL-E and moving to its successor, DALL-E 2, which promises more realistic image generation.

💡Prompt Engineering

Prompt engineering is the art of formulating text prompts to guide AI models in generating specific images. It is highlighted in the script as a skillful craft, where the right choice of words can significantly influence the outcome of the generated images, turning the process into an interactive dialogue between the user and the AI.
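As a rough illustration of that "dialogue": prompts are often built by stacking a subject with style modifiers like the ones mentioned in the video ("octane render", dates, printmaking techniques). The snippet below is a minimal, hypothetical Python sketch; the modifier list and the build_prompt helper are invented for illustration and are not part of any real tool.

```python
# Hypothetical sketch of refining a prompt by adding style modifiers, one per "round of dialogue".
SUBJECT = "a dancing taco"
STYLE_MODIFIERS = ["octane render", "Unreal Engine", "1960s", "linocut"]

def build_prompt(subject, modifiers):
    """Join a subject with comma-separated style modifiers into one prompt string."""
    return ", ".join([subject] + modifiers)

for n in range(len(STYLE_MODIFIERS) + 1):
    print(build_prompt(SUBJECT, STYLE_MODIFIERS[:n]))  # each refinement would be re-submitted to the model
```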

💡Latent Space

In the context of deep learning, latent space is a multidimensional mathematical space where data points are represented by their relationships rather than their raw input form. The video explains that the new images generated by AI do not come from copying the training data but from this latent space, where the AI finds variables to represent concepts like 'banana-ness' or '1960s photo texture'.
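A minimal numerical sketch of that idea, with everything invented for illustration (this is not DALL-E's or Midjourney's actual architecture): concepts such as "banana-ness" are directions in a high-dimensional space, a prompt maps to a point in that space, and the clusters nearest that point describe what the generated image should contain.

```python
import numpy as np

rng = np.random.default_rng(0)
DIMS = 512  # the video describes a latent space with more than 500 dimensions

# Invented concept clusters: each concept is a direction in the latent space.
concepts = {name: rng.normal(size=DIMS) for name in
            ["banana", "snow globe", "1960s photo texture", "balloon"]}

def embed_prompt(prompt):
    """Toy 'text encoder': average the vectors of the concepts named in the prompt."""
    hits = [vec for name, vec in concepts.items() if name in prompt]
    return np.mean(hits, axis=0) if hits else rng.normal(size=DIMS)

def nearest_concepts(point, k=2):
    """Rank concept clusters by cosine similarity to a point in the latent space."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(concepts, key=lambda name: cos(point, concepts[name]), reverse=True)[:k]

point = embed_prompt("a banana inside a snow globe, 1960s photo texture")
print(nearest_concepts(point))  # the prompt 'navigates' to a region near these clusters
```

A real model learns these directions from image-caption pairs rather than drawing them at random; the point here is only that generation means picking a location in the space, not looking up a training image.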

💡Diffusion

Diffusion, in the context of AI image generation, refers to the generative process that starts with noise and iteratively refines it into a coherent image. The script describes this as a key step in translating a point in the latent space into an actual image, with an element of randomness ensuring that the same prompt does not always produce the same image.
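The loop below is only a toy numerical sketch of that denoising idea (a real diffusion model uses a trained neural network at every step, and its target comes from the prompt's location in latent space): it starts from pure noise and repeatedly nudges the pixels toward a target while injecting a little fresh randomness, so the same "prompt" never produces exactly the same result. The 4x4 "image" and the fixed target are invented for illustration.

```python
import numpy as np

def toy_diffusion(target, steps=50, noise_scale=0.1, seed=None):
    """Start from noise; each iteration moves the image toward the target
    while adding a little fresh randomness (which shrinks as steps progress)."""
    rng = np.random.default_rng(seed)
    img = rng.normal(size=target.shape)               # begin with pure noise
    for t in range(steps):
        img = img + 0.2 * (target - img)              # denoising step toward what the prompt asks for
        img = img + noise_scale * rng.normal(size=target.shape) * (1 - t / steps)
    return img

target = np.linspace(0, 1, 16).reshape(4, 4)          # stand-in for the prompt's 'recipe'
a = toy_diffusion(target, seed=1)
b = toy_diffusion(target, seed=2)
print(float(np.abs(a - b).mean()))                    # same prompt, slightly different images
```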

💡Midjourney

Midjourney is a company mentioned in the script that has built a community and technology for text-to-image generation. It operates through a Discord server whose bots turn text prompts into images in under a minute, demonstrating the practical application and accessibility of this technology.

💡Copyright Questions

Copyright questions pertain to the legal issues surrounding the use of copyrighted material in AI training and the generation of new content. The video raises concerns about the use of artists' styles and images in training datasets without their consent, and the potential for AI-generated images to infringe on existing copyrights.

💡Bias in AI

Bias in AI refers to the tendency of AI systems to reflect and perpetuate the biases present in their training data. The script points out that the latent space of AI models may contain biased associations learned from the internet, such as stereotypical representations of certain professions or cultural groups.

💡Cultural Representation

Cultural representation in AI is the extent to which AI models capture and reflect the diversity of human cultures. The video discusses the potential for bias in AI, noting that some cultures may not be represented at all in training datasets, leading to an incomplete or skewed reflection of the world in AI-generated content.

Highlights

In 2015, AI research saw a major development in automated image captioning, where machine learning algorithms began to describe images in natural language.

Researchers were curious about reversing the process to generate images from text, aiming to create novel scenes not found in the real world.

The initial experiments in text-to-image generation resulted in simple, abstract images that were far from realistic.

AI-generated images have come a long way in a short time, with dramatic improvements observed within just one year.

The technology now allows for the creation of images from text prompts without the need for traditional art tools.

OpenAI announced DALL-E, a model capable of creating images from a wide range of text captions, with an even more advanced version, DALL-E 2, on the horizon.

Independent developers have built their own text-to-image generators using pre-trained models, making this technology accessible to the public.

Midjourney, a company with a Discord community, allows users to generate images from text in under a minute using bots.

The art of communicating with deep learning models to generate desired images has been termed 'prompt engineering'.

The process of image generation from text involves a mathematical 'latent space' where the model finds variables to represent different concepts.

Deep learning models learn to distinguish between images by identifying patterns and variables that humans may not recognize.

The generative process called 'diffusion' is used to translate points in latent space into actual images, starting with noise and arranging pixels into a coherent composition.

The technology raises copyright and ethical questions, as it can replicate an artist's style without using their images, just by mentioning their name in the prompt.

The latent space of these models may contain biases and stereotypes learned from the internet, reflecting societal prejudices.

The technology's impact extends beyond the immediate technical consequences, potentially changing the way humans imagine, communicate, and work with their own culture.

The video includes a bonus section with insights from creative professionals on the implications of text-to-image technology for those who make a living creating images.

Transcripts

play00:00

Seven years ago, back in 2015,  

play00:02

one major development in AI research  was automated image captioning.

play00:07

Machine learning algorithms could  already label objects in images,  

play00:10

and now they learned to put those labels  into natural language descriptions.

play00:14

And it made one group of researchers curious.

play00:17

What if you flipped that process around?

play00:19

We could do image to text.

play00:22

Why not try doing text to  images and see how it works?

play00:26

It was a more difficult task. They didn’t want

play00:28

to retrieve existing images the way Google Search does.

play00:31

They wanted to generate entirely novel scenes that didn’t happen in the real world.

play00:35

So they asked their computer model for something it would have never seen before.

play00:39

Like all the school buses you've seen are yellow.

play00:42

But if you write “the red or green school bus”  would it actually try to generate something green?

play00:47

And it did that.

play00:51

It was a 32 by 32 tiny image.

play00:54

And then all you could see is like a  blob of something on top of something.

play00:58

They tried some other prompts like “A herd  of elephants flying in the blue skies”.

play01:02

“A vintage photo of a cat.”

play01:04

“A toilet seat sits open in the grass field.”

play01:07

And “a bowl of bananas is on the table.”

play01:11

Maybe not something to hang on your wall  but the 2016 paper from those researchers  

play01:15

showed the potential for what might  become possible in the future.

play01:19

And uh... the future has arrived.

play01:24

It is almost impossible to overstate how far  the technology has come in just one year.

play01:30

By leaps and bounds. Leaps and bounds.

play01:32

Yeah, it's been quite dramatic.

play01:36

I don’t know anyone who  hasn’t immediately been like

play01:39

“What is this? What is happening here?”

play01:46

Could I say like watching waves crashing?

play01:51

Party hat guy.

play01:52

Seafoam dreams.

play01:53

A coral reef. Cubism.

play01:54

Caterpillar.

play01:55

A dancing taco.

play01:56

My prompt is Salvador Dali painting  the skyline of New York City.

play02:02

You may be thinking, wait  AI-generated images aren’t new.

play02:06

You probably heard about this generated portrait  going for over $400,000 at auction back in 2018.

play02:12

Or this installation of morphing portraits,  which Sotheby’s sold the following year.

play02:17

It was created by Mario Klingemann, who  explained to me that that type of AI  

play02:21

art required him to collect a specific dataset of  images and train his own model to mimic that data.

play02:27

Let's say, Oh, I want to create landscapes,  so I collect a lot of landscape images.

play02:31

I want to create portraits,  I trained on portraits.

play02:34

But then the portrait model would not  really be able to create landscapes.

play02:38

Same with those hyper realistic  fake faces that have been plaguing  

play02:41

LinkedIn and Facebook – those come from a model that only knows how to make faces.

play02:46

Generating a scene from any combination of words  requires a different, newer, bigger approach.

play02:52

Now we kind of have these huge  models, which are so huge that  

play02:56

somebody like me actually cannot train  them anymore on their own computer.

play03:00

But once they are there, they are  really kind of— they contain everything.

play03:05

I mean, to a certain extent.

play03:07

What this means is that we can now  create images without having to actually  

play03:10

execute them with paint or  cameras or pen tools or code.

play03:14

The input is just a simple line of text.

play03:18

I'll get to how this tech works later in the video  

play03:21

but to understand how we got here,  we have to rewind to January 2021

play03:26

When a major AI company called OpenAI announced DALL-E – which they named after these guys.

play03:32

They said it could create images from text  captions for a wide range of concepts.

play03:36

They recently announced DALL-E 2, which promises more realistic results and seamless editing.

play03:42

But they haven’t released  either version to the public.

play03:45

So over the past year, a community of  independent, open-source developers  

play03:49

built text-to-image generators out of other  pre-trained models that they did have access to.

play03:54

And you can play with those online for free.

play03:56

Some of those developers are now working  for a company called Midjourney, 

play03:59

which created a Discord community with bots that  turn your text into images in less than a minute.

play04:06

Having basically no barrier to entry to  this has made it like a whole new ballgame.

play04:12

I've been up until like two  or three in the morning.

play04:14

Just really trying to change things, piece things together.

play04:18

I've done about 7,000 images. It’s ridiculous.

play04:21

Midjourney currently has a waitlist for subscriptions, but we got a chance to try it out.

play04:29

"Go ahead and take a look."

play04:31

“Oh wow. That is so cool”

play04:36

“It has some work to do. I feel like it can  be — it’s not dancing and it could be better.”

play04:44

The craft of communicating  with these deep learning  

play04:46

models has been dubbed “prompt engineering”.

play04:49

What I love about prompting  for me, it's kind of really  

play04:52

that has something like magic where you have to  know the right words for that, for the spell.

play04:57

You realize that you can refine  the way you talk to the machine.

play05:00

It becomes a kind of a dialog.

play05:03

You can say like “octane render blender 3D”.

play05:06

Made with Unreal Engine...

play05:07

...certain types of film lenses and cameras...

play05:10

...1950s, 1960s...

play05:12

...dates are really good.

play05:14

...lino cut or wood cut...

play05:15

Coming up with funny pairings, like a Faberge Egg McMuffin.

play05:19

A monochromatic infographic poster about  typography depicting Chinese characters.

play05:24

Some of the most striking images  can come from prompting the model  

play05:27

to synthesize a long list of concepts.

play05:30

It's kind of like it's having a very strange  collaborator to bounce ideas off of and get  

play05:35

unpredictable ideas back.

play05:42

I love that!

play05:44

My prompt was "chasing seafoam dreams,"

play05:47

which is a lyric from the Ted Leo and the Pharmacists' song "Biomusicology."

play05:51

Can I use this as the album cover for my first album? "Absolutely."

play05:55

Alright.

play05:58

For an image generator to be able to  respond to so many different prompts, 

play06:01

it needs a massive, diverse training dataset.

play06:03

Like hundreds of millions of images scraped from  the internet, along with their text descriptions.

play06:08

Those captions come from things like the alt text  that website owners upload with their images,  

play06:12

for accessibility and for search engines.

play06:15

So that’s how the engineers  get these giant datasets.
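A toy illustration of where such image-caption pairs could come from, using Python's standard-library HTML parser to pull alt text out of <img> tags. The example page below is invented; real datasets of this kind are scraped at the scale of hundreds of millions of images, with heavy filtering.

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags -- the caption source
    the video describes (alt text uploaded for accessibility and search engines)."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            if attributes.get("src") and attributes.get("alt"):
                self.pairs.append((attributes["src"], attributes["alt"]))

# Invented example page; a real crawl would process billions of pages.
page = '<p>Field trip!</p><img src="/bus.jpg" alt="a green school bus parked on the street">'
collector = AltTextCollector()
collector.feed(page)
print(collector.pairs)  # [('/bus.jpg', 'a green school bus parked on the street')]
```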

play06:18

But then what do the models actually do with them?

play06:21

We might assume that when  we give them a text prompt,  

play06:24

like “a banana inside a snow globe from 1960.”

play06:27

They search through the training data  to find related images and then copy  

play06:30

over some of those pixels. But  that’s not what’s happening.

play06:35

The new generated image doesn’t  come from the training data,  

play06:38

it comes from the “latent space”  of the deep learning model.

play06:41

That’ll make sense in a minute, first  let’s look at how the model learns.

play06:45

If I gave you these images and told you to match  them to these captions, you’d have no problem.

play06:50

But what about now, this is  what images look like to a  

play06:53

machine: just pixel values for red, green, and blue.

play06:56

You’d just have to make a guess, and  that’s what the computer does too at first.

play07:00

But then you could go through  thousands of rounds of this  

play07:02

and never figure out how to get better at it.

play07:04

Whereas a computer can eventually figure out a method that works. That’s what deep learning does.

play07:10

In order to understand that this arrangement  of pixels is a banana, and this arrangement  

play07:13

of pixels is a balloon, it looks for metrics that  help separate these images in mathematical space.

play07:19

So how about color? If we measure  the amount of yellow in the image,  

play07:23

that would put the banana over here and the  balloon over here in this one-dimensional space.

play07:28

But then what if we run into this:

play07:30

Now our yellowness metric isn’t very  good at separating bananas from balloons.

play07:34

We need a different variable.

play07:36

Let’s add an axis for roundness.

play07:38

Now we’ve got a two dimensional space with the  round balloons up here and the banana down here.

play07:44

But if we look at more data we may come  across a banana that’s pretty round,  

play07:47

and a balloon that isn’t.

play07:49

So maybe there’s some way to measure shininess.

play07:52

Balloons usually have a shiny spot.

play07:55

Now we have a three dimensional space.

play07:57

And ideally, when we get a new image we  can measure those 3 variables and see  

play08:01

whether it falls in the banana region  or the balloon region of the space.

play08:05

But what if we want our model to recognize,  

play08:07

not just bananas and balloons,  but…all these other things.

play08:10

Yellowness, roundness, and shininess don’t  capture what’s distinct about these objects.

play08:19

That’s what deep learning algorithms do  as they go through all the training data.

play08:22

They find variables that help improve their  performance on the task and in the process,  

play08:27

they build out a mathematical space  with way more than 3 dimensions.
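A small sketch of the three-axis illustration above, with invented numbers: each image becomes a point (yellowness, roundness, shininess), and a new image is assigned to whichever region's centroid it lands closest to. A deep learning model does the analogous thing automatically, with hundreds of dimensions and variables humans would not have names for.

```python
import numpy as np

# Invented (yellowness, roundness, shininess) measurements for a few training images.
bananas  = np.array([[0.9, 0.2, 0.1], [0.8, 0.4, 0.2], [0.7, 0.5, 0.1]])
balloons = np.array([[0.9, 0.9, 0.8], [0.3, 0.8, 0.9], [0.5, 0.9, 0.7]])

centroids = {"banana": bananas.mean(axis=0), "balloon": balloons.mean(axis=0)}

def classify(point):
    """Which region of the 3-D feature space does a new image fall into?"""
    return min(centroids, key=lambda name: np.linalg.norm(point - centroids[name]))

print(classify(np.array([0.85, 0.30, 0.15])))  # lands in the banana region
print(classify(np.array([0.40, 0.85, 0.80])))  # lands in the balloon region
```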

play08:31

We are incapable of picturing multidimensional space, but Midjourney's model offered this and I like it.

play08:37

So we’ll say this represents the latent space of the model. And it has more than 500 dimensions.

play08:43

Those 500 axes represent variables that  humans wouldn’t even recognize or have  

play08:48

names for but the result is that  the space has meaningful clusters:

play08:51

A region that captures the essence of banana-ness.

play08:54

A region that represents the textures  and colors of photos from the 1960s.

play08:59

An area for snow and an area for globes  and snowglobes somewhere in between.

play09:05

Any point in this space can be thought  of as the recipe for a possible image.

play09:10

The text prompt is what navigates us to that  location. But then there’s one more step.

play09:16

Translating a point in that mathematical  space into an actual image involves a  

play09:22

generative process called diffusion.  It starts with just noise and then,  

play09:26

over a series of iterations, arranges pixels  into a composition that makes sense to humans.

play09:33

Because of some randomness in the process,  

play09:34

it will never return exactly the  same image for the same prompt.

play09:38

And if you enter the prompt into a  different model designed by different  

play09:41

people and trained on different  data, you’ll get a different result.

play09:44

Because you’re in a different latent space.

play09:58

No way. That is so cool. What the heck? The brush  strokes, the color palette. That’s fascinating.

play10:06

I wish I could like — I mean he’s dead,  but go up to him and be like, "Look what I have!"

play10:14

Oh that’s pretty cool. Probably the only Dalí that I could afford anyways.

play10:21

The ability of deep learning to extract  patterns from data means that you can copy an  

play10:25

artist’s style without copying their images,  just by putting their name in the prompt.

play10:32

James Gurney is an American illustrator who  

play10:34

became a popular reference for  users of text to image models.

play10:38

I asked him what kind of norms he would like  to see as prompting becomes widespread.

play10:43

I think it's only fair to  people looking at this work  

play10:45

that they should know what the prompt  was and also what software was used.

play10:50

Also I think the artists should be allowed  to opt in or opt out of having their work  

play10:54

that they worked so hard on by hand be used  as a dataset for creating this other artwork.

play10:59

James Gurney, I think he was a  great example of being someone  

play11:03

who was open to it, started  talking with the artists.

play11:07

But I also heard of other artists  who got actually extremely upset.

play11:13

The copyright questions regarding  the images that go into training the  

play11:16

models and the images that come out  of them…are completely unresolved.

play11:20

And those aren’t the only questions  that this technology will provoke.

play11:24

The latent space of these models contains some  

play11:26

dark corners that get scarier as  outputs become photorealistic.

play11:30

It also holds an untold number  of associations that we wouldn’t  

play11:33

teach our children but that  it learned from the internet.

play11:36

If you ask an image of the CEO,  it's like an old white guy.

play11:40

If you ask for images of  nurses, they're all like women.

play11:43

We don’t know exactly what’s in the  datasets used by OpenAI or Midjourney.

play11:47

But we know the internet is biased toward  the English language and western concepts,  

play11:51

with whole cultures not represented at all.

play11:53

In one open-sourced dataset,  

play11:55

the word “asian” is represented first  and foremost by an avalanche of porn.

play12:01

It really is just sort of an infinitely complex  mirror held up to our society and what we  

play12:08

deemed worthy enough to, you know, put  on the internet in the first place and  

play12:12

how we think about what we do put up.

play12:16

But what makes this technology so  unique is that it enables any of  

play12:19

us to direct the machine to  imagine what we want it to see.

play12:23

Party hat guy, space invader, caterpillar, and a ramen bowl.

play12:29

Prompting removes the obstacles between ideas  and images, and eventually videos, animations,  

play12:35

and whole virtual worlds.

play12:36

We are on a voyage here, that  is it's a bigger deal than  

play12:42

just like one decade or the immediate technical consequences.

play12:46

It's a change in the way humans imagine,  communicate, work with their own culture  

play12:50

And that will have long range,  good and bad consequences that we  

play12:56

are, just by definition, not going to be capable of completely anticipating.

play13:05

Over the course of researching this video I spoke to a bunch of creative people

play13:09

who have played with these tools.

play13:11

And I asked them what they think this all means for people who make a living making images.

play13:16

The human artists and illustrators and designers and stock photographers out there.

play13:22

And they had a lot of interesting things to say.

play13:24

So I've compiled them into a bonus video.

play13:27

Please check it out and add your own thoughts in the comments. Thank you for watching.


Related Tags
AI Art, Image Captioning, Deep Learning, Text-to-Image, DALL-E, Midjourney, Prompt Engineering, Data Bias, Creativity, Copyright Issues, Tech Innovation