How we teach computers to understand pictures | Fei-Fei Li
Summary
TL;DR: Fei-Fei Li discusses advancements in computer vision and artificial intelligence, highlighting the challenges of teaching machines to interpret visual information like humans. Through her work with Stanford's Vision Lab and the ImageNet project, Li illustrates how vast data sets help train computers to recognize objects, generate sentences, and understand complex visual scenes. Despite progress, machines still struggle with deeper comprehension. Li envisions a future where computers assist in healthcare, safety, and exploration, emphasizing the potential of AI to improve human life by augmenting our ability to see and understand the world.
Takeaways
- 👶 A three-year-old child can easily describe what they see in photos, demonstrating how natural it is for humans to interpret visual information.
- 🧠 Despite technological advancements, computers still struggle to interpret visual data in the way humans do because they lack true understanding.
- 🚗 Computer vision is essential for applications like self-driving cars, which need to differentiate between various objects to function safely.
- 👁️ Vision is not just about the eyes but involves complex brain processing, which has evolved over millions of years.
- 🔬 Fei-Fei Li's research at Stanford's Vision Lab focuses on teaching computers to see and understand like humans through computer vision and machine learning.
- 🐱 Simple object recognition for computers is challenging due to the infinite variations in appearance, positioning, and context of objects like cats.
- 📊 The ImageNet project, launched in 2007, created a massive dataset of labeled images to help train computer vision algorithms, drawing on millions of images sourced from the internet.
- 💡 The combination of big data (ImageNet) and convolutional neural networks (a type of machine learning algorithm) has led to significant progress in object recognition.
- 🧩 Computer vision algorithms have evolved from recognizing individual objects to generating human-like sentences that describe entire scenes.
- 🤖 Although there have been advancements, current AI still struggles with more nuanced understanding, like context, emotions, or cultural significance in images.
Q & A
What is the main task that a three-year-old child is an expert at, according to Fei-Fei Li?
-A three-year-old child is an expert at making sense of what they see, describing the world based on visual perception.
What is the current limitation of advanced machines and computers, despite technological progress?
-Despite technological progress, advanced machines and computers still struggle with understanding and interpreting visual information like humans do.
Why is it difficult for computers to interpret visual information, such as distinguishing a crumpled paper bag from a rock on the road?
-It's difficult because computers do not naturally understand the meaning behind visual data. Cameras capture pixels, but those pixels lack the semantic meaning needed to interpret complex situations accurately.
How does Fei-Fei Li's research aim to improve computer vision?
-Fei-Fei Li's research aims to teach computers to see and understand visual information by leveraging large datasets and machine learning algorithms, similar to how a child learns from real-world experiences.
What was the significance of the ImageNet project in advancing computer vision?
-The ImageNet project provided an extensive dataset of 15 million labeled images, enabling computers to learn from a vast range of visual examples and significantly improving the accuracy of object recognition algorithms.
Why did Fei-Fei Li emphasize the importance of providing computers with 'training data' similar to what a child experiences?
-She emphasized that instead of focusing solely on improving algorithms, it's crucial to expose computers to large quantities of real-world examples, just like a child who learns by seeing millions of images throughout early development.
What role did convolutional neural networks play in advancing computer vision?
-Convolutional neural networks, which mimic the structure of the human brain with layers of interconnected neurons, became a breakthrough architecture in computer vision, enabling better object recognition when trained with the massive data from ImageNet.
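The talk itself contains no code, but the core operation a convolutional layer performs can be sketched in a few lines. The following is a minimal illustration in plain Python (no deep-learning framework assumed); the image, kernel values, and function names are made up for demonstration and are not from the talk:

```python
# Minimal sketch of the convolution at the heart of a CNN layer.
# Values and shapes are illustrative only.

def conv2d(image, kernel):
    """Slide a small kernel over a 2D image (valid padding, stride 1)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            # Each output value is a weighted sum over a local patch:
            # the "receptive field" of one neuron-like node.
            s = sum(image[y + dy][x + dx] * kernel[dy][dx]
                    for dy in range(kh) for dx in range(kw))
            row.append(s)
        out.append(row)
    return out

def relu(feature_map):
    """Non-linearity applied after the layer."""
    return [[max(0.0, v) for v in row] for row in feature_map]

# A toy 4x4 "image" (dark left half, bright right half)
# and a 2x2 kernel that responds to left-to-right edges.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]

feature_map = relu(conv2d(image, kernel))
# The map lights up along the boundary between the two halves.
# Stacking many such layers, each feeding the next, gives the
# hierarchical structure the talk compares to the brain.
```

A real network learns its kernel values from labeled data such as ImageNet rather than having them hand-picked as here.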
What limitations still exist in current computer vision systems, as demonstrated in the TED talk?
-Current computer vision systems still make mistakes, such as confusing objects like a toothbrush for a baseball bat or misinterpreting artistic images. These limitations show that computers are far from understanding the world with the nuance and depth of human perception.
How does Fei-Fei Li envision the future of visual intelligence in machines?
-She envisions a future where machines collaborate with humans, assisting in tasks like diagnosing patients, navigating disaster zones, and discovering new materials. Machines with visual intelligence will enhance human capabilities in ways previously unimaginable.
What example does Fei-Fei Li give to illustrate the deeper understanding that computers currently lack in visual perception?
-She gives the example of her son Leo's birthday cake picture. While a computer can identify objects like 'a person and a cake,' it lacks the deeper context—such as knowing the cake is an Italian Easter cake or understanding the boy's emotional connection to his shirt, which was a gift.
Outlines
👀 The Challenge of Teaching Computers to See
The paragraph introduces the concept of computer vision as a frontier in computer science, comparing the human ability to make sense of visual information with the struggle that advanced machines face in performing the same task. It highlights the importance of computer vision in various applications such as self-driving cars, environmental monitoring, and security but points out the limitations in current technology. The speaker, Fei-Fei Li, gives an overview of her research journey in computer vision and machine learning, emphasizing the need to teach computers to see objects as a foundational step towards achieving artificial intelligence that can understand and interpret visual data like humans.
🐱 The Complexity of Object Recognition
This paragraph delves into the complexity of recognizing objects, using the example of a cat to illustrate the challenge. It discusses how early attempts to model objects were simplistic and failed to account for the vast variations in appearance and perspective. The speaker then shares a pivotal realization that children learn to see through experience and exposure to a vast number of real-world examples. This insight led to the creation of the ImageNet project, which aimed to amass a large dataset of labeled images to train computer vision algorithms. The project's success in collecting and labeling millions of images is detailed, along with the challenges faced in securing funding and support for this novel approach.
🧠 The Neural Network Revolution in Computer Vision
The paragraph explains the architecture of neural networks, drawing an analogy between the brain's neurons and the nodes in a neural network. It discusses how these networks are organized in layers, similar to the brain's structure, and how they are trained using massive datasets like ImageNet. The paragraph highlights the breakthroughs in object recognition that were achieved through the use of convolutional neural networks (CNNs), which were fed by the extensive data from ImageNet. The speaker describes how these algorithms can now identify objects in images with a high degree of accuracy and even generate descriptions of scenes, marking a significant advancement in computer vision.
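The layered, neuron-like structure described above can be made concrete with a toy forward pass. This is a hedged sketch, not the talk's actual model: the weights, sizes, and names below are invented for illustration, and real networks learn their parameters from data:

```python
import math

def neuron(inputs, weights, bias):
    """A neuron-like node: weighted sum of its inputs, squashed to (0, 1)."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

def layer(inputs, weight_rows, biases):
    """One layer: every node reads the outputs of the previous layer."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two tiny layers; the model described in the talk has ~24 million nodes
# and ~140 million parameters, all learned rather than hand-picked.
x = [0.5, -0.2, 0.1]                      # e.g. three pixel features
hidden = layer(x, [[0.4, 0.3, -0.5],      # weights are illustrative only
                   [0.1, -0.2, 0.7]],
               [0.0, 0.1])
output = layer(hidden, [[1.2, -0.8]], [0.0])
# output[0] is a score between 0 and 1,
# e.g. "how confident the model is that this is a cat".
```

Training adjusts every weight and bias so that such scores agree with the labels in the dataset; the sketch shows only the forward (inference) direction.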
🚀 Advancing from Object Recognition to Scene Understanding
In this paragraph, the speaker discusses the next steps in computer vision: teaching computers not just to recognize objects but to understand the context and narrative of a scene. The paragraph describes the integration of visual data with natural language processing to generate descriptive sentences about images. The speaker shares examples of the computer's progress, including both its successes and its humorous mistakes. The paragraph concludes with a vision for the future where computers with visual intelligence can collaborate with humans, enhancing our capabilities in various fields such as medicine, transportation, and exploration.
Keywords
💡Computer Vision
💡Machine Learning
💡ImageNet
💡Convolutional Neural Networks (CNNs)
💡Big Data
💡Object Recognition
💡Neural Networks
💡Deep Learning
💡Algorithms
💡Artificial Intelligence (AI)
💡Data Annotation
Highlights
A three-year-old child describes images, highlighting the innate ability of humans to make sense of visual information.
Despite technological advancements, machines still struggle with basic visual understanding tasks that even young children can perform.
Fei-Fei Li discusses the challenges of computer vision, emphasizing the complexity of visual processing in machines compared to human brains.
Computer vision involves teaching machines to recognize objects, people, and understand relationships, emotions, and actions from visual data.
The ImageNet project was launched in 2007 to provide a large dataset of labeled images to improve machine learning algorithms for object recognition.
ImageNet downloaded nearly a billion candidate images from the internet and used crowdsourcing to clean, sort, and label them, producing one of the largest labeled datasets of its kind.
The convolutional neural network (CNN), an algorithm inspired by the human brain, became a successful approach for object recognition when combined with ImageNet data.
CNNs consist of millions of interconnected nodes organized in hierarchical layers, which process visual data similarly to how the human brain functions.
The ImageNet dataset allowed CNNs to achieve remarkable results in identifying objects in images, leading to significant advances in computer vision.
Computer vision models can now generate sentences that describe images, showing progress towards integrating vision and language in machines.
Despite advances, current computer vision models still make mistakes, such as misidentifying objects due to insufficient or biased training data.
The next step in computer vision is to move beyond object recognition to understanding context, stories, and complex scenes as humans do.
Fei-Fei Li envisions a future where machines with visual intelligence assist in healthcare, transportation, disaster response, and exploration.
The goal of computer vision is not just to create intelligent machines, but to collaborate with them to enhance human capabilities and explore new possibilities.
Fei-Fei Li emphasizes her personal motivation to advance computer vision: to create a better future for the next generation, represented by her son Leo.
Transcripts
Let me show you something.
(Video) Girl: Okay, that's a cat sitting in a bed.
The boy is petting the elephant.
Those are people that are going on an airplane.
That's a big airplane.
Fei-Fei Li: This is a three-year-old child
describing what she sees in a series of photos.
She might still have a lot to learn about this world,
but she's already an expert at one very important task:
to make sense of what she sees.
Our society is more technologically advanced than ever.
We send people to the moon, we make phones that talk to us
or customize radio stations that can play only music we like.
Yet, our most advanced machines and computers
still struggle at this task.
So I'm here today to give you a progress report
on the latest advances in our research in computer vision,
one of the most frontier and potentially revolutionary
technologies in computer science.
Yes, we have prototyped cars that can drive by themselves,
but without smart vision, they cannot really tell the difference
between a crumpled paper bag on the road, which can be run over,
and a rock that size, which should be avoided.
We have made fabulous megapixel cameras,
but we have not delivered sight to the blind.
Drones can fly over massive land,
but don't have enough vision technology
to help us to track the changes of the rainforests.
Security cameras are everywhere,
but they do not alert us when a child is drowning in a swimming pool.
Photos and videos are becoming an integral part of global life.
They're being generated at a pace that's far beyond what any human,
or teams of humans, could hope to view,
and you and I are contributing to that at this TED.
Yet our most advanced software is still struggling at understanding
and managing this enormous content.
So in other words, collectively as a society,
we're very much blind,
because our smartest machines are still blind.
"Why is this so hard?" you may ask.
Cameras can take pictures like this one
by converting lights into a two-dimensional array of numbers
known as pixels,
but these are just lifeless numbers.
They do not carry meaning in themselves.
Just like to hear is not the same as to listen,
to take pictures is not the same as to see,
and by seeing, we really mean understanding.
In fact, it took Mother Nature 540 million years of hard work
to do this task,
and much of that effort
went into developing the visual processing apparatus of our brains,
not the eyes themselves.
So vision begins with the eyes,
but it truly takes place in the brain.
So for 15 years now, starting from my Ph.D. at Caltech
and then leading Stanford's Vision Lab,
I've been working with my mentors, collaborators and students
to teach computers to see.
Our research field is called computer vision and machine learning.
It's part of the general field of artificial intelligence.
So ultimately, we want to teach the machines to see just like we do:
naming objects, identifying people, inferring 3D geometry of things,
understanding relations, emotions, actions and intentions.
You and I weave together entire stories of people, places and things
the moment we lay our gaze on them.
The first step towards this goal is to teach a computer to see objects,
the building block of the visual world.
In its simplest terms, imagine this teaching process
as showing the computers some training images
of a particular object, let's say cats,
and designing a model that learns from these training images.
How hard can this be?
After all, a cat is just a collection of shapes and colors,
and this is what we did in the early days of object modeling.
We'd tell the computer algorithm in a mathematical language
that a cat has a round face, a chubby body,
two pointy ears, and a long tail,
and that looked all fine.
But what about this cat?
(Laughter)
It's all curled up.
Now you have to add another shape and viewpoint to the object model.
But what if cats are hidden?
What about these silly cats?
Now you get my point.
Even something as simple as a household pet
can present an infinite number of variations to the object model,
and that's just one object.
So about eight years ago,
a very simple and profound observation changed my thinking.
No one tells a child how to see,
especially in the early years.
They learn this through real-world experiences and examples.
If you consider a child's eyes
as a pair of biological cameras,
they take one picture about every 200 milliseconds,
the average time an eye movement is made.
So by age three, a child would have seen hundreds of millions of pictures
of the real world.
That's a lot of training examples.
So instead of focusing solely on better and better algorithms,
my insight was to give the algorithms the kind of training data
that a child was given through experiences
in both quantity and quality.
Once we know this,
we knew we needed to collect a data set
that has far more images than we have ever had before,
perhaps thousands of times more,
and together with Professor Kai Li at Princeton University,
we launched the ImageNet project in 2007.
Luckily, we didn't have to mount a camera on our head
and wait for many years.
We went to the Internet,
the biggest treasure trove of pictures that humans have ever created.
We downloaded nearly a billion images
and used crowdsourcing technology like the Amazon Mechanical Turk platform
to help us to label these images.
At its peak, ImageNet was one of the biggest employers
of the Amazon Mechanical Turk workers:
together, almost 50,000 workers
from 167 countries around the world
helped us to clean, sort and label
nearly a billion candidate images.
That was how much effort it took
to capture even a fraction of the imagery
a child's mind takes in in the early developmental years.
In hindsight, this idea of using big data
to train computer algorithms may seem obvious now,
but back in 2007, it was not so obvious.
We were fairly alone on this journey for quite a while.
Some very friendly colleagues advised me to do something more useful for my tenure,
and we were constantly struggling for research funding.
Once, I even joked to my graduate students
that I would just reopen my dry cleaner's shop to fund ImageNet.
After all, that's how I funded my college years.
So we carried on.
In 2009, the ImageNet project delivered
a database of 15 million images
across 22,000 classes of objects and things
organized by everyday English words.
In both quantity and quality,
this was an unprecedented scale.
As an example, in the case of cats,
we have more than 62,000 cats
of all kinds of looks and poses
and across all species of domestic and wild cats.
We were thrilled to have put together ImageNet,
and we wanted the whole research world to benefit from it,
so in the TED fashion, we opened up the entire data set
to the worldwide research community for free.
(Applause)
Now that we have the data to nourish our computer brain,
we're ready to come back to the algorithms themselves.
As it turned out, the wealth of information provided by ImageNet
was a perfect match to a particular class of machine learning algorithms
called convolutional neural network,
pioneered by Kunihiko Fukushima, Geoff Hinton, and Yann LeCun
back in the 1970s and '80s.
Just like the brain consists of billions of highly connected neurons,
a basic operating unit in a neural network
is a neuron-like node.
It takes input from other nodes
and sends output to others.
Moreover, these hundreds of thousands or even millions of nodes
are organized in hierarchical layers,
also similar to the brain.
In a typical neural network we use to train our object recognition model,
it has 24 million nodes,
140 million parameters,
and 15 billion connections.
That's an enormous model.
Powered by the massive data from ImageNet
and the modern CPUs and GPUs to train such a humongous model,
the convolutional neural network
blossomed in a way that no one expected.
It became the winning architecture
to generate exciting new results in object recognition.
This is a computer telling us
this picture contains a cat
and where the cat is.
Of course there are more things than cats,
so here's a computer algorithm telling us
the picture contains a boy and a teddy bear;
a dog, a person, and a small kite in the background;
or a picture of very busy things
like a man, a skateboard, railings, a lamppost, and so on.
Sometimes, when the computer is not so confident about what it sees,
we have taught it to be smart enough
to give us a safe answer instead of committing too much,
just like we would do,
but other times our computer algorithm is remarkable at telling us
what exactly the objects are,
like the make, model, year of the cars.
We applied this algorithm to millions of Google Street View images
across hundreds of American cities,
and we have learned something really interesting:
first, it confirmed our common wisdom
that car prices correlate very well
with household incomes.
But surprisingly, car prices also correlate well
with crime rates in cities,
or voting patterns by zip codes.
So wait a minute. Is that it?
Has the computer already matched or even surpassed human capabilities?
Not so fast.
So far, we have just taught the computer to see objects.
This is like a small child learning to utter a few nouns.
It's an incredible accomplishment,
but it's only the first step.
Soon, another developmental milestone will be hit,
and children begin to communicate in sentences.
So instead of saying this is a cat in the picture,
you already heard the little girl telling us this is a cat lying on a bed.
So to teach a computer to see a picture and generate sentences,
the marriage between big data and machine learning algorithm
has to take another step.
Now, the computer has to learn from both pictures
as well as natural language sentences
generated by humans.
Just like the brain integrates vision and language,
we developed a model that connects parts of visual things
like visual snippets
with words and phrases in sentences.
About four months ago,
we finally tied all this together
and produced one of the first computer vision models
that is capable of generating a human-like sentence
when it sees a picture for the first time.
Now, I'm ready to show you what the computer says
when it sees the picture
that the little girl saw at the beginning of this talk.
(Video) Computer: A man is standing next to an elephant.
A large airplane sitting on top of an airport runway.
FFL: Of course, we're still working hard to improve our algorithms,
and it still has a lot to learn.
(Applause)
And the computer still makes mistakes.
(Video) Computer: A cat lying on a bed in a blanket.
FFL: So of course, when it sees too many cats,
it thinks everything might look like a cat.
(Video) Computer: A young boy is holding a baseball bat.
(Laughter)
FFL: Or, if it hasn't seen a toothbrush, it confuses it with a baseball bat.
(Video) Computer: A man riding a horse down a street next to a building.
(Laughter)
FFL: We haven't taught Art 101 to the computers.
(Video) Computer: A zebra standing in a field of grass.
FFL: And it hasn't learned to appreciate the stunning beauty of nature
like you and I do.
So it has been a long journey.
To get from age zero to three was hard.
The real challenge is to go from three to 13 and far beyond.
Let me remind you with this picture of the boy and the cake again.
So far, we have taught the computer to see objects
or even tell us a simple story when seeing a picture.
(Video) Computer: A person sitting at a table with a cake.
FFL: But there's so much more to this picture
than just a person and a cake.
What the computer doesn't see is that this is a special Italian cake
that's only served during Easter time.
The boy is wearing his favorite t-shirt
given to him as a gift by his father after a trip to Sydney,
and you and I can all tell how happy he is
and what's exactly on his mind at that moment.
This is my son Leo.
On my quest for visual intelligence,
I think of Leo constantly
and the future world he will live in.
When machines can see,
doctors and nurses will have extra pairs of tireless eyes
to help them to diagnose and take care of patients.
Cars will run smarter and safer on the road.
Robots, not just humans,
will help us to brave the disaster zones to save the trapped and wounded.
We will discover new species, better materials,
and explore unseen frontiers with the help of the machines.
Little by little, we're giving sight to the machines.
First, we teach them to see.
Then, they help us to see better.
For the first time, human eyes won't be the only ones
pondering and exploring our world.
We will not only use the machines for their intelligence,
we will also collaborate with them in ways that we cannot even imagine.
This is my quest:
to give computers visual intelligence
and to create a better future for Leo and for the world.
Thank you.
(Applause)