Computer Vision: Crash Course Computer Science #35
Summary
TLDR: This video introduces the importance and basic principles of computer vision, the field that aims to give computers the ability to extract high-level understanding from digital images and videos. The video first explores pixels and the RGB color model, then shows how a color-tracking algorithm can follow an object in an image. It goes on to explain how convolution kernels can identify edges and other features, such as vertical edges and facial features, and introduces convolutional neural networks (CNNs), a deep learning technique that recognizes complex image features through many layers of convolutions. Finally, the video discusses applications of computer vision in face recognition, emotion recognition, and gesture recognition, and looks ahead to how computer vision will change the way we interact with computers.
Takeaways
- 👀 Computer vision is a sub-field of computer science whose goal is to give computers the ability to extract high-level understanding from digital images and videos.
- 📷 Computers are very good at capturing photos with incredible fidelity and detail, but taking a picture is not the same as "seeing."
- 🔍 One of the simplest computer vision algorithms is tracking an object of a specific color, such as a bright pink ball.
- 🌈 Images are typically stored in computers as grids of pixels, with each pixel defined by a combination of red, green, and blue, known as an RGB value.
- 🔳 Converting an image to grayscale can simplify an algorithm, for example when looking for vertical edges.
- 📏 Using kernels (also called filters), computer vision algorithms can identify edges and other features in images.
- 🤖 Applications such as drone navigation can use edge detection in images to help avoid obstacles safely.
- 🧠 Convolutional neural networks (CNNs) are the current hot algorithms in deep learning; they can learn to recognize interesting features in images.
- 👥 Face recognition algorithms can locate faces in photos and use facial landmarks to determine whether the eyes are open, where the eyebrows are, and more.
- 😀 Emotion recognition algorithms can interpret facial expressions to infer emotional states such as happiness, sadness, frustration, or confusion.
- 🔑 Biometric data, such as facial geometry, allows computers to identify individuals, with applications from smartphone unlocking to government surveillance.
- 🤲 Recent advances in hand and whole-body tracking allow computers to interpret a user's body language and gestures.
Q & A
What is the goal of computer vision?
-The goal of computer vision is to give computers the ability to extract high-level understanding from digital images and videos.
Why are computers said to be better than humans at capturing photos?
-Computers can capture photos with incredible fidelity and detail, better than humans; even so, taking a picture is not the same as truly "seeing."
What is one of the simplest computer vision algorithms, and how does it work?
-One of the simplest computer vision algorithms tracks a colored object, such as a bright pink ball. The algorithm first records the ball's color by taking the RGB value of the centermost pixel, then finds the best match by comparing every pixel in the image against that target color.
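To make that matching step concrete, here is a minimal Python sketch of the color-tracking search described above. The tiny hard-coded image, the target color, and the function names are illustrative assumptions, not from the video:

```python
# Minimal sketch of color-marker tracking: find the pixel whose RGB value
# is closest to a recorded target color.

def color_distance(c1, c2):
    """Squared Euclidean distance between two RGB triples."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def find_closest_pixel(image, target_rgb):
    """Scan every pixel and return the (row, col) of the best color match."""
    best, best_pos = float("inf"), None
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            d = color_distance(pixel, target_rgb)
            if d < best:
                best, best_pos = d, (r, c)
    return best_pos

# A tiny 2x3 "image": mostly green grass, one bright pink pixel.
image = [
    [(30, 120, 30), (32, 118, 28), (255, 105, 180)],
    [(28, 122, 31), (29, 119, 33), (31, 121, 29)],
]
print(find_closest_pixel(image, (250, 100, 175)))  # (0, 2) -- the pink pixel
```

A real tracker would repeat this search on every frame of a video, following the best match over time.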
Why are color-marker tracking algorithms rarely used in practice?
-Color-marker tracking algorithms are rarely used because they are sensitive to changes in lighting, shadows, and other effects; unless the environment can be tightly controlled, tracking can be poor.
What is a convolution, and what role does it play in image processing?
-A convolution applies a matrix, called a kernel or filter, to a patch of pixels in an image. Convolutions can identify edges, shapes, and other features, and are a fundamental operation in image processing and computer vision.
What are Prewitt operators, and what do they do in image processing?
-Prewitt operators are kernels that enhance image edges, highlighting vertical and horizontal edges. Named after their inventor, they are just two examples of the many kernels used for image transformations in computer vision.
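The convolution and the vertical Prewitt operator described above can be sketched for a single pixel like this (the sample patches are made-up grayscale values for illustration):

```python
def convolve_pixel(patch, kernel):
    """Multiply a 3x3 patch element-wise by the kernel and sum the result."""
    return sum(patch[i][j] * kernel[i][j] for i in range(3) for j in range(3))

# Prewitt operator for vertical edges: negative weights on the left,
# positive on the right, so a uniform patch sums to zero.
prewitt_vertical = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

flat_patch = [[50, 50, 50]] * 3   # no left/right color change -> not an edge
edge_patch = [[10, 10, 200]] * 3  # strong left/right difference -> an edge

print(convolve_pixel(flat_patch, prewitt_vertical))  # 0
print(convolve_pixel(edge_patch, prewitt_vertical))  # 570
```

Sliding this over every pixel in an image produces a new image in which high values mark strong vertical edges, as the video describes.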
How do convolutional neural networks (CNNs) work?
-Convolutional neural networks use banks of neurons to process image data; each neuron acts like a filter that can recognize interesting features in the image. Unlike predefined kernels, a neural network can learn its own useful kernels. A CNN passes data through multiple layers of neurons, each convolving its input, gradually building up recognition of complex objects and scenes.
Why do convolutional neural networks usually need many layers?
-CNNs usually need many layers in order to recognize complex objects and scenes. Each layer convolves the output of the previous layer, progressively increasing the complexity of what can be recognized; this is why the technique is considered deep learning.
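A minimal sketch of those "convolutions on convolutions": two hand-picked kernels stand in for what a real CNN would learn (the image values and kernels here are illustrative assumptions, and a real network would have many kernels per layer plus nonlinearities):

```python
def convolve(image, kernel):
    """Slide a 3x3 kernel over every interior pixel; the output shrinks by 2."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            row.append(sum(
                image[r + i - 1][c + j - 1] * kernel[i][j]
                for i in range(3) for j in range(3)
            ))
        out.append(row)
    return out

# Two hand-picked "layers". In a real CNN these kernel values are learned.
vertical_edges = [[-1, 0, 1]] * 3
horizontal_edges = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

image = [[(c // 3) * 100 for c in range(6)] for _ in range(6)]  # 6x6 gradient
layer1 = convolve(image, vertical_edges)       # 4x4 feature map of edges
layer2 = convolve(layer1, horizontal_edges)    # 2x2 map of features-of-edges
print(len(image), len(layer1), len(layer2))    # 6 4 2
```

Each layer's output becomes the next layer's input, which is how deeper layers can respond to progressively more complex features.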
What applications does face recognition in computer vision have?
-Face recognition can be used for automatically unlocking smartphones, for government tracking of people via CCTV cameras, and for smart TVs and intelligent tutoring systems that respond to gestures and emotion, among many other applications.
How does facial landmark tracking help computers understand emotion?
-Facial landmark tracking captures the geometry of the face, such as the distance between the eyes and the height of the forehead. This data can be used to determine whether the eyes are open, the position of the eyebrows, and the shape of the mouth, allowing a computer to infer emotional states such as happiness, sadness, frustration, or confusion.
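As a hypothetical illustration of how simple landmark geometry can be once the landmarks are found, an "are the eyes open" check might compare the lid gap to the eye width. All coordinates, names, and the threshold below are invented for the example; real systems use more landmarks (the eye-aspect-ratio approach is one such method):

```python
import math

def distance(p1, p2):
    """Euclidean distance between two 2D landmark points."""
    return math.dist(p1, p2)

# Hypothetical landmark pixel coordinates for one eye.
upper_lid = (120, 80)
lower_lid = (120, 95)
left_corner, right_corner = (100, 88), (140, 88)

# Lid gap relative to eye width: a small ratio suggests a closed eye.
# The 0.2 threshold is an arbitrary value for this sketch.
ratio = distance(upper_lid, lower_lid) / distance(left_corner, right_corner)
print("open" if ratio > 0.2 else "closed")  # open
```

The same kind of point-to-point geometry underlies eyebrow tracking and smile detection described above.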
What does hand and body-language tracking mean for computer vision?
-Tracking hands and body language lets computers interpret a user's gestures and posture, making possible novel interactive experiences, such as smart TVs and intelligent tutoring systems that respond to gestures and emotion.
What are the latest advances in the field of computer vision?
-At the hardware level, engineers are building better and better cameras; at the software level, more advanced algorithms crunch pixel data to recognize faces, gestures, and more. On top of that, researchers are building novel interactive experiences, such as smart TVs and intelligent tutoring systems that respond to gestures and emotion.
Outlines
👀 Introduction to Computer Vision
Carrie Anne introduces the importance of computer vision: how vision helps us prepare food, avoid obstacles, read street signs, and more. The goal of computer vision is to let computers extract high-level understanding from digital images and videos. Although computers are excellent at capturing images, capturing is not the same as truly "seeing." The video explains how images are stored in computers as grids of pixels, each defined by a combination of red, green, and blue. Carrie Anne then walks through a simple algorithm for tracking a colored object (such as a pink ball): recording the ball's color and finding the best-matching pixel. She also discusses the limitations of color tracking caused by changes in lighting, shadows, and other effects.
🔍 Edge Detection and Convolution
The video goes on to explore how to identify features in images, such as the edges of objects. By converting an image to grayscale and using an algorithm that looks for vertical edges, it explains how edges are made up of many pixels. It introduces a mathematical tool called a "kernel" or "filter" for identifying edges: multiplying the kernel with a patch of pixels and summing the result, an operation called a convolution. Examples show how different kernels, such as the Prewitt operators, enhance image edges, and how kernels can be used to recognize specific shapes.
🧠 Convolutional Neural Networks and Face Recognition
Carrie Anne discusses how convolutional neural networks (CNNs) are used to recognize features in images, such as faces. CNNs process image data through multiple layers of neurons, with each layer recognizing different features, from simple edges up to complex objects like faces. These networks can learn their own kernels for picking out interesting features. She also explains how face recognition is used to pinpoint facial landmarks, such as the tip of the nose and the corners of the mouth, and how this data can be used for emotion detection and face recognition. Finally, she mentions advances in gesture and body-language recognition and how these technologies are changing the way we interact with computers.
🌟 The Future of Computer Vision
The video closes by looking ahead, noting that computers with human-like vision will fundamentally change how we interact with them. Carrie Anne adds that it would be even better if computers could also hear and speak, previews next week's topic on computer hearing and language abilities, and invites viewers to tune in.
Keywords
💡Computer Vision
💡Pixel
💡Convolutional Neural Network
💡Feature Detection
💡Edge Detection
💡Kernel/Filter
💡Face Recognition
💡Emotion Recognition
💡Biometrics
💡Gesture Recognition
💡Deep Learning
Highlights
The field of computer vision aims to give computers the ability to extract high-level understanding from digital images and videos.
Computers capture photos with greater fidelity than humans, but taking pictures is not the same as truly "seeing."
Images are typically stored in computers as grids of pixels, each defined by a combination of red, green, and blue.
One of the simplest computer vision algorithms is tracking a colored object, such as a bright pink ball.
By recording the ball's RGB value, a program can find the pixel with the closest color match.
The algorithm can be applied to every frame of a video, allowing an object to be tracked over time.
Due to lighting, shadows, and other effects, an object's color can vary, leading to poor tracking.
Computer vision algorithms must consider patches of pixels to identify features larger than a single pixel, such as the edges of objects.
Using a kernel (or filter), an algorithm can define the likelihood that a pixel lies on a vertical edge.
Applying a kernel to a patch of pixels is an operation called a convolution.
Different kernels can perform different image transformations, such as sharpening or blurring.
Convolutional neural networks (CNNs) use banks of neurons to process image data, each outputting a new image digested by a different learned kernel.
Through many layers of convolutions, CNNs can recognize complex objects and scenes; this is part of deep learning.
Face recognition algorithms can be applied to identify faces, gestures, and body language, and in turn infer emotional states.
Facial landmark tracking captures the geometry of the face, forming biometric data that can be used to identify individuals.
Computer vision is applied widely, from barcode scanning to self-driving cars and smart TVs.
Advances in computer vision let computers adapt intelligently to their surroundings, enabling context-aware interactive experiences.
Computer vision is an active research area with breakthroughs every year, pointing toward computers with more human-like vision.
Transcripts
Hi, I’m Carrie Anne, and welcome to Crash Course Computer Science!
Today, let’s start by thinking about how important vision can be.
Most people rely on it to prepare food, walk around obstacles, read street signs, watch
videos like this, and do hundreds of other tasks.
Vision is the highest bandwidth sense, and it provides a firehose of information about
the state of the world and how to act on it.
For this reason, computer scientists have been trying to give computers vision for half
a century, birthing the sub-field of computer vision.
Its goal is to give computers the ability to extract high-level understanding from digital
images and videos.
As everyone with a digital camera or smartphone knows, computers are already really good at
capturing photos with incredible fidelity and detail – much better than humans in fact.
But as computer vision professor Fei-Fei Li recently said, "Just like to hear is not
the same as to listen, to take pictures is not the same as to see."
INTRO
As a refresher, images on computers are most often stored as big grids of pixels.
Each pixel is defined by a color, stored as a combination of three additive primary colors:
red, green and blue.
By combining different intensities of these three colors, what’s called a RGB value,
we can represent any color.
Perhaps the simplest computer vision algorithm – and a good place to start – is to track
a colored object, like a bright pink ball.
The first thing we need to do is record the ball’s color.
For that, we’ll take the RGB value of the centermost pixel.
With that value saved, we can give a computer program an image, and ask it to find the pixel
with the closest color match.
An algorithm like this might start in the upper right corner, and check each pixel,
one at a time, calculating the difference from our target color.
Now, having looked at every pixel, the best match is very likely a pixel from our ball.
We’re not limited to running this algorithm on a single photo; we can do it for every
frame in a video, allowing us to track the ball over time.
Of course, due to variations in lighting, shadows, and other effects, the ball on the
field is almost certainly not going to be the exact same RGB value as our target color,
but merely the closest match.
In more extreme cases, like at a game at night, the tracking might be poor.
And if one of the team's jerseys used the same color as the ball, our algorithm would
get totally confused.
For these reasons, color marker tracking and similar algorithms are rarely used, unless
the environment can be tightly controlled.
This color tracking example was able to search pixel-by-pixel, because colors are stored
inside of single pixels.
But this approach doesn’t work for features larger than a single pixel, like edges of
objects, which are inherently made up of many pixels.
To identify these types of features in images, computer vision algorithms have to consider
small regions of pixels, called patches.
As an example, let’s talk about an algorithm that finds vertical edges in a scene, let’s
say to help a drone navigate safely through a field of obstacles.
To keep things simple, we’re going to convert our image into grayscale, although most algorithms
can handle color.
Now let’s zoom into one of these poles to see what an edge looks like up close.
We can easily see where the left edge of the pole starts, because there’s a change in
color that persists across many pixels vertically.
We can define this behavior more formally by creating a rule that says the likelihood
of a pixel being a vertical edge is the magnitude of the difference in color between some pixels
to its left and some pixels to its right.
The bigger the color difference between these two sets of pixels, the more likely the pixel
is on an edge.
If the color difference is small, it’s probably not an edge at all.
The mathematical notation for this operation looks like this – it’s called a kernel
or filter.
It contains the values for a pixel-wise multiplication, the sum of which is saved into the center pixel.
Let’s see how this works for our example pixel.
I’ve gone ahead and labeled all of the pixels with their grayscale values.
Now, we take our kernel, and center it over our pixel of interest.
This specifies what each pixel value underneath should be multiplied by.
Then, we just add up all those numbers.
In this example, that gives us 147.
That becomes our new pixel value.
This operation, of applying a kernel to a patch of pixels, is called a convolution.
Now let’s apply our kernel to another pixel.
In this case, the result is 1.
Just 1.
In other words, it’s a very small color difference, and not an edge.
If we apply our kernel to every pixel in the photo, the result looks like this, where the
highest pixel values are where there are strong vertical edges.
Note that horizontal edges, like those platforms in the background, are almost invisible.
If we wanted to highlight those features, we’d have to use a different kernel – one
that’s sensitive to horizontal edges.
Both of these edge enhancing kernels are called Prewitt Operators, named after their inventor.
These are just two examples of a huge variety of kernels, able to perform many different
image transformations.
For example, here’s a kernel that sharpens images.
And here’s a kernel that blurs them.
Kernels can also be used like little image cookie cutters that match only certain shapes.
So, our edge kernels looked for image patches with strong differences from right to left
or up and down.
But we could also make kernels that are good at finding lines, with edges on both sides.
And even islands of pixels surrounded by contrasting colors.
These types of kernels can begin to characterize simple shapes.
For example, on faces, the bridge of the nose tends to be brighter than the sides of the
nose, resulting in higher values for line-sensitive kernels.
Eyes are also distinctive – a dark circle surrounded by lighter pixels – a pattern other
kernels are sensitive to.
When a computer scans through an image, most often by sliding around a search window, it
can look for combinations of features indicative of a human face.
Although each kernel is a weak face detector by itself, combined, they can be quite accurate.
It’s unlikely that a bunch of face-like features will cluster together if they’re
not a face.
This was the basis of an early and influential algorithm called Viola-Jones Face Detection.
Today, the hot new algorithms on the block are Convolutional Neural Networks.
We talked about neural nets last episode, if you need a primer.
In short, an artificial neuron – which is the building block of a neural network – takes
a series of inputs, and multiplies each by a specified weight, and then sums those values
all together.
This should sound vaguely familiar, because it’s a lot like a convolution.
In fact, if we pass a neuron 2D pixel data, rather than a one-dimensional list of inputs,
it’s exactly like a convolution.
The input weights are equivalent to kernel values, but unlike a predefined kernel, neural
networks can learn their own useful kernels that are able to recognize interesting features
in images.
Convolutional Neural Networks use banks of these neurons to process image data, each
outputting a new image, essentially digested by different learned kernels.
These outputs are then processed by subsequent layers of neurons, allowing for convolutions
on convolutions on convolutions.
The very first convolutional layer might find things like edges, as that’s what a single
convolution can recognize, as we’ve already discussed.
The next layer might have neurons that convolve on those edge features to recognize simple
shapes, comprised of edges, like corners.
A layer beyond that might convolve on those corner features, and contain neurons that
can recognize simple objects, like mouths and eyebrows.
And this keeps going, building up in complexity, until there’s a layer that does a convolution
that puts it together: eyes, ears, mouth, nose, the whole nine yards, and says “ah
ha, it’s a face!”
Convolutional neural networks aren’t required to be many layers deep, but they usually are,
in order to recognize complex objects and scenes.
That’s why the technique is considered deep learning.
Both Viola-Jones and Convolutional Neural Networks can be applied to many image recognition
problems, beyond faces, like recognizing handwritten text, spotting tumors in CT scans and monitoring
traffic flow on roads.
But we’re going to stick with faces.
Regardless of what algorithm was used, once we’ve isolated a face in a photo, we can
apply more specialized computer vision algorithms to pinpoint facial landmarks, like the tip
of the nose and corners of the mouth.
This data can be used for determining things like if the eyes are open, which is pretty
easy once you have the landmarks – it’s just the distance between points.
We can also track the position of the eyebrows; their relative position to the eyes can be
an indicator of surprise, or delight.
Smiles are also pretty straightforward to detect based on the shape of mouth landmarks.
All of this information can be interpreted by emotion recognition algorithms, giving
computers the ability to infer when you’re happy, sad, frustrated, confused and so on.
In turn, that could allow computers to intelligently adapt their behavior... maybe offer tips when
you’re confused, and not ask to install updates when you’re frustrated.
This is just one example of how vision can give computers the ability to be context sensitive,
that is, aware of their surroundings.
And not just the physical surroundings – like if you're at work or on a train – but also
your social surroundings – like if you’re in a formal business meeting versus a friend’s
birthday party.
You behave differently in those surroundings, and so should computing devices, if they’re smart.
Facial landmarks also capture the geometry of your face, like the distance between your
eyes and the height of your forehead.
This is one form of biometric data, and it allows computers with cameras to recognize
you.
Whether it’s your smartphone automatically unlocking itself when it sees you, or governments
tracking people using CCTV cameras, the applications of face recognition seem limitless.
There have also been recent breakthroughs in landmark tracking for hands and whole bodies,
giving computers the ability to interpret a user’s body language, and what hand gestures
they’re frantically waving at their internet connected microwave.
As we’ve talked about many times in this series, abstraction is the key to building
complex systems, and the same is true in computer vision.
At the hardware level, you have engineers building better and better cameras, giving
computers improved sight with each passing year, which I can’t say for myself.
Using that camera data, you have computer vision algorithms crunching pixels to find
things like faces and hands.
And then, using output from those algorithms, you have even more specialized algorithms
for interpreting things like user facial expression and hand gestures.
On top of that, there are people building novel interactive experiences, like smart
TVs and intelligent tutoring systems, that respond to hand gestures and emotion.
Each of these levels are active areas of research, with breakthroughs happening every year.
And that’s just the tip of the iceberg.
Today, computer vision is everywhere – whether it’s barcodes being scanned at stores, self-driving
cars waiting at red lights, or snapchat filters superimposing mustaches.
And, the most exciting thing is that computer scientists are really just getting started,
enabled by recent advances in computing, like super fast GPUs.
Computers with a human-like ability to see are going to totally change how we interact with them.
Of course, it’d also be nice if they could hear and speak, which we’ll discuss next
week.
I’ll see you then.