Computer Vision: Crash Course Computer Science #35

CrashCourse
15 Nov 2017 · 11:09

Summary

TL;DR: This video introduces the importance and basic principles of computer vision, the field that aims to give computers the ability to extract high-level understanding from digital images and videos. It first covers pixels and the RGB color model, then shows how a color-tracking algorithm can follow an object in an image. Next, it explains how convolution kernels can pick out edges and other features, such as vertical edges and facial features. It also introduces convolutional neural networks (CNNs), a deep learning technique that recognizes complex image features through many layers of convolutions. Finally, the video discusses applications of computer vision in face recognition, emotion recognition, and gesture recognition, and looks ahead to how computer vision will change the way we interact with computers.

Takeaways

  • 👀 Computer vision is a sub-field of computer science whose goal is to give computers the ability to extract high-level understanding from digital images and videos.
  • 📷 Computers are extremely good at capturing photos with incredible fidelity and detail, but taking pictures is not the same as "seeing".
  • 🔍 One of the simplest computer vision algorithms is tracking an object of a specific color, such as a bright pink ball.
  • 🌈 Images are typically stored on computers as grids of pixels, with each pixel defined by a combination of red, green, and blue, known as an RGB value.
  • 🔳 Converting an image to grayscale can simplify an algorithm, for example when looking for vertical edges.
  • 📏 Using kernels (also called filters), computer vision algorithms can pick out edges and other features in images.
  • 🤖 Applications such as drone navigation can use edge detection to help safely avoid obstacles.
  • 🧠 Convolutional neural networks (CNNs) are the hot algorithms in deep learning today; they can learn to recognize interesting features in images.
  • 👥 Face recognition algorithms can find faces in photos and use facial landmarks to determine whether the eyes are open, where the eyebrows are, and so on.
  • 😀 Emotion recognition algorithms can interpret facial expressions to infer a person's emotional state, such as happy, sad, frustrated, or confused.
  • 🔑 Biometric data, such as facial geometry, lets computers identify individuals, with applications from smartphone unlocking to government surveillance.
  • 🤲 Recent advances in hand and whole-body tracking let computers interpret a user's body language and gestures.

Q & A

  • What is the goal of computer vision?

    -The goal of computer vision is to give computers the ability to extract high-level understanding from digital images and videos.

  • Why are computers said to be better than humans at capturing photos?

    -Computers can capture photos with incredible fidelity and detail, better than humans can. Even so, taking pictures is not the same as truly "seeing".

  • What is one of the simplest computer vision algorithms, and how does it work?

    -One of the simplest computer vision algorithms tracks a colored object, such as a bright pink ball. The algorithm first records the ball's color by taking the RGB value of its centermost pixel, then finds the best match by comparing every pixel in the image against that target color.
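
    A minimal sketch of this search in Python with NumPy (the use of NumPy, the array layout, and the function name are assumptions for illustration; the video shows no code):

      import numpy as np

      def track_ball(frame, target_rgb):
          """Return the (row, col) of the pixel closest in color to target_rgb.

          frame is an H x W x 3 uint8 image array. As in the video, this
          finds the closest color match, not an exact one, since lighting
          and shadows shift the ball's exact RGB values.
          """
          diff = frame.astype(np.int32) - np.asarray(target_rgb, dtype=np.int32)
          dist = (diff ** 2).sum(axis=2)  # squared RGB distance for every pixel
          return np.unravel_index(np.argmin(dist), dist.shape)

      # Record the target color from the centermost pixel of one frame,
      # then search each later frame to track the ball over time:
      # target = frame0[frame0.shape[0] // 2, frame0.shape[1] // 2]
      # row, col = track_ball(frame1, target)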

  • Why are color-marker tracking algorithms rarely used in practice?

    -Because they are easily thrown off by changes in lighting, shadows, and other effects, color-marker tracking algorithms track poorly unless the environment can be tightly controlled, so they are rarely used.

  • What is a convolution, and what role does it play in image processing?

    -A convolution applies a matrix, called a kernel or filter, to a patch of pixels in an image. Convolutions can pick out edges, shapes, and other features, making them a fundamental operation in image processing and computer vision.
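
    A rough sketch of the operation in Python with NumPy (an illustrative assumption, not code from the video; border pixels are skipped for simplicity):

      import numpy as np

      def convolve(image, kernel):
          """Apply a 3x3 kernel to every interior pixel of a grayscale image.

          For each pixel, the surrounding 3x3 patch is multiplied by the
          kernel element-wise and summed; the sum becomes the new value
          of the center pixel, as described in the video.
          """
          h, w = image.shape
          out = np.zeros((h, w), dtype=np.float64)
          for y in range(1, h - 1):
              for x in range(1, w - 1):
                  patch = image[y - 1:y + 2, x - 1:x + 2]
                  out[y, x] = (patch * kernel).sum()
          return out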

  • What are Prewitt operators, and what do they do in image processing?

    -Prewitt operators are kernels that enhance the edges in an image, highlighting vertical and horizontal edges respectively. Named after their inventor, they are just two examples of the huge variety of kernels used for image transformations in computer vision.
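
    The two Prewitt kernels themselves are small 3x3 matrices; paired with a convolve routine like the sketch above, they can be written as:

      import numpy as np

      # Vertical-edge Prewitt operator: compares the pixels to the left
      # of the center against the pixels to its right.
      prewitt_vertical = np.array([[-1, 0, 1],
                                   [-1, 0, 1],
                                   [-1, 0, 1]])

      # Horizontal-edge Prewitt operator: compares the pixels above the
      # center against the pixels below it.
      prewitt_horizontal = np.array([[-1, -1, -1],
                                     [ 0,  0,  0],
                                     [ 1,  1,  1]])

      # vertical_edges = convolve(grayscale_image, prewitt_vertical)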

  • How do convolutional neural networks (CNNs) work?

    -Convolutional neural networks use banks of neurons to process image data; each neuron acts like a filter that recognizes interesting features in the image. Unlike predefined kernels, a neural network can learn its own useful kernels. A CNN passes data through multiple layers of neurons, each layer convolving on the previous layer's output, gradually building up recognition of complex objects and scenes.
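
    As an illustrative sketch only (the video names no library; PyTorch is an assumption here), a small stack of convolutional layers with learned kernels might look like:

      import torch.nn as nn

      # A toy CNN for 28x28 grayscale images. Each Conv2d layer is a bank
      # of learned kernels; each successive layer convolves on the previous
      # layer's feature maps, building from edge-like features toward shapes.
      model = nn.Sequential(
          nn.Conv2d(1, 8, kernel_size=3, padding=1),   # layer 1: edge-like features
          nn.ReLU(),
          nn.Conv2d(8, 16, kernel_size=3, padding=1),  # layer 2: corners, simple shapes
          nn.ReLU(),
          nn.MaxPool2d(2),                             # downsample 28x28 to 14x14
          nn.Flatten(),
          nn.Linear(16 * 14 * 14, 2),                  # e.g. "face" vs. "not a face"
      )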

  • Why do convolutional neural networks usually need many layers?

    -Many layers let a CNN recognize complex objects and scenes. Each layer applies further convolutions to the previous layer's output, steadily increasing the complexity of what can be recognized; this is why the technique is considered deep learning.

  • What can face recognition technology in computer vision be used for?

    -Face recognition can be used for automatically unlocking smartphones, for governments tracking people with CCTV cameras, and for smart TVs, intelligent tutoring systems, and other applications that respond to gestures and emotion.

  • How does facial landmark tracking help computers understand human emotion?

    -Facial landmark tracking captures the geometry of the face, such as the distance between the eyes and the height of the forehead. That data can show whether the eyes are open, where the eyebrows sit, and what shape the mouth makes, from which an emotional state such as happy, sad, frustrated, or confused can be inferred.
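
    A minimal sketch of the idea that these cues reduce to distances between landmark points (the landmarks used and the 0.2 threshold are illustrative assumptions, not values from the video):

      import math

      def distance(p, q):
          """Euclidean distance between two (x, y) landmark points."""
          return math.hypot(p[0] - q[0], p[1] - q[1])

      def eye_is_open(upper_lid, lower_lid, left_corner, right_corner,
                      threshold=0.2):
          """Guess whether an eye is open from four landmark points.

          Lid separation is divided by eye width so the test does not
          depend on how large the face appears in the image.
          """
          openness = distance(upper_lid, lower_lid) / distance(left_corner,
                                                               right_corner)
          return openness > threshold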

  • What does tracking gestures and body language mean for computer vision?

    -Tracking gestures and body language lets computers interpret what a user is doing with their body and hands, opening the door to new interactive experiences, such as smart TVs and intelligent tutoring systems, that respond to gestures and emotion.

  • What are the latest advances in the field of computer vision?

    -At the hardware level, engineers are building better and better cameras; at the software level, more advanced algorithms crunch pixel data to recognize faces, gestures, and more. On top of that, researchers are building novel interactive experiences, such as smart TVs and intelligent tutoring systems, that respond to gestures and emotion.

Outlines

00:00

👀 Introduction to Computer Vision

Carrie Anne introduces how important computer vision is: it helps us prepare food, walk around obstacles, read street signs, and more. The goal of computer vision is to give computers the ability to extract high-level understanding from digital images and videos. Although computers are excellent at capturing images, taking pictures is not the same as truly "seeing". The video also explains how images are stored on computers as grids of pixels, each pixel defined by a combination of red, green, and blue. She then walks through a simple algorithm for tracking a colored object (such as a pink ball): record the ball's color and find the pixel that best matches it. The limitations of color tracking under changing lighting, shadows, and similar effects are also discussed.

05:00

🔍 Edge Detection and Convolution

The video goes on to explore how to detect features in an image, such as the edges of objects. The image is converted to grayscale, and an algorithm that looks for vertical edges illustrates how edges are made up of many pixels. A mathematical tool called a "kernel" or "filter" is introduced for finding them: the kernel is multiplied against a patch of pixels and the results are summed, an operation called a convolution. Examples show how different kernels, such as the Prewitt operators, enhance an image's edges, and how kernels can be designed to match particular shapes.

10:01

🧠 Convolutional Neural Networks and Face Recognition

Carrie Anne discusses how convolutional neural networks (CNNs) recognize features in images, such as faces. A CNN processes image data through many layers of neurons, each layer recognizing different features, from simple edges up to complex objects like faces. These networks learn their own kernels for picking out interesting features. She also explains how face recognition pinpoints facial landmarks, such as the tip of the nose and the corners of the mouth, and how that data is used to detect emotion and recognize individuals. Finally, she mentions progress in gesture and body-language recognition, and how these technologies are changing the way we interact with computers.

🌟 The Future of Computer Vision

The video closes by looking ahead: computers with human-like vision will totally change how we interact with them. Carrie Anne notes that it would also be nice if computers could hear and speak, previews next week's topic of computer hearing and speech, and invites viewers to tune in.

Keywords

💡Computer Vision

Computer vision is a branch of computer science that aims to give computers the ability to understand and interpret image and video data. In the video, it is described as a key field that lets computers extract high-level understanding from digital images and videos. For example, computer vision techniques can track an object of a particular color, detect edges and shapes in an image, and even detect and recognize faces.

💡Pixel

A pixel is the basic unit of an image, usually stored on a computer as part of a grid. Each pixel is defined by a combination of intensities of three primary colors (red, green, and blue), known as an RGB value. In the video, pixels are used to explain how recording the RGB value of a target color makes it possible to track an object in an image, such as a bright pink ball.

💡Convolutional Neural Network

A convolutional neural network (CNN) is a deep learning algorithm that uses layers of artificial neurons to process image data. In the video, CNNs are presented as the cutting-edge algorithms in computer vision today; they learn to recognize interesting features in images. By stacking convolutional layers, a CNN can recognize features of increasing complexity, from edges to simple shapes to whole objects.

💡Feature Detection

Feature detection is the process of identifying particular patterns or properties in an image. In the video, it describes how algorithms pick out edges, shapes, changes in color, and so on. For example, Prewitt operators enhance an image's edges, and specific kernels can detect lines or islands of pixels surrounded by contrasting colors.

💡Edge Detection

Edge detection is a fundamental computer vision technique for finding the boundaries between regions of an image. In the video, converting an image to grayscale and applying a specific kernel (such as a Prewitt operator) highlights the vertical edges in the image, which matters for applications like drone navigation.

💡Kernel/Filter

A kernel, or filter, is a mathematical tool for transforming images by applying specific weights across a set of pixels. In the video, kernels are used to perform convolutions that pick out edges and to sharpen or blur images. Kernels can be designed to be sensitive to particular shapes or features, enabling more complex image recognition tasks.

💡Face Recognition

Face recognition uses computer vision to detect and identify faces in images. In the video, it illustrates how detecting facial features (eyes, ears, mouth, nose) lets a face be recognized. Face recognition is also used for automatically unlocking smartphones and in government surveillance systems.

💡Emotion Recognition

Emotion recognition is an application of computer vision that lets computers infer a person's emotional state by analyzing facial expressions and body language. In the video, emotion recognition algorithms use facial landmark data to determine whether the eyes are open, where the eyebrows sit, and what shape the mouth makes, and from that infer whether someone is happy, sad, frustrated, or confused.

💡Biometric Identification

Biometric identification recognizes individuals by analyzing unique physiological or behavioral traits. In the video, facial geometry (such as the distance between the eyes and the height of the forehead) serves as biometric data that lets computers with cameras recognize and verify people, for example for face unlock on smartphones.

💡Gesture Recognition

Gesture recognition is a branch of computer vision that lets computers understand and interpret hand movements and gestures. The video mentions its use in smart TVs and intelligent tutoring systems, which can respond to hand gestures and emotion.

💡Deep Learning

Deep learning is a sub-field of machine learning that uses many-layered neural networks, loosely modeled on how the human brain processes information. In the video, it describes how convolutional neural networks recognize complex objects and scenes through layer upon layer of convolutions. Deep learning applications include image recognition, speech recognition, and other tasks that involve processing large amounts of data.

Highlights

The field of computer vision aims to give computers the ability to extract high-level understanding from digital images and videos.

Computers capture photos with greater fidelity and detail than humans can, but taking pictures is not the same as truly "seeing".

Images are typically stored on computers as grids of pixels, each pixel defined by a combination of red, green, and blue.

One of the simplest computer vision algorithms is tracking a colored object, such as a bright pink ball.

By recording the ball's RGB value, a program can find the pixel that most closely matches the target color.

The algorithm can be applied to every frame of a video, allowing an object to be tracked over time.

Because of lighting, shadows, and other effects, an object's color varies, which can make tracking poor.

Computer vision algorithms must consider patches of pixels to detect features larger than a single pixel, such as the edges of objects.

Using a kernel (or filter), an algorithm can define the likelihood that a pixel lies on a vertical edge.

Applying a kernel to a patch of pixels is an operation called a convolution.

Different kernels perform different image transformations, such as sharpening or blurring.

Convolutional neural networks (CNNs) use banks of neurons to process image data, each outputting a new image digested by a different learned kernel.

Through convolutions on convolutions, CNNs can recognize complex objects and scenes; this is why the technique is considered deep learning.

Recognition algorithms can be applied to faces, gestures, and body language, and from these infer emotional states.

Facial landmark tracking captures the geometry of a face, forming biometric data that can identify individuals.

Applications of computer vision are everywhere, from barcode scanning to self-driving cars to smart TVs.

Advances in computer vision let computers adapt intelligently to their surroundings, enabling context-aware interaction.

Computer vision is an active research area with breakthroughs every year, pointing toward computers with near-human sight.

Transcripts

play00:03

Hi, I’m Carrie Anne, and welcome to Crash Course Computer Science!

play00:05

Today, let’s start by thinking about how important vision can be.

play00:09

Most people rely on it to prepare food, walk around obstacles, read street signs, watch

play00:13

videos like this, and do hundreds of other tasks.

play00:16

Vision is the highest bandwidth sense, and it provides a firehose of information about

play00:20

the state of the world and how to act on it.

play00:22

For this reason, computer scientists have been trying to give computers vision for half

play00:26

a century, birthing the sub-field of computer vision.

play00:29

Its goal is to give computers the ability to extract high-level understanding from digital

play00:33

images and videos.

play00:35

As everyone with a digital camera or smartphone knows, computers are already really good at

play00:39

capturing photos with incredible fidelity and detail – much better than humans in fact.

play00:43

But as computer vision professor Fei-Fei Li recently said, “Just like to hear is

play00:48

not the same as to listen.

play00:49

To take pictures is not the same as to see.”

play00:52

INTRO

play01:01

As a refresher, images on computers are most often stored as big grids of pixels.

play01:05

Each pixel is defined by a color, stored as a combination of three additive primary colors:

play01:10

red, green and blue.

play01:11

By combining different intensities of these three colors, what’s called an RGB value,

play01:14

we can represent any color.

play01:17

Perhaps the simplest computer vision algorithm – and a good place to start – is to track

play01:21

a colored object, like a bright pink ball.

play01:23

The first thing we need to do is record the ball’s color.

play01:26

For that, we’ll take the RGB value of the centermost pixel.

play01:30

With that value saved, we can give a computer program an image, and ask it to find the pixel

play01:34

with the closest color match.

play01:35

An algorithm like this might start in the upper right corner, and check each pixel,

play01:39

one at a time, calculating the difference from our target color.

play01:43

Now, having looked at every pixel, the best match is very likely a pixel from our ball.

play01:47

We’re not limited to running this algorithm on a single photo; we can do it for every

play01:51

frame in a video, allowing us to track the ball over time.

play01:54

Of course, due to variations in lighting, shadows, and other effects, the ball on the

play01:58

field is almost certainly not going to be the exact same RGB value as our target color,

play02:03

but merely the closest match.

play02:04

In more extreme cases, like at a game at night, the tracking might be poor.

play02:08

And if one of the team's jerseys used the same color as the ball, our algorithm would

play02:12

get totally confused.

play02:13

For these reasons, color marker tracking and similar algorithms are rarely used, unless

play02:18

the environment can be tightly controlled.

play02:20

This color tracking example was able to search pixel-by-pixel, because colors are stored

play02:25

inside of single pixels.

play02:27

But this approach doesn’t work for features larger than a single pixel, like edges of

play02:31

objects, which are inherently made up of many pixels.

play02:34

To identify these types of features in images, computer vision algorithms have to consider

play02:38

small regions of pixels, called patches.

play02:40

As an example, let’s talk about an algorithm that finds vertical edges in a scene, let’s

play02:45

say to help a drone navigate safely through a field of obstacles.

play02:48

To keep things simple, we’re going to convert our image into grayscale, although most algorithms

play02:52

can handle color.

play02:53

Now let’s zoom into one of these poles to see what an edge looks like up close.

play02:57

We can easily see where the left edge of the pole starts, because there’s a change in

play03:01

color that persists across many pixels vertically.

play03:04

We can define this behavior more formally by creating a rule that says the likelihood

play03:08

of a pixel being a vertical edge is the magnitude of the difference in color between some pixels

play03:13

to its left and some pixels to its right.

play03:15

The bigger the color difference between these two sets of pixels, the more likely the pixel

play03:19

is on an edge.

play03:20

If the color difference is small, it’s probably not an edge at all.

play03:23

The mathematical notation for this operation looks like this – it’s called a kernel

play03:27

or filter.

play03:28

It contains the values for a pixel-wise multiplication, the sum of which is saved into the center pixel.

play03:33

Let’s see how this works for our example pixel.

play03:36

I’ve gone ahead and labeled all of the pixels with their grayscale values.

play03:39

Now, we take our kernel, and center it over our pixel of interest.

play03:43

This specifies what each pixel value underneath should be multiplied by.

play03:46

Then, we just add up all those numbers.

play03:49

In this example, that gives us 147.

play03:51

That becomes our new pixel value.

play03:54

This operation, of applying a kernel to a patch of pixels, is called a convolution.

play03:58

Now let’s apply our kernel to another pixel.

play04:00

In this case, the result is 1.

play04:02

Just 1.

play04:03

In other words, it’s a very small color difference, and not an edge.

play04:06

If we apply our kernel to every pixel in the photo, the result looks like this, where the

play04:10

highest pixel values are where there are strong vertical edges.

play04:13

Note that horizontal edges, like those platforms in the background, are almost invisible.

play04:18

If we wanted to highlight those features, we’d have to use a different kernel – one

play04:22

that’s sensitive to horizontal edges.

play04:23

Both of these edge enhancing kernels are called Prewitt Operators, named after their inventor.

play04:29

These are just two examples of a huge variety of kernels, able to perform many different

play04:33

image transformations.

play04:34

For example, here’s a kernel that sharpens images.

play04:37

And here’s a kernel that blurs them.

play04:39

Kernels can also be used like little image cookie cutters that match only certain shapes.

play04:43

So, our edge kernels looked for image patches with strong differences from right to left

play04:48

or up and down.

play04:49

But we could also make kernels that are good at finding lines, with edges on both sides.

play04:53

And even islands of pixels surrounded by contrasting colors.

play04:57

These types of kernels can begin to characterize simple shapes.

play05:00

For example, on faces, the bridge of the nose tends to be brighter than the sides of the

play05:04

nose, resulting in higher values for line-sensitive kernels.

play05:08

Eyes are also distinctive – a dark circle surrounded by lighter pixels – a pattern other

play05:12

kernels are sensitive to.

play05:14

When a computer scans through an image, most often by sliding around a search window, it

play05:18

can look for combinations of features indicative of a human face.

play05:22

Although each kernel is a weak face detector by itself, combined, they can be quite accurate.

play05:26

It’s unlikely that a bunch of face-like features will cluster together if they’re

play05:30

not a face.

play05:31

This was the basis of an early and influential algorithm called Viola-Jones Face Detection.

play05:35

Today, the hot new algorithms on the block are Convolutional Neural Networks.

play05:40

We talked about neural nets last episode, if you need a primer.

play05:42

In short, an artificial neuron – which is the building block of a neural network – takes

play05:47

a series of inputs, and multiplies each by a specified weight, and then sums those values

play05:52

all together.

play05:53

This should sound vaguely familiar, because it’s a lot like a convolution.

play05:56

In fact, if we pass a neuron 2D pixel data, rather than a one-dimensional list of inputs,

play06:01

it’s exactly like a convolution.

play06:03

The input weights are equivalent to kernel values, but unlike a predefined kernel, neural

play06:08

networks can learn their own useful kernels that are able to recognize interesting features

play06:12

in images.

play06:13

Convolutional Neural Networks use banks of these neurons to process image data, each

play06:17

outputting a new image, essentially digested by different learned kernels.

play06:21

These outputs are then processed by subsequent layers of neurons, allowing for convolutions

play06:26

on convolutions on convolutions.

play06:28

The very first convolutional layer might find things like edges, as that’s what a single

play06:32

convolution can recognize, as we’ve already discussed.

play06:35

The next layer might have neurons that convolve on those edge features to recognize simple

play06:39

shapes, comprised of edges, like corners.

play06:42

A layer beyond that might convolve on those corner features, and contain neurons that

play06:46

can recognize simple objects, like mouths and eyebrows.

play06:49

And this keeps going, building up in complexity, until there’s a layer that does a convolution

play06:54

that puts it together: eyes, ears, mouth, nose, the whole nine yards, and says “ah

play06:58

ha, it’s a face!”

play06:59

Convolutional neural networks aren’t required to be many layers deep, but they usually are,

play07:04

in order to recognize complex objects and scenes.

play07:07

That’s why the technique is considered deep learning.

play07:09

Both Viola-Jones and Convolutional Neural Networks can be applied to many image recognition

play07:14

problems, beyond faces, like recognizing handwritten text, spotting tumors in CT scans and monitoring

play07:19

traffic flow on roads.

play07:20

But we’re going to stick with faces.

play07:22

Regardless of what algorithm was used, once we’ve isolated a face in a photo, we can

play07:26

apply more specialized computer vision algorithms to pinpoint facial landmarks, like the tip

play07:31

of the nose and corners of the mouth.

play07:33

This data can be used for determining things like if the eyes are open, which is pretty

play07:38

easy once you have the landmarks – it’s just the distance between points.

play07:41

We can also track the position of the eyebrows; their relative position to the eyes can be

play07:45

an indicator of surprise, or delight.

play07:47

Smiles are also pretty straightforward to detect based on the shape of mouth landmarks.

play07:52

All of this information can be interpreted by emotion recognition algorithms, giving

play07:57

computers the ability to infer when you’re happy, sad, frustrated, confused and so on.

play08:02

In turn, that could allow computers to intelligently adapt their behavior... maybe offer tips when

play08:07

you’re confused, and not ask to install updates when you’re frustrated.

play08:11

This is just one example of how vision can give computers the ability to be context sensitive,

play08:16

that is, aware of their surroundings.

play08:18

And not just the physical surroundings – like if you're at work or on a train – but also

play08:21

your social surroundings – like if you’re in a formal business meeting versus a friend’s

play08:26

birthday party.

play08:27

You behave differently in those surroundings, and so should computing devices, if they’re smart.

play08:32

Facial landmarks also capture the geometry of your face, like the distance between your

play08:36

eyes and the height of your forehead.

play08:38

This is one form of biometric data, and it allows computers with cameras to recognize

play08:42

you.

play08:43

Whether it’s your smartphone automatically unlocking itself when it sees you, or governments

play08:47

tracking people using CCTV cameras, the applications of face recognition seem limitless.

play08:52

There have also been recent breakthroughs in landmark tracking for hands and whole bodies,

play08:56

giving computers the ability to interpret a user’s body language, and what hand gestures

play09:00

they’re frantically waving at their internet connected microwave.

play09:03

As we’ve talked about many times in this series, abstraction is the key to building

play09:07

complex systems, and the same is true in computer vision.

play09:10

At the hardware level, you have engineers building better and better cameras, giving

play09:14

computers improved sight with each passing year, which I can’t say for myself.

play09:18

Using that camera data, you have computer vision algorithms crunching pixels to find

play09:23

things like faces and hands.

play09:25

And then, using output from those algorithms, you have even more specialized algorithms

play09:29

for interpreting things like user facial expression and hand gestures.

play09:32

On top of that, there are people building novel interactive experiences, like smart

play09:37

TVs and intelligent tutoring systems, that respond to hand gestures and emotion.

play09:41

Each of these levels are active areas of research, with breakthroughs happening every year.

play09:46

And that’s just the tip of the iceberg.

play09:47

Today, computer vision is everywhere – whether it’s barcodes being scanned at stores, self-driving

play09:52

cars waiting at red lights, or Snapchat filters superimposing mustaches.

play09:56

And, the most exciting thing is that computer scientists are really just getting started,

play10:01

enabled by recent advances in computing, like super fast GPUs.

play10:05

Computers with human-like ability to see is going to totally change how we interact with them.

play10:10

Of course, it’d also be nice if they could hear and speak, which we’ll discuss next

play10:14

week.

play10:15

I’ll see you then.


Related Tags
Computer Vision, Image Processing, Deep Learning, Convolutional Neural Networks, Face Recognition, Emotion Analysis, Biometrics, Interactive Experiences, Smart Devices, Algorithm Fundamentals, Technical Breakthroughs