Visualizing CNNs

NPTEL-NOC IITM
19 Aug 2024 · 25:34

Summary

TL;DR: This lecture delves into visualization methods for understanding convolutional neural networks (CNNs), focusing on the analysis of kernels, filters, and activations. It discusses how the first layers of various CNN models detect edges and textures, resembling Gabor filters, and how higher layers capture more abstract features. Techniques such as dimensionality reduction on feature vectors, neuron activation visualization, and occlusion experiments are explored to reveal the inner workings of CNNs, providing insights into their feature learning and decision-making processes.

Takeaways

  • 🔍 The lecture discusses different visualization methods for understanding how Convolutional Neural Networks (CNNs) process images, focusing on the filters, activations, and representations within the network.
  • 👀 Visualizing the filters or kernels in the first convolutional layer of CNNs like AlexNet reveals that they tend to capture oriented edges, color-based edges, and higher-order variations, which are similar across various models like ResNet, DenseNet, and VGG.
  • 🌟 Filters in the first layer of CNNs are often Gabor-like, detecting edges and textures in various orientations and colors, which is consistent across different models and datasets.
  • 📈 Higher layers of CNNs are more challenging to visualize due to the complexity and variety of features they learn, which may not be as interpretable as the first layer's edge detection.
  • 📊 Dimensionality reduction techniques like PCA and t-SNE can be applied to the output of the fully connected layer (e.g., FC7 in AlexNet) to visualize the representation space learned by the CNN, showing that different classes are well-separated.
  • 📝 The penultimate layer's feature vectors from CNNs like AlexNet can capture semantic information about images, with similar objects grouping together in the reduced dimensional space.
  • 👨‍🏫 The script references the work of Zeiler and Fergus, which is foundational in visualizing and understanding what CNNs learn from image data.
  • 🔬 Occlusion experiments involve covering parts of an image to see how it affects the CNN's prediction, providing insights into which parts of the image the model relies on for classification.
  • 🤖 The script suggests that each neuron in a CNN learns to fire for specific features or artifacts in the images, contributing to the model's overall understanding and classification capabilities.
  • 📚 The lecture recommends further reading and resources, including lecture notes from CS231n at Stanford and a deep visualization toolkit demo by Jason Yosinski, for a deeper understanding of CNN visualization techniques.
  • 🛠️ The methods covered in the lecture are 'don't disturb the model' approaches, meaning they utilize the trained model without altering it to gain insights into its learned representations and decision-making process.

Q & A

  • What is the primary focus of the lecture on visualization methods in CNNs?

    -The lecture focuses on visualizing different kernels or filters in a CNN, activations in a particular layer, and other methods such as understanding what the CNN has learned through various visualization techniques.

  • How many filters does the first convolutional layer of AlexNet have, and what is their size?

    -The first convolutional layer of AlexNet has 96 filters, each with a size of 11x11.

  • What is the 'Gabor-like filter fatigue', and why is it called that?

    -'Gabor-like filter fatigue' refers to the observation that the first convolutional layer of virtually every CNN model, trained on different datasets, learns very similar Gabor-like filters that detect edges and simple patterns in much the same way; 'fatigue' here simply means that the same filters keep showing up across models.

  • What is the purpose of visualizing the filters of higher layers in CNNs?

    -Visualizing filters of higher layers can help understand the features that the CNN has learned. However, it is generally not as interesting or interpretable as visualizing the first layer, especially in datasets with a wide variety of classes.

  • What is the role of the penultimate layer (e.g., fc7 in AlexNet) in CNNs?

    -The penultimate layer, such as fc7 in AlexNet, provides a high-dimensional (4096-dimensional in AlexNet) representation of the input image. This layer's output feeds the classifier, and visualizing these representations helps show how well different classes are separated in the feature space.

  • How can one visualize the high-dimensional space of feature vectors from a CNN's penultimate layer?

    -Dimensionality reduction techniques such as Principal Component Analysis (PCA) or t-SNE can be used to project the high-dimensional feature vectors into a two-dimensional space for visualization.

  • What does the visualization of the first convolutional layer's filters across different models and datasets suggest about the initial learning of CNNs?

    -The visualization suggests that the first layer of CNNs learns to detect low-level image features such as edges, color gradients, and textures, which is similar across different models and datasets.

  • What is the significance of visualizing which images maximally activate a particular neuron in a CNN?

    -This visualization helps in understanding what specific features or artifacts in the images are being captured by individual neurons, providing insights into the learning process of the CNN.

  • What are occlusion experiments in the context of CNN visualization?

    -Occlusion experiments involve covering parts of an image and observing the effect on the CNN's predictions. This method helps identify which parts of the image are most critical for the CNN's decision-making process.

  • How do occlusion experiments help in understanding the CNN's focus during image classification?

    -By occluding different parts of an image and observing changes in the predicted probability, occlusion experiments reveal which pixels or regions the CNN relies on to make its classification, indicating its focus area.

  • What is the recommended approach for further understanding of the visualization methods discussed in the lecture?

    -The lecture recommends reviewing the lecture notes of CS231n, exploring a deep visualization toolkit demo video by Jason Yosinski, and studying more about t-SNE as a dimensionality reduction technique through the provided links.

Outlines

00:00

🔍 Visualizing CNN Kernels and Filters

This paragraph delves into the visualization methods of kernels or filters within Convolutional Neural Networks (CNNs), focusing on the initial layers where filters typically capture basic image features like edges and colors. The discussion references AlexNet, highlighting its first convolutional layer's filters and how they are visualized in a grid format. It's noted that these filters are not unique to AlexNet, as similar structures are found in other models like ResNet and DenseNet, which learn to detect edges, patterns, and color variations without being pre-programmed. The paragraph emphasizes the self-learning capability of CNNs in discerning low-level image features.
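
For readers who want to reproduce this, a minimal sketch of such a filter grid using a pretrained torchvision AlexNet is given below. Note that the torchvision model is the 64-filter variant mentioned above; the layer index and plotting details are assumptions about the standard torchvision layout, not something specified in the lecture.

```python
# Sketch: visualize the first-layer filters of a pretrained AlexNet as an image grid.
import torchvision
import matplotlib.pyplot as plt

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()

# In torchvision's AlexNet, features[0] is the first conv layer: Conv2d(3, 64, kernel_size=11, stride=4).
filters = model.features[0].weight.data.clone()        # shape (64, 3, 11, 11)

# Rescale every filter independently to [0, 1] so it can be shown as an RGB image.
filters -= filters.amin(dim=(1, 2, 3), keepdim=True)
filters /= filters.amax(dim=(1, 2, 3), keepdim=True)

grid = torchvision.utils.make_grid(filters, nrow=8, padding=1)   # 8 x 8 grid of 11x11 RGB tiles
plt.imshow(grid.permute(1, 2, 0))                                # CHW -> HWC for matplotlib
plt.axis("off")
plt.show()
```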

05:02

🌐 Higher Layer Kernels and Representation Space Visualization

The second paragraph explores visualizing kernels in higher layers of CNNs, contrasting the general lack of interpretability in these layers with the more straightforward visualizations of the first layer. It mentions that higher layer filters can sometimes be understood in the context of specific applications but are less informative for broader datasets like ImageNet. The paragraph then introduces the concept of visualizing the representation space learned by CNNs, such as using PCA or t-SNE to reduce the dimensionality of feature vectors from layers like FC7 in AlexNet, allowing for the visualization and understanding of class separation in datasets like MNIST and ImageNet.
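
A rough sketch of that pipeline follows, using a forward hook to collect fc7 activations and scikit-learn for the projection. The hook target `model.classifier[5]` and the `loader` variable are assumptions about a standard torchvision AlexNet and a preprocessed dataset; they are not from the lecture.

```python
# Sketch: project fc7 features of a labelled dataset to 2-D with PCA / t-SNE.
import torch
import torchvision
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()

feats, labels = [], []
def grab(module, inputs, output):                  # hook on the ReLU after the second 4096-d Linear (fc7)
    feats.append(output.detach().cpu())

handle = model.classifier[5].register_forward_hook(grab)
with torch.no_grad():
    for images, y in loader:                       # `loader` yields preprocessed (images, labels) batches
        model(images)
        labels.append(y)
handle.remove()

X = torch.cat(feats).numpy()                       # (N, 4096) fc7 feature matrix
y = torch.cat(labels).numpy()

X_pca = PCA(n_components=2).fit_transform(X)       # fast linear projection
X_tsne = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(X)
# Scatter-plot either projection coloured by y to see how well the classes separate.
```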

10:03

📊 Understanding CNN Representations Through Feature Maps

This section examines the visualization of feature maps in CNNs, particularly in AlexNet, to understand how the network captures higher-level semantics of objects in images. It describes the process of visualizing the CONV5 feature maps, which are 128 in number and each 13x13 pixels, to observe how certain filters respond to the presence of specific entities like people or dogs in the input image. The paragraph also touches on the idea of investigating individual neurons in intermediate layers to see which images maximally activate them, providing insights into the diverse features learned by the CNN.
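
A minimal sketch of how such intermediate feature maps can be captured with a forward hook is shown below. The layer index `features[10]` for conv5 and the `image` tensor are assumptions about the standard torchvision AlexNet, which exposes 256 conv5 maps rather than the 128-per-GPU split described in the lecture.

```python
# Sketch: capture and display conv5 feature maps for one input image.
import torch
import torchvision
import matplotlib.pyplot as plt

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()

maps = {}
hook = model.features[10].register_forward_hook(
    lambda m, inp, out: maps.setdefault("conv5", out.detach()))   # features[10] is the 5th conv layer

with torch.no_grad():
    model(image.unsqueeze(0))        # `image`: a preprocessed (3, 224, 224) tensor (assumption)
hook.remove()

fmaps = maps["conv5"][0]             # (256, 13, 13) in the single-branch torchvision variant
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, fm in zip(axes.flat, fmaps[:64]):
    ax.imshow(fm.numpy(), cmap="gray")   # each 13x13 feature map shown as a grayscale image
    ax.axis("off")
plt.show()
```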

15:05

👀 Neuron Activation and Receptive Field Analysis

The fourth paragraph discusses the method of understanding what specific neurons in a CNN respond to by monitoring their activation in response to various images. It explains how one can trace back the receptive fields of these neurons through the layers of the network to identify the regions in the original image that lead to their activation. The summary illustrates how different neurons may become specialized in detecting certain features like human busts, dog faces, or specular reflections, and relates this to the concept of dropout, ensuring a diverse learning across neurons.
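
A rough sketch of that bookkeeping is given below: it records, for one arbitrarily chosen layer and channel, the images that produce the strongest response. The `layer`, `channel`, and `loader` arguments are placeholders, not details from the lecture.

```python
# Sketch: find which images maximally activate one channel of a chosen conv layer.
import heapq
import itertools
import torch

def top_activating_images(model, layer, channel, loader, k=9):
    """Return the k images whose peak response in `channel` of `layer` is largest."""
    model.eval()
    acts = []
    hook = layer.register_forward_hook(
        lambda m, inp, out: acts.append(out[:, channel].amax(dim=(1, 2))))  # peak response per image
    best, tie = [], itertools.count()            # min-heap of (activation, tiebreak, image)
    with torch.no_grad():
        for images in loader:                    # `loader` yields preprocessed image batches (assumption)
            acts.clear()
            model(images)
            for img, a in zip(images, acts[0]):
                heapq.heappush(best, (a.item(), next(tie), img))
                if len(best) > k:
                    heapq.heappop(best)
    hook.remove()
    return [img for _, _, img in sorted(best, reverse=True)]   # strongest activations first
```

Tracing the location of the winning activation back through the receptive fields of the preceding layers, as described above, would then recover the image patch responsible for it.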

20:06

🚫 Occlusion Experiments for Model Interpretability

This paragraph introduces occlusion experiments as a method for interpreting CNN models by understanding which parts of an image are critical for the model's predictions. It describes the process of covering different parts of an image with a gray patch and observing the effect on the predicted probability of the correct class. The summary explains how this method can reveal whether the model is focusing on the correct part of the image for its predictions, using examples where occluding parts of the image leads to a drop in the probability of the correct label, indicating the model's reliance on those areas.
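
A minimal sketch of the sliding-patch procedure is shown below; the patch size, stride, and gray fill value are arbitrary choices, since the lecture does not specify them.

```python
# Sketch: occlusion heatmap for one image and one target class.
# `model` is a trained classifier, `image` a preprocessed (3, H, W) tensor,
# `target` the index of the correct class. Patch size / stride are arbitrary.
import torch
import torch.nn.functional as F

def occlusion_heatmap(model, image, target, patch=32, stride=16, fill=0.5):
    model.eval()
    _, H, W = image.shape
    heat = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    with torch.no_grad():
        for i, top in enumerate(range(0, H - patch + 1, stride)):
            for j, left in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = fill   # gray out one patch
                prob = F.softmax(model(occluded.unsqueeze(0)), dim=1)[0, target]
                heat[i, j] = prob        # low value => this region mattered for the prediction
    return heat
```

Plotting the returned grid as a heatmap (low values where occlusion hurts the prediction) reproduces the kind of visualization described above.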

25:07

📚 Recommended Readings and Visualization Tools

The final paragraph provides recommendations for further understanding CNN visualization techniques. It suggests reading the lecture notes from CS231n and exploring a deep visualization toolkit demo by Jason Yosinski. Additionally, it mentions resources for learning more about t-SNE as a dimensionality reduction technique and encourages exploring these tools and methods to gain a deeper understanding of CNNs and their learned representations.

Keywords

💡CNN

CNN stands for Convolutional Neural Network, which is a type of deep learning model widely used in image recognition tasks. In the video, CNNs are the central theme, with discussions on how they learn to visualize different features from images through various layers. For example, the script mentions AlexNet, a specific type of CNN, and how its first convolutional layer uses filters to detect edges.

💡Filters or Kernels

In the context of CNNs, filters or kernels are small windows that slide over the input image to apply convolution operations. The script emphasizes visualizing these filters, showing how they capture features like edges or colors in the initial layers of CNNs such as AlexNet.

💡Activations

Activations refer to the output of neurons in a neural network after a non-linear activation function is applied. The script discusses visualizing activations in a CNN to understand what features are being learned at different layers of the network.

💡Gabor Filters

Gabor filters are used in the script to describe the type of filters learned in the first convolutional layer of various CNN models. They are filters that detect edges in various orientations and are likened to the filters learned by CNNs without any pre-specification.
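
As a quick illustration of that 'Gaussian times sinusoid' structure, a 2-D Gabor kernel can be built directly; the size, scale, wavelength, and orientation below are arbitrary illustrative values.

```python
# Sketch: build a 2-D Gabor kernel as a Gaussian envelope multiplied by an oriented sinusoid.
import numpy as np

def gabor_kernel(size=11, sigma=3.0, wavelength=6.0, theta=np.pi / 4):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)                 # rotate coordinates by the orientation
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))      # Gaussian envelope
    carrier = np.cos(2 * np.pi * x_rot / wavelength)              # oriented sinusoid
    return envelope * carrier                                     # 11x11 edge-like filter

kernel = gabor_kernel()   # plotting this looks much like the learned first-layer filters
```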

💡Feature Maps

Feature maps are the outputs of convolutional layers in a CNN, representing different features detected in the input image. The script provides an example of how the feature maps in AlexNet's 'conv5' layer capture higher-level semantics of objects in images.

💡Dimensionality Reduction

Dimensionality reduction is a technique used to visualize high-dimensional data by reducing it to two or three dimensions. The script describes using PCA and t-SNE for visualizing the space of feature vectors from a CNN's fully connected layer, helping to understand how different classes of images are separated.

💡t-SNE

t-SNE stands for t-Distributed Stochastic Neighbor Embedding, a method for dimensionality reduction used in the script to visualize complex data sets like MNIST and ImageNet. It helps in creating a two-dimensional representation where similar items cluster together.

💡Occlusion Experiments

Occlusion experiments involve covering parts of an image to determine which areas are most important for the CNN's classification decision. The script describes using this method to understand which pixels in an image the CNN focuses on to make its prediction.

💡Receptive Field

The receptive field of a neuron in a CNN is the region of the input image that the neuron responds to. The script explains how understanding the receptive field can help visualize which parts of the image are most influential for a particular neuron's activation.
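
As a concrete illustration, the receptive field of a stack of convolution or pooling layers can be computed with the standard recurrence r_out = r_in + (k - 1) * j_in and j_out = j_in * s, where k is the kernel size, s the stride, and j the effective stride at the input. The three-layer stack below is made up for illustration and is not AlexNet's configuration.

```python
# Sketch: receptive-field size of a stack of conv/pool layers, each given as (kernel_size, stride).
def receptive_field(layers):
    r, j = 1, 1                     # receptive field and jump (effective stride) at the input
    for k, s in layers:
        r = r + (k - 1) * j         # each layer widens the field by (k - 1) input-jump steps
        j = j * s
    return r

print(receptive_field([(11, 4), (3, 2), (5, 1)]))   # e.g. conv 11/4 -> pool 3/2 -> conv 5/1 => 51
```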

💡Semantics

In the context of the script, semantics refers to the meaning or the higher-level understanding that CNNs can derive from images. It is mentioned that the representations from the penultimate layer of a CNN capture the semantic nature of images, grouping similar objects together.

💡Dropout

Dropout is a regularization technique used in neural networks to prevent overfitting by randomly deactivating a subset of neurons during training. The script connects the idea of neurons learning diverse features with the concept of dropout, ensuring that no single feature dominates the learning process.

Highlights

Introduction to visualization methods for kernels, filters, and activations in CNNs.

Lecture slides based on CS231n at Stanford and inspired by the lectures of Mitesh Khapra at IIT Madras.

Visualization of filters or kernels in CNNs, starting with the simplest form.

Filters in CNNs capture oriented edges and higher-order variations like checkerboard patterns.

Filters in the first convolutional layer of various models like AlexNet, ResNet, and DenseNet show similar structures.

Filters in higher layers are less interpretable due to the variety of classes and abstractions.

First convolutional layer filters across models and datasets exhibit Gabor-like filter characteristics.

Visualization of the representation space learned by CNNs using dimensionality reduction methods.

t-SNE is a powerful dimensionality reduction technique for visualizing high-dimensional data.

Visualization of feature vectors from the penultimate layer shows class separation in 2D space.

Embeddings from the penultimate layer capture semantic nature of images, grouping similar objects together.

Visualization of Convolutional feature maps in AlexNet reveals higher-level semantics captured by later layers.

Understanding what specific neurons in CNNs respond to by monitoring their activation across images.

Occlusion experiments to determine which parts of an image are most relevant for CNN's predictions.

Heatmaps from occlusion experiments indicate the importance of specific image regions for classification.

Summary of 'Don't Disturb the Model' methods for understanding what a CNN has learned without altering the model.

Recommended readings and resources for further understanding of CNN visualization techniques.

Transcripts

[00:02] [Music]

[00:14] We will begin this lecture on visualization methods for different kernels or filters in a CNN, or perhaps even activations in a particular layer of a CNN, or even other methods that we will see later in this lecture. Most of these lecture slides are based on Lecture 13 of CS231n at Stanford, and some of the content is borrowed from the excellent lectures of Mitesh Khapra at IIT Madras.

[00:48] Let's start with the simplest form of visualization, which is visualizing the filters or kernels themselves. Remember that when you have a CNN, every convolutional layer has a certain number of filters. For example, recall that in AlexNet the first convolutional layer had 11x11 filters; it actually had 96 of them, 48 going to one GPU and 48 going through the other GPU, if you recall the AlexNet architecture. In this particular slide we are looking at a variant of AlexNet that was developed by Alex Krizhevsky a little later, in 2014, when he came up with a method to parallelize CNNs; this is just an example chosen to make the visualization easier. In this variant of AlexNet, the architecture had 64 filters in the first convolutional layer. What you see on the top left here is those 64 filters, each 11x11 in size, and each with three channels: the R channel, G channel and B channel, the three colors. We can visualize each of them on a grid such as this. Remember that a filter is an image in its own right; just as convolution is commutative, you can always choose to look at an image as a filter or a filter as an image, and any matrix of the size of the filter can be plotted as an image. When you do it that way, you get something like what you see on the top right. Let's look at some of them more carefully. If you visualize some of these filters closely, you see that there are filters that try to capture oriented edges: this one on the bottom row, fourth from the left, looks like a Gaussian edge detector smoothed along a certain orientation. Similarly, you have another edge detector here, and another edge detector on top. You also have some that capture slightly higher-order variations, such as a checkerboard kind of pattern or a series of striations, and so on. You also have color-based edge detectors: in the last filter on the bottom right you see an edge detector that goes from green to a pinkish or red color, and you see similar color-based filters on the top left as well.

[03:56] Is this a characteristic of AlexNet alone? Not really. If you took the filters of ResNet-18, ResNet-101 or DenseNet-121, in each of these the filters in the first convolutional layer have a very similar structure: all of them detect edges of different orientations, certain higher-order interactions such as checkerboard patterns and striations in different orientations, color blobs, and certain color gradations, as in edges in different colors, and so forth. You will see this as part of this week's assignment, where you try out some of these experiments. This tells us that the first layer seems to be acting like low-level image processing: edge detection, blob detection, maybe checkerboard detection, and so on. Remember that these are filters that were completely learned by a neural network which we did not prime in any way.

[05:03] You can also visualize the kernels of higher layers, just as we did for the first convolutional layer: you could take all the filters of the second convolutional layer, the third convolutional layer, and so on. But it happens that, if you try to generalize across applications, they are not that interesting. We saw an example last week where we took face images and showed that filters in the first layer correspond to low-level image features, the middle layers extract noses and eyes and so on, and the later layers extract face-level information. That does happen in certain applications; if you focused only on faces or a smaller group of objects, you could perhaps make sense of the higher layers' filters. But in a more general context such as ImageNet, which has a thousand classes in the dataset, these kinds of visualizations of the filters of higher layers are not that interesting. Here are some examples. Remember that in a CNN the weights are the filters themselves, and if you look at the weights in a later layer you see that they may not be that informative for understanding what the CNN is actually learning. That is because the variety of classes results in various abstractions across the dataset. The input to the higher layers is no longer the images we understand at the input layer: we know what we are providing as input to the network, but when you go to higher layers you really don't know what is being provided as input, so it becomes a little more difficult to understand what is happening.

[06:55] However, if you take the filters of the first layer alone across various models and datasets (by now you should be familiar with the various CNN models such as AlexNet, ResNet, DenseNet, VGG and so on), you get very similar kinds of filters, and this is generally called the Gabor-like filter fatigue. Why is that so? Recall the Gabor filter discussion we had earlier in the course, where we said a Gabor filter is like a combination of a Gaussian and a sinusoid; you can change the scale and the orientation of the Gabor filter and end up detecting edges in different orientations, and perhaps different striations, checkerboard patterns and so forth, which is exactly what we see the CNN learning on its own, too. That is why we call this entire visualization of the filters of the first convolutional layer a Gabor-like filter fatigue: by fatigue we just mean it is essentially the same across all of these models and datasets.

[08:18] Another option, other than visualizing the filters in different layers (when we talk about visualizing the filters, remember it is 11x11 or 7x7 or whatever the size of the filter may be; you simply have to plot it as an image), is to visualize the representation space learned by the CNN. What do we mean? If you took AlexNet, remember that the output of fc7, the fully connected layer at the seventh position in the depth of the network, is a 4096-dimensional vector; that is the layer immediately before the classifier. What we can do is take all the images in your test set, or validation set for that matter, forward-propagate those images up to this particular layer, and collect all of these 4096-dimensional vectors. What do we do with them? You can now visualize the space of these fc7 feature vectors by reducing the dimensionality from 4096 to any dimension of your choice; for simplicity, let's say two dimensions. How do we do this? Hopefully you have done a basic machine learning course and know that you can use any dimensionality reduction method. A simple example could be principal component analysis: you take all of those 4096-dimensional vectors for several input images and do a PCA on top of them to bring all of them into a two-dimensional space. A more popular approach, which is considered a very strong dimensionality reduction method, is t-SNE, which stands for t-distributed Stochastic Neighbor Embedding. This method was developed by Hinton along with van der Maaten in 2008; we also have a link for this towards the end of this lecture, so you can play around with t-SNE if you would like to understand it more.

[10:30] When you apply t-SNE to the representations that you get as the output of the CNN's penultimate layer, you end up with a result such as this. This one is specifically for the MNIST dataset, the handwritten digit dataset, where you have 10 classes. You see here that each class invariably goes to its own cluster. While we cannot in reality visualize a 4096-dimensional space, by bringing it down to two dimensions we understand that the representations belonging to different classes are fairly well separated into different clusters. And why is that important? Developing a classification algorithm on these representations now becomes an easy task, and that is why having a classification layer right after that penultimate layer of representations makes the entire CNN work well as a classifier.

[11:32] Here is an example of the same for ImageNet. This is a two-dimensional plot of various images in the ImageNet dataset, taken to the 4096-dimensional space by AlexNet, then brought down to two dimensions and plotted on a two-dimensional map; the only thing we are doing here is putting the respective image at each location, just to understand what is really happening. This is a huge map, and if you zoom in on one particular part of it, you see that all the images corresponding to, say, a field seem to come together in this space of representations. For that matter, if you scroll around and look at other parts of it, you will see at many points of these embeddings that very similar objects are grouped together; you see all cars somewhere here, and so on. This tells us that these embeddings, or representations, that you get out of the penultimate layer actually capture the semantic nature of these images: objects of similar semantics are grouped together, while objects of different semantics are far apart from each other. So this gives us an understanding that the CNN's representations seem to be capturing the semantics. Keep in mind what we said when we talked about handcrafted features and learned representations; this is exactly what we were talking about. With handcrafted features such as SIFT, HOG or LBP, you have to decide what may be useful for a given application and then hand-design the filter you want to use as a representation of the image, after which you may apply a machine learning algorithm. But now we are letting the neural network, the CNN in particular, automatically learn the representations it needs to solve a particular task.

[13:51] Here is a visualization of the conv5 feature maps in AlexNet. The conv5 feature map is 128 x 13 x 13, so there are 128 feature maps, each 13x13. If you visualize them as grayscale images, you can see something interesting: when this specific image with two people is given as input, one of the filters (actually quite a few of them, in fact) seems to capture the fact that there are two entities in the image. This could give you a hint that the later layers in the CNN are able to capture these higher-level semantics of the objects in the images.

[14:38] Another way of visualizing and understanding a CNN is to extend the same thought and consider a trained CNN. Remember that all of this analysis is for a trained CNN; after training the CNN you want to understand what it has learned, and that is the context in which we are talking about this. Consider any single neuron in its intermediate layers, say the particular one shown in green. You can now try to visualize which images cause that particular neuron to fire the most: give different images as input to the CNN, keep monitoring that particular neuron, and see which image makes it fire the most. What can we do with that? We can work backwards and understand that this particular pixel has a certain receptive field in the previous convolutional layer; that is the region that led to this pixel being formed in this particular convolutional layer. Similarly, you can take the pixels in the previous layer, look at the receptive field of each of them, and find the receptive field in the earlier layer, in this case the first convolutional layer. You can go further again and find the receptive field in the original image that was responsible for this pixel in the third convolutional layer. Remember, we also discussed this when we talked about backpropagation through a CNN, where we tried to understand the receptive field that leads to a particular pixel getting affected in a particular layer; it is the same principle here.

[16:21] Now, if we take images and try to understand which of them caused a particular neuron to fire, we end up seeing several patterns. In this set of images, each row corresponds to one particular neuron that fired, together with the set of images, and the region inside them shown as a white bounding box, which caused that neuron to fire. In the first row, interestingly, all of those images correspond to people, especially busts of people; it looks like that particular neuron was capturing people up to the bust, up to the chest. The second neuron here seems to be capturing different dogs, and maybe some honeycomb kind of pattern, perhaps even a US flag where it thought a honeycomb-like pattern was present. A third neuron captures a certain red blob across all of the images. A fourth neuron captures digits in images. And if you look at the last neuron, in the sixth row, it seems to be capturing specular reflections in all of the images. So over time, as you train these CNNs, each neuron gets fired for certain artifacts in the images. This should probably help you go back and connect to dropout, where we try to ensure that no particular neuron or weight overfits to the training data, and we allow all neurons to learn a diverse set of artifacts in images. Here are further examples of the same idea, where you take different neurons in a CNN and try to see which images or patches of images fired that neuron the most. Once again you see a fairly consistent trend: some fire for what I think is the eye of an animal, some for text in images, some for vertical text in images, some for faces, some for dog faces again, and so on.

[18:48] The last method that we will talk about in this lecture is what are known as occlusion experiments, which attempt to address our objective of finally understanding which pixels in an image corresponded to the object recognized by the CNN. Why does this matter? We would like to know whether the CNN looked at the cat in the image while giving the label 'cat' for the image, or whether it looked at a building in the background or the grass on the ground. Remember, a neural network learns correlations between the various pixels present in your dataset in order to give good classification performance. So if all of the cats in your dataset were found only on grass, a neural network could assume that the presence of grass means the presence of a cat. Obviously, if you have a test set where a cat appears against a different background, the neural network may then not be able to predict it as a cat. To be able to get that kind of trust in the model, namely that the model was indeed looking at the cat while calling it a cat, the occlusion experiments use a specific methodology.

[20:14] Given the images that you see here, we occlude different patches in the image, centered on each pixel, and observe the effect on the predicted probability of the correct class. Let's take an example: you can see a gray patch here on the image. You occlude that part of the image, fill it with gray, and send the whole image as input to the CNN; you get a particular probability for the correct label, in this case Pomeranian, and that probability is plotted at that particular location. Similarly, you gray out a patch here, send the full image as input to the CNN, and get a probability for Pomeranian. How do you get the probability? As the output of the softmax activation function; that probability value is plotted here in this image. By doing this, by moving your gray patch across the image, you get an entire heat map of probabilities showing whether occluding a patch around a pixel reduces the probability of a Pomeranian or keeps it the same. In this particular heat map, red is the highest value and blue is a lower value. You notice that when the patch is placed on the dog's face, the probability of the image being a Pomeranian drops to a low value. This tells us that the CNN model was in fact looking at the dog's face to call this a Pomeranian.

[22:01] This entire discussion came out of Zeiler and Fergus's work on visualizing and understanding convolutional neural networks; it is a good read for you to look at at the end of this lecture. They in fact observe that when you place a gray patch on the dog's face, the class label predicted by the CNN is 'tennis ball', which is perhaps the object that takes precedence when you cover the dog's face. Similarly, in the second image you see that the true label is 'car wheel', and as you move the gray patch all over the image, the probability drops to its lowest when the wheel is actually covered. The third image is more challenging: there are two humans with a dog in between them, an Afghan Hound, which is the true label. You once again see that when the pixels corresponding to the dog are occluded by the gray patch, the probability for the Afghan Hound drops low. This is even trickier because there are humans present and the model could have been biased or affected by their presence, yet the model does well in this particular case: occluding the face of the dog causes the maximum drop in the prediction probability.

[23:26] To summarize the methods that we covered in this lecture: given a CNN, we are going to call all of these 'don't disturb the model' methods, which means we are not going to touch anything in the model; we only use the model as it is and leverage various kinds of understanding from it. You can take a convolutional layer and visualize its filters and kernels; that is one of the first methods we spoke about, and unfortunately it is only interpretable at the first layer and may not be interesting enough at higher layers, which is where we also talked about the Gabor-like filter fatigue. You could also take any other neuron, for example in a pooling layer, and visualize the patches that maximally activate that neuron; you can get some understanding of what the CNN is learning using this kind of approach. The third thing we talked about is that you can take the representations you get at the fully connected layer, apply a dimensionality reduction method such as t-SNE to them, and obtain an entire set of embeddings, for example for ImageNet. And lastly, we spoke about occlusion experiments, where you perturb the input, see what happens at the final classification layer, and obtain a heat map that tells you which part of the image the model was looking at while making a prediction.

[25:06] Recommended readings: the lecture notes of CS231n, as well as a nice deep visualization toolkit demo video on Jason Yosinski's webpage; I would advise you to look at that. You can also get to know more about t-SNE, and t-SNE visualizations as a dimensionality reduction technique, from the links provided here. Here are some references.

[Music]

Related Tags
CNN Visualization, Neural Networks, Feature Learning, AlexNet, Kernel Filters, Activation Maps, Dimensionality Reduction, t-SNE Embedding, Semantic Clustering, Occlusion Experiments, Deep Learning