Few-Shot Learning (2/3): Siamese Networks

Shusen Wang
2 Dec 2020 · 23:40

Summary

TL;DR: This lecture explores Siamese networks, a type of deep learning model used for one-shot learning. It discusses two training methods: learning pairwise similarity scores and triplet loss. The first method involves creating positive and negative sample pairs to teach the network to distinguish between same and different classes. The second method uses triplets of images (anchor, positive, and negative samples) to refine feature extraction, aiming to minimize intra-class variation and maximize inter-class separation. The lecture concludes with the application of these networks for one-shot prediction, where the model must classify new samples based on a small support set.

Takeaways

  • 🔍 **Siamese Networks**: The lecture introduces Siamese networks, which are analogous to Siamese twins, physically connected but with separate bodies, used for learning similarities and differences between data samples.
  • 📈 **Pairwise Similarity Training**: The first method for training Siamese networks involves learning pairwise similarity scores using positive and negative sample pairs, where positive pairs are of the same class and negative pairs are of different classes.
  • 📚 **Data Preparation**: For training, a large dataset is required with labeled classes, from which positive and negative samples are prepared to teach the model about sameness and difference.
  • 🐅 **Positive Samples**: Positive samples are created by randomly sampling two images from the same class, labeling them as '1' to indicate they belong to the same category.
  • 🚫 **Negative Samples**: Negative samples are constructed by sampling one image from one class and another from a different class, labeling them as '0' to signify they are of different categories.
  • 🧠 **Convolutional Neural Network (CNN)**: A CNN is used for feature extraction, with layers including convolutional, pooling, and a flatten layer, to transform input images into feature vectors.
  • 🔗 **Feature Vector Comparison**: The network outputs two feature vectors from two input images, which are then compared by calculating the absolute difference, resulting in a vector z that represents their similarity.
  • 📊 **Loss Function and Backpropagation**: The loss function measures the difference between the predicted similarity score and the actual label, using cross-entropy. Backpropagation is employed to update the model parameters to minimize this loss.
  • 🔄 **Model Update Process**: Both the convolutional and fully connected layers of the network are updated during training, with gradients flowing from the loss function to the dense layers and then to the convolutional layers.
  • 📊 **Triplet Loss Method**: An alternative training method involves triplet loss, where an anchor, a positive sample, and a negative sample are used to encourage the network to learn a feature space where intra-class distances are small and inter-class distances are large.
  • 🔎 **One-Shot Prediction**: After training, the Siamese network can be used for one-shot prediction, where the model makes predictions based on a support set and a query image, identifying the class of the query by comparing it to the support set samples.

Q & A

  • What is the analogy behind the name 'Siamese Network'?

    -The name 'Siamese Network' is an analogy to 'Siamese twins', where two individuals are physically connected. In the context of neural networks, it refers to the structure where two identical subnetworks share weights and are used to process two input samples.

  • How are positive samples defined in the training of a Siamese network?

    -Positive samples in a Siamese network are defined as pairs of images that belong to the same class. These samples are used to teach the network to recognize similarities between items of the same category.

  • What role do negative samples play in training a Siamese network?

    -Negative samples are pairs of images from different classes. They are used to teach the network to distinguish between different categories, ensuring that the network learns to identify dissimilarities.

  • How does the convolutional neural network contribute to feature extraction in a Siamese network?

    -The convolutional neural network in a Siamese network is responsible for extracting feature vectors from the input images. It processes the images through convolutional and pooling layers, and outputs feature vectors that are then used to calculate similarity scores.

  • What is the significance of the feature vectors h1 and h2 in a Siamese network?

    -The feature vectors h1 and h2 represent the outputs of the convolutional neural network for two input images. The difference between these vectors is used to calculate a similarity score, which is a key aspect of the Siamese network's functionality.

  • Why is the output of a Siamese network a scalar value between 0 and 1?

    -The output of a Siamese network is a scalar value between 0 and 1 because it represents the similarity score between two input images. A value close to 1 indicates high similarity (same class), while a value close to 0 indicates low similarity (different classes).

  • What is the purpose of the sigmoid activation function in the output layer of a Siamese network?

    -The sigmoid activation function in the output layer of a Siamese network is used to squash the output into a value between 0 and 1, which corresponds to the probability of the two inputs being from the same class.

  • How does the triplet loss method differ from the pairwise similarity method in training a Siamese network?

    -The triplet loss method involves training the network using three images: an anchor, a positive sample, and a negative sample. The goal is to minimize the distance between the anchor and positive sample while maximizing the distance to the negative sample, unlike the pairwise method which focuses on pairs of images.

  • What is the concept of 'one-shot prediction' in the context of Siamese networks?

    -One-shot prediction refers to the ability of a Siamese network to make predictions based on very few examples, typically just one. This is particularly useful in few-shot learning scenarios where the model must generalize from limited data.

  • How does the support set assist in one-shot prediction with a Siamese network?

    -The support set provides a set of examples from known classes that the network can use for comparison when making predictions about a new, unseen query image. This allows the network to find the most similar class to the query, even if it was not present in the training data.

Outlines

00:00

🔬 Introduction to Siamese Networks

This paragraph introduces the concept of Siamese networks, drawing an analogy with Siamese twins to explain the connection between two neural networks. It mentions two methods for training Siamese networks, learning pairwise similarity scores and triplet loss, and then focuses on the first: a large dataset with labeled classes is used to prepare positive and negative samples. Positive samples are pairs of images from the same class, labeled as '1', while negative samples are pairs from different classes, labeled as '0'. The paragraph also describes the construction of a convolutional neural network for feature extraction, which outputs a feature vector for each of the two input images. The feature vectors are then processed to calculate a similarity score between the two inputs, which should be close to '1' for the same class and close to '0' for different classes.
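The architecture described above (a shared CNN f, the element-wise absolute difference z = |h1 - h2|, dense layers, and a sigmoid output) can be written down in a few lines of PyTorch. This is a minimal sketch, not the lecture's exact model: the layer sizes, the 128-dimensional feature vector, and the name `SiamesePairNet` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SiamesePairNet(nn.Module):
    """Pairwise Siamese network: shared CNN backbone + |h1 - h2| + dense head."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Shared feature extractor f(x): conv + pooling layers, then flatten.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(feat_dim),  # infers the flattened size on first use
        )
        # Dense layers that map z = |h1 - h2| to a similarity score in (0, 1).
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x1, x2):
        h1 = self.backbone(x1)   # feature vector of the first image
        h2 = self.backbone(x2)   # second image, same weights (one network f)
        z = torch.abs(h1 - h2)   # element-wise absolute difference
        return self.head(z)      # similarity score between 0 and 1
```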

05:00

🤖 Training Siamese Networks with Pairwise Similarity

This paragraph delves into the training process of Siamese networks using pairwise similarity. It explains how the network is trained using positive and negative samples, with the goal of minimizing the difference between the predicted scalar and the target label using a loss function, such as cross-entropy. The training involves updating the model parameters through backpropagation and gradient descent. The paragraph also describes the structure of the Siamese network, which consists of a shared convolutional neural network for feature extraction and fully connected layers that process the difference between feature vectors to output a scalar similarity score.
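A single training step for the pairwise method might look like the sketch below, assuming the `SiamesePairNet` above and a standard optimizer such as Adam; binary cross-entropy plays the role of the loss described in this paragraph.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # cross-entropy between the predicted score and the 0/1 target

def train_step(model, optimizer, x1, x2, y):
    """One gradient-descent update on a batch of labeled pairs.
    y is a float tensor of targets: 1.0 for positive pairs, 0.0 for negative."""
    pred = model(x1, x2).squeeze(1)   # predicted similarity in (0, 1)
    loss = bce(pred, y)               # difference between prediction and target
    optimizer.zero_grad()
    loss.backward()                   # gradients flow to the dense and conv layers
    optimizer.step()                  # update both parts of the model
    return loss.item()

# Illustrative usage:
# model = SiamesePairNet()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = train_step(model, optimizer, x1_batch, x2_batch, y_batch)
```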

10:02

🐯 Triplet Loss for Siamese Network Training

The third paragraph introduces the triplet loss method for training Siamese networks, which involves selecting three images as a training sample: an anchor, a positive sample from the same class as the anchor, and a negative sample from a different class. The paragraph explains how the convolutional neural network extracts feature vectors from these images and calculates the squared L2 distance between the positive sample and the anchor (d_positive) and between the negative sample and the anchor (d_negative). The goal is to minimize d_positive while maximizing d_negative, with a margin (alpha) to ensure that the negative distance is significantly larger than the positive distance, indicating that the network can distinguish between different classes.
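The triplet selection itself is easy to sketch. The helper below assumes the training set is organized as a dictionary mapping class names to lists of images; that layout and the function name are illustrative assumptions, not the lecture's data format.

```python
import random

def sample_triplet(dataset_by_class):
    """Draw (anchor, positive, negative): the positive shares the anchor's class,
    the negative comes from any other class."""
    classes = list(dataset_by_class)
    c_anchor = random.choice(classes)                        # e.g. "tiger"
    anchor, positive = random.sample(dataset_by_class[c_anchor], 2)
    c_negative = random.choice([c for c in classes if c != c_anchor])
    negative = random.choice(dataset_by_class[c_negative])   # e.g. an elephant image
    return anchor, positive, negative
```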

15:04

📊 Applying Triplet Loss in Feature Space

This paragraph further discusses the application of triplet loss in the feature space. It explains the concept of encouraging the positive distance (between the anchor and positive sample) to be small and the negative distance (between the anchor and negative sample) to be large, with a margin (alpha) to ensure proper classification. The loss function is defined such that if the negative distance is not sufficiently larger than the positive distance plus the margin, a loss is incurred. This approach helps in updating the model parameters to better separate feature vectors of different classes in the feature space.
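In code, the loss described here is max(d_positive + alpha - d_negative, 0). Below is a minimal sketch of the loss and one update step, assuming a shared CNN `backbone` (the function f) that maps a batch of images to feature vectors; the default margin value is illustrative.

```python
import torch

def triplet_loss(h_anchor, h_pos, h_neg, alpha=1.0):
    """max(d_positive + alpha - d_negative, 0), averaged over the batch."""
    d_pos = torch.sum((h_pos - h_anchor) ** 2, dim=-1)  # squared L2 distance
    d_neg = torch.sum((h_neg - h_anchor) ** 2, dim=-1)
    return torch.clamp(d_pos + alpha - d_neg, min=0.0).mean()

def triplet_step(backbone, optimizer, x_anchor, x_pos, x_neg, alpha=1.0):
    """One gradient-descent update on a batch of triplets."""
    loss = triplet_loss(backbone(x_anchor), backbone(x_pos), backbone(x_neg), alpha)
    optimizer.zero_grad()
    loss.backward()     # pulls same-class features together, pushes others apart
    optimizer.step()
    return loss.item()
```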

20:04

🔍 One-Shot Prediction with Siamese Networks

The final paragraph discusses the application of trained Siamese networks for one-shot prediction. It explains how the network can be used to classify a query image based on a support set containing classes not present in the training set. The process involves comparing the query image with images in the support set to find similarity scores. The paragraph also summarizes the two methods of training Siamese networks: using pairwise similarity scores and triplet loss. It concludes by emphasizing the importance of the support set in providing additional information for classifying queries that do not appear in the training set.
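One-shot prediction with the pairwise-trained network reduces to an argmax over similarity scores, roughly as in the sketch below (function and variable names are illustrative).

```python
import torch

@torch.no_grad()
def one_shot_predict(model, query, support_images, support_labels):
    """Score the query against every support image and return the label of the
    most similar one; `model` is a pairwise Siamese network with outputs in (0, 1)."""
    scores = [model(query.unsqueeze(0), img.unsqueeze(0)).item()
              for img in support_images]
    best = max(range(len(scores)), key=scores.__getitem__)
    return support_labels[best], scores[best]
```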

Keywords

💡Siamese Networks

Siamese Networks are a type of neural network used for learning to recognize whether two inputs are similar or dissimilar to each other. Named after Siamese twins who are physically connected, these networks share weights in their architecture. In the video, the Siamese Network is used for training on a large dataset to learn the similarity and difference between classes. The network is trained to output a similarity score between 0 and 1, where 1 indicates the same class and 0 indicates a different class.
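As a quick illustration of the input/output contract, using the `SiamesePairNet` sketched earlier (the image size here is an arbitrary assumption):

```python
import torch

model = SiamesePairNet(feat_dim=128)
x1 = torch.randn(1, 3, 105, 105)   # first image, batch of 1
x2 = torch.randn(1, 3, 105, 105)   # second image
score = model(x1, x2)              # tensor of shape (1, 1)
print(float(score))                # near 1 -> same class, near 0 -> different class
```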

💡Pairwise Similarity

Pairwise similarity refers to the process of comparing two items to determine how similar they are. In the context of the video, the Siamese Network is trained to learn pairwise similarity scores, which are used to label whether two input images belong to the same class or not. Positive samples are labeled as 1, indicating they are of the same kind, while negative samples are labeled as 0, indicating they are of different kinds.
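Pair construction can be sketched as below, again assuming the training images are grouped in a dict keyed by class name (an illustrative layout, not the lecture's data format).

```python
import random

def sample_pair(dataset_by_class, positive):
    """Return (img1, img2, label): label 1 for a same-class pair, 0 otherwise."""
    classes = list(dataset_by_class)
    if positive:
        c = random.choice(classes)                     # e.g. "tiger"
        img1, img2 = random.sample(dataset_by_class[c], 2)
        return img1, img2, 1
    c1, c2 = random.sample(classes, 2)                 # two different classes
    return random.choice(dataset_by_class[c1]), random.choice(dataset_by_class[c2]), 0
```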

💡Convolutional Neural Network (CNN)

A Convolutional Neural Network is a type of deep learning model that is particularly effective for image processing tasks. In the video, a CNN is used for feature extraction from images. The CNN processes the input images and outputs feature vectors, which are then used to calculate the similarity between the images. The CNN is a critical component of the Siamese Network, as it provides the necessary feature representation for similarity comparison.

💡Feature Vector

A feature vector is a set of numbers that represent the properties of an object in a mathematical manner. In the video, the feature vectors are the outputs of the CNN and are used to represent the characteristics of the input images. The Siamese Network compares these feature vectors to determine the similarity between two images.

💡Backpropagation

Backpropagation is a method used to update the weights of a neural network by calculating the gradient of the loss function with respect to the network's parameters. In the video, backpropagation is used to update the parameters of both the convolutional layers and the dense layers of the Siamese Network, allowing the network to learn from the training data and improve its ability to predict similarity.

💡Gradient Descent

Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of the steepest descent as defined by the negative of the gradient. In the video, gradient descent is used to update the model parameters of the Siamese Network after calculating the gradients during backpropagation, with the goal of minimizing the loss function.

💡Loss Function

A loss function is a measure of how well the model's predictions match the actual data. In the video, the loss function is used to measure the difference between the predicted similarity score and the actual label (0 or 1). The goal of training the Siamese Network is to minimize this loss function, which helps the network to learn to make more accurate predictions.

💡Triplet Loss

Triplet loss is a loss function used in training neural networks for tasks such as face recognition, where the goal is to learn a distance metric between feature vectors. In the video, triplet loss is introduced as an alternative training method for the Siamese Network, where the training data consists of a triplet of images: an anchor, a positive sample, and a negative sample. The loss function encourages the network to learn a metric where the distance between the anchor and positive sample is small, and the distance between the anchor and negative sample is large.

💡One-Shot Learning

One-shot learning refers to a machine learning paradigm where the model is expected to learn from only a single example of a new class. In the video, after training the Siamese Network, it is used for one-shot prediction where the model must classify a query image based on a support set that contains only one sample per class. This is a challenging task because the query's class is not present in the training set.

💡Support Set

A support set is a collection of examples used in few-shot learning tasks, where the goal is to make predictions based on a small number of samples per class. In the video, the support set is used in one-shot learning, where the model must classify a query image by comparing it to the images in the support set. The support set provides the context necessary for the model to make predictions about new classes.
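With a triplet-trained network, the comparison against the support set is done in the feature space: embed the query and every support image with the shared CNN and pick the class at the smallest squared distance. A minimal sketch (names are illustrative):

```python
import torch

@torch.no_grad()
def predict_by_distance(backbone, query, support_images, support_labels):
    """Return the support label whose image is closest to the query in feature space."""
    q = backbone(query.unsqueeze(0))
    dists = [torch.sum((q - backbone(img.unsqueeze(0))) ** 2).item()
             for img in support_images]
    best = min(range(len(dists)), key=dists.__getitem__)
    return support_labels[best], dists[best]
```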

Highlights

Introduction to Siamese networks, inspired by the physical connection of Siamese twins.

Two training methods for Siamese networks: learning pairwise similarity scores and triplet loss.

Explanation of positive and negative samples for training, with examples of tigers, cars, and elephants.

Building a convolutional neural network for feature extraction, including convolutional and pooling layers.

Training the neural network with prepared pairs and labels to output feature vectors.

Emphasis on the shared convolutional unit for feature extraction in Siamese networks.

Calculation of the difference vector (z) between feature vectors and its processing through dense layers.

Use of the sigmoid activation function to obtain a similarity score between 0 and 1.

Network structure analogy to Siamese twins, highlighting the connection and individuality.

Loss function definition using cross-entropy for training with positive and negative samples.

Backpropagation and gradient descent for updating model parameters in both convolutional and dense layers.

One-shot prediction using the trained Siamese network with a support set and query image.

Triplet loss method for training, involving an anchor, a positive sample, and a negative sample.

Definition of the triplet loss function with a margin (alpha) to encourage class separation.

Feature space explanation with examples of anchor, positive, and negative samples.

Training objective to minimize the positive distance and maximize the negative distance in feature space.

Application of the trained Siamese network for one-shot prediction with a new query and support set.

Summary of few-shot learning challenges and the role of the support set in prediction.

Conclusion and transition to the next lecture on pretraining and fine-tuning for few-shot learning.

Transcripts

00:01

In this lecture we study the Siamese network. The name "Siamese" is an analogy to Siamese twins: Siamese twins are two babies born physically connected to each other. I will introduce two ways of training Siamese networks. The first method is learning pairwise similarity scores; you can read the two papers listed below for more details.

00:33

The Siamese network needs to be trained using a big dataset. The data are labeled, and each class contains many samples. We need to prepare positive samples and negative samples using the training set. Positive samples tell the model what kinds of things are of the same kind; negative samples tell it what are of different kinds.

01:01

Positive samples are obtained in this way: randomly sample an image from the training set (for example, we get this tiger), then sample another image from the same class (we get another tiger). This is a positive sample pair, and we label it as 1, meaning the two are of the same kind. We can do the same to get two cars and two elephants; the labels of those pairs are also 1.

01:37

Negative samples are constructed in this way: randomly sample an image from the entire training set (for example, we get this car), then exclude the car class and randomly sample an image from the rest of the training set (for example, we get this elephant). Label the pair as 0; 0 means the two images are different. Do the same to get more negative sample pairs, such as the husky-and-tiger pair and the elephant-and-cow pair, and label the negative pairs as 0.

02:19

Let's build a convolutional neural network for feature extraction. The network can have convolutional layers, pooling layers, and a flatten layer. The input is an image; denote the input image by x. The output is a feature vector; denote it by f(x).

02:46

Let's train the neural network using the training data we have prepared. We have prepared many pairs, such as the two tigers, as well as their labels; the tigers are from the same class, so the label of this pair is 1. The two tigers are the input of the network. The convolutional neural network we built a minute ago is denoted by the function f. The network outputs two feature vectors extracted from the two input images; the feature vectors are denoted by h1 and h2.

03:30

I want to emphasize that there is only one convolutional neural network for feature extraction: the two f functions are the same network. Then calculate h1 minus h2; the result is a vector. Take the absolute value of every entry in the vector and let the result be the vector z. z is the difference between the two feature vectors.

04:07

Then use several dense layers to process the vector z; the output of the layers is a scalar. Finally, apply the sigmoid activation function to the scalar and obtain a number between 0 and 1. This final output measures the similarity between the two inputs: if the two input images are from the same class, the output should be close to 1; otherwise, if the input images are from different classes, the output should be close to 0.

04:48

By looking at the network structure, you can easily understand why the network is called a Siamese network. Siamese twins are connected to each other: in the figure, the twins have their own bodies, but their heads are connected.

05:09

We have previously prepared the label. The two images are both tigers, so the label is 1. We set 1 as the target, and we hope the scalar output by the network is close to the target 1. We use a loss function to measure the difference between the target and the predicted scalar; the loss can be the cross entropy of the target and the prediction, which measures the difference between the two.

05:44

Having the loss, we can use backpropagation to calculate the gradients, and then perform gradient descent to update the model parameters. The model has two parts. One is the convolutional neural network, denoted by f, which extracts features from the input images; note that the two f's are exactly the same convolutional neural network with the same model parameters. The other part is the fully connected layers, which map the vector z to a scalar between 0 and 1. During training, both parts are updated.

06:30

Using backpropagation, the gradient flows from the loss function to the vector z and to the parameters of the dense layers. Knowing the gradient of the loss with respect to the dense layers' parameters, we can update those parameters by gradient descent. Further propagate the gradient from the vector z to the convolutional network f; then we can use the gradient to update the parameters of the convolutional layers. To this end, we have performed one round of updates.

07:12

To train the model, we prepare the same number of positive samples and negative samples. Negative samples mean the two images are different objects; a negative pair is labeled as 0, and we hope the prediction by the network is close to 0, which means the network knows the two inputs are different. Then do the same as before: propagate the gradient from the loss to the dense layers and the convolutional layers to update the model parameters.

07:50

After training the model, we can use it for one-shot prediction. In this example, the support set is 6-way 1-shot: there are six classes, and each class has one sample. Note that the six classes are not in the training set, which is why few-shot learning is difficult.

08:15

Now we have a query. We know the query must belong to one of the six classes in the support set, and we need to choose one of them. We can compare the query image with the images in the support set one by one. Taking the query and the fox as input, the Siamese network predicts a score between 0 and 1; we find the similarity between the query and the fox is 0.2. Then compare the query image with the squirrel and get a similarity score of 0.9. Do the same to find all the similarity scores, then identify the largest among them. We find the query most similar to the squirrel, with a similarity score of 0.9, so we predict that the query is a squirrel.

09:24

We have trained the Siamese network for computing pairwise similarity scores. Next, let's study another method for training the Siamese network: triplet loss. We prepare the training data in a different way. Each time, we select three images from the training set as one training sample, in three steps. First, randomly select an image from the entire training set as the anchor; for example, this tiger is selected and becomes the anchor. Record the anchor. Then, from the same class, randomly select an image as the positive sample; we get another tiger. Record the positive sample; the positive sample and the anchor are from the same class. Lastly, exclude the tiger class and randomly sample an image from the rest of the training set as the negative sample; we happen to get this elephant. Record the negative sample; the negative sample and the anchor are from different classes.

10:58

To this end, we have an anchor x_anchor, a positive sample x_positive, and a negative sample x_negative. Feed the three images to the convolutional neural network f. Although the name is Siamese network, there is only one convolutional neural network; the three f's are the same network. The convolutional neural network extracts three feature vectors from the three images.

11:34

Calculate the distance between the positive sample and the anchor in the feature space: let d_positive be the squared L2 norm of f(x_positive) minus f(x_anchor). We do the same to find the distance between the negative sample and the anchor in the feature space: let d_negative be the squared L2 norm of f(x_anchor) minus f(x_negative).

12:08

We hope the learned network f has this property: feature vectors from the same class are nearby, while feature vectors from different classes are well separated. Thus d_positive should be small, because the positive sample and the anchor belong to the same class, and d_negative should be large, because the negative sample and the anchor are from different classes.

12:41

I want to reiterate the relation among the three samples. This is the feature space; the convolutional neural network maps images to feature vectors. This is the anchor, a tiger image; its feature vector is the red dot. This is the positive sample, another tiger; its feature vector is the green dot. The squared distance between the two feature vectors is d_positive, and we hope d_positive is as small as possible. The elephant is the negative sample; it is from a different class, and its feature vector is the blue dot extracted by the convolutional neural network. Let d_negative be the squared distance between the blue and red feature vectors; it measures how different the negative sample is from the anchor. We hope d_negative is as large as possible. d_negative should be much larger than d_positive; otherwise the model cannot distinguish between a tiger and an elephant.

14:07

Based on this idea, let's define the loss function. d_positive is the squared L2 distance between the positive sample and the anchor in the feature space; intuitively, the feature vectors of two tigers should be close, so we encourage d_positive to be small. d_negative is the squared L2 distance between the negative sample and the anchor in the feature space; we hope the feature vectors of different classes are far apart, so the feature vector of an elephant should be far from the tigers', and we thereby encourage d_negative to be big.

14:57

We can define a margin alpha. Alpha is positive; it is a tuning hyperparameter. Ideally, d_negative is big and d_positive is small. In our example, d_negative is the distance between an elephant and a tiger, while d_positive is the distance between two tigers; d_negative should be much larger than d_positive.

15:29

If d_negative is greater than d_positive by a margin of alpha, then we believe the classification is correct, and the loss is zero. If the condition is not satisfied, which means d_negative is not sufficiently larger than d_positive, then we consider it a failure: the model cannot tell the difference between an elephant and a tiger, so there should be a loss. Let the loss be d_positive plus alpha minus d_negative. We encourage the loss to be small: a small loss means d_positive is small, so the two tigers are close in the feature space, and it also means d_negative is big, so the tiger and the elephant can be well separated.

16:26

In sum, we define the loss function as follows: if d_positive plus alpha minus d_negative is greater than zero, then that quantity is the loss, which means we cannot distinguish between an elephant and a tiger. Otherwise, if d_positive plus alpha minus d_negative is less than zero, the classification is right and there is no loss; the loss is simply zero. Such a loss function is called the triplet loss; it is based on a triplet of samples: the anchor, the positive sample, and the negative sample.

17:10

With the loss function at hand, we can take the derivative of the loss with respect to the model parameters and then perform gradient descent to update them. After an update, the elephant and the tiger will be farther apart in the feature space, whereas the two tigers will be closer.

17:33

After training the Siamese network, we can use the network for one-shot prediction. We are given a support set whose classes are not contained in the training set, and a query image that belongs to one of the six classes in the support set; we want to classify the query image. We compare the query image with the images in the support set in this way: use the convolutional neural network to extract features from all the images, then compute the distances between the feature vectors. For example, the query and the fox have a distance of 231 in the feature space, while the query and the squirrel have a distance of 19. Do the same to compute all the distances, then find the smallest one. The distance between the query image and the squirrel class is 19, which is the smallest among all the distances, so the model believes the query image is most similar to the squirrel class and predicts that the query is a squirrel.

18:59

In this lecture we learned Siamese networks for solving few-shot learning. Let me summarize. Here is the basic idea: we first train a Siamese network on a large-scale training set; the Siamese network learns the similarity and difference between things. After training, we can use the Siamese network to make predictions. What makes few-shot learning different from standard supervised learning is that the query's class does not appear in the training set; for example, the query is a squirrel, but the training set does not have a squirrel class. Thus, to recognize the query, we must provide additional information, and that additional information is the support set.

19:55

The support set is called k-way n-shot. K-way means the support set has k classes; the more classes there are, the harder the prediction is. N-shot means each class has n samples; the fewer samples there are, the harder the prediction is. The hardest problem is one-shot learning, that is, making a prediction based on only one sample per class. With the trained Siamese network at hand, we can compare the query with every sample in the support set to find similarity scores, then use the sample in the support set that has the highest similarity score as the prediction.

20:45

I have elaborated on two ways of training the Siamese network. One is using the Siamese network to predict the pairwise similarity score. Each time, we select a pair of two images as the input of the Siamese network; the images are transformed by convolutional layers and dense layers, and the output is a similarity score between 0 and 1. A score of 1 means the two inputs are from the same class; 0 means the two inputs are different. The target is either 0 or 1: if the two inputs are from the same class, the target is 1; otherwise, the target is 0. Define the loss function as the difference between the prediction and the target. The goal of training is to minimize the loss, which is equivalent to making the prediction closer to the target. In this way, the learned neural network can predict the similarity between two inputs.

22:01

The other way of training the Siamese network is to use the triplet loss. Each time, select three images as inputs: the anchor x_anchor, the positive sample x_positive, and the negative sample x_negative. Then use the convolutional neural network to extract features from the inputs, obtaining three feature vectors. Let d_positive be the squared distance between the positive sample and the anchor in the feature space, and let d_negative be the squared distance between the negative sample and the anchor. The objective of training is to make d_positive as small as possible, that is, to make the two tigers close in the feature space, and to make d_negative as big as possible, that is, to make the tiger far from the elephant in the feature space. With the trained network at hand, we can compare the query image and the labeled images in the feature space; the prediction is made based on the distances in the feature space.

23:23

I have finished teaching Siamese networks. Thank you for watching this video. The link to my slides can be found below the video. In the next lecture, I will introduce pretraining and fine-tuning for few-shot learning.


Related Tags
Siamese Networks, One-Shot Learning, Machine Learning, Feature Extraction, Deep Learning, Convolutional Layers, Similarity Scores, Neural Networks, AI Training, Image Recognition