Few-Shot Learning (2/3): Siamese Networks
Summary
TL;DR: This lecture explores Siamese networks, a type of deep learning model used for one-shot learning. It discusses two training methods: learning pairwise similarity scores and triplet loss. The first method involves creating positive and negative sample pairs to teach the network to distinguish between same and different classes. The second method uses triplets of images (anchor, positive, and negative samples) to refine feature extraction, aiming to minimize intra-class variation and maximize inter-class separation. The lecture concludes with the application of these networks for one-shot prediction, where the model must classify new samples based on a small support set.
Takeaways
- **Siamese Networks**: The lecture introduces Siamese networks, which are analogous to Siamese twins, physically connected but with separate bodies, used for learning similarities and differences between data samples.
- **Pairwise Similarity Training**: The first method for training Siamese networks involves learning pairwise similarity scores using positive and negative sample pairs, where positive pairs are of the same class and negative pairs are of different classes.
- **Data Preparation**: For training, a large dataset is required with labeled classes, from which positive and negative samples are prepared to teach the model about sameness and difference.
- **Positive Samples**: Positive samples are created by randomly sampling two images from the same class, labeling them as '1' to indicate they belong to the same category.
- **Negative Samples**: Negative samples are constructed by sampling one image from one class and another from a different class, labeling them as '0' to signify they are of different categories.
- **Convolutional Neural Network (CNN)**: A CNN is used for feature extraction, with layers including convolutional, pooling, and a flatten layer, to transform input images into feature vectors.
- **Feature Vector Comparison**: The network outputs two feature vectors from two input images, which are then compared by calculating the absolute difference, resulting in a vector z that represents their similarity.
- **Loss Function and Backpropagation**: The loss function measures the difference between the predicted similarity score and the actual label, using cross-entropy. Backpropagation is employed to update the model parameters to minimize this loss.
- **Model Update Process**: Both the convolutional and fully connected layers of the network are updated during training, with gradients flowing from the loss function to the dense layers and then to the convolutional layers.
- **Triplet Loss Method**: An alternative training method involves triplet loss, where an anchor, a positive sample, and a negative sample are used to encourage the network to learn a feature space where intra-class distances are small and inter-class distances are large.
- **One-Shot Prediction**: After training, the Siamese network can be used for one-shot prediction, where the model makes predictions based on a support set and a query image, identifying the class of the query by comparing it to the support set samples.
Q & A
What is the analogy behind the name 'Siamese Network'?
-The name 'Siamese Network' is an analogy to 'Siamese twins', where two individuals are physically connected. In the context of neural networks, it refers to the structure where two identical subnetworks share weights and are used to process two input samples.
How are positive samples defined in the training of a Siamese network?
-Positive samples in a Siamese network are defined as pairs of images that belong to the same class. These samples are used to teach the network to recognize similarities between items of the same category.
What role do negative samples play in training a Siamese network?
-Negative samples are pairs of images from different classes. They are used to teach the network to distinguish between different categories, ensuring that the network learns to identify dissimilarities.
How does the convolutional neural network contribute to feature extraction in a Siamese network?
-The convolutional neural network in a Siamese network is responsible for extracting feature vectors from the input images. It processes the images through convolutional and pooling layers, and outputs feature vectors that are then used to calculate similarity scores.
What is the significance of the feature vectors h1 and h2 in a Siamese network?
-The feature vectors h1 and h2 represent the outputs of the convolutional neural network for two input images. The difference between these vectors is used to calculate a similarity score, which is a key aspect of the Siamese network's functionality.
Why is the output of a Siamese network a scalar value between 0 and 1?
-The output of a Siamese network is a scalar value between 0 and 1 because it represents the similarity score between two input images. A value close to 1 indicates high similarity (same class), while a value close to 0 indicates low similarity (different classes).
What is the purpose of the sigmoid activation function in the output layer of a Siamese network?
-The sigmoid activation function in the output layer of a Siamese network is used to squash the output into a value between 0 and 1, which corresponds to the probability of the two inputs being from the same class.
How does the triplet loss method differ from the pairwise similarity method in training a Siamese network?
-The triplet loss method involves training the network using three images: an anchor, a positive sample, and a negative sample. The goal is to minimize the distance between the anchor and positive sample while maximizing the distance to the negative sample, unlike the pairwise method which focuses on pairs of images.
What is the concept of 'one-shot prediction' in the context of Siamese networks?
-One-shot prediction refers to the ability of a Siamese network to make predictions based on very few examples, typically just one. This is particularly useful in few-shot learning scenarios where the model must generalize from limited data.
How does the support set assist in one-shot prediction with a Siamese network?
-The support set provides a set of examples from known classes that the network can use for comparison when making predictions about a new, unseen query image. This allows the network to find the most similar class to the query, even if it was not present in the training data.
Outlines
Introduction to Siamese Networks
This paragraph introduces the concept of Siamese networks, drawing an analogy with Siamese twins to explain the connection between two identical subnetworks. It previews two methods for training Siamese networks: learning pairwise similarity scores and triplet loss. For the first method, a large dataset with labeled classes is used to prepare positive and negative samples. Positive samples are pairs of images from the same class, labeled '1', while negative samples are pairs from different classes, labeled '0'. The paragraph also describes the construction of a convolutional neural network for feature extraction, which outputs feature vectors from two input images. The feature vectors are then processed to calculate a similarity score between the two inputs, which should be close to '1' for the same class and close to '0' for different classes.
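The pair-construction procedure described in this paragraph can be sketched in a few lines. This is a minimal illustration, assuming the dataset is stored as a dict mapping class names to lists of images; `make_pairs` and that layout are hypothetical conveniences, not code from the lecture:

```python
import random

def make_pairs(dataset, num_pairs):
    """Build positive (label 1) and negative (label 0) training pairs.

    `dataset` is an assumed dict mapping class name -> list of images;
    any stand-in for the images (e.g. file paths) works here.
    """
    classes = list(dataset)
    pairs = []
    for _ in range(num_pairs // 2):
        # Positive pair: two images sampled from the same class, labeled 1.
        c = random.choice(classes)
        a, b = random.sample(dataset[c], 2)
        pairs.append((a, b, 1))
        # Negative pair: one image each from two different classes, labeled 0.
        c1, c2 = random.sample(classes, 2)
        pairs.append((random.choice(dataset[c1]), random.choice(dataset[c2]), 0))
    return pairs
```

Balancing positives and negatives as above mirrors the lecture's advice to prepare the same number of each.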
Training Siamese Networks with Pairwise Similarity
This paragraph delves into the training process of Siamese networks using pairwise similarity. It explains how the network is trained using positive and negative samples, with the goal of minimizing the difference between the predicted scalar and the target label using a loss function, such as cross-entropy. The training involves updating the model parameters through backpropagation and gradient descent. The paragraph also describes the structure of the Siamese network, which consists of a shared convolutional neural network for feature extraction and fully connected layers that process the difference between feature vectors to output a scalar similarity score.
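A minimal sketch of the pairwise branch described here: a shared extractor f, the element-wise absolute difference z, a dense layer, a sigmoid, and a cross-entropy loss. The random tanh projection standing in for the real CNN is purely an assumption made so the sketch is runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the shared CNN f: a fixed random projection from a
# flattened 28x28 image to a 64-dim feature vector (an assumption; the
# lecture uses convolutional and pooling layers here).
W_f = rng.normal(size=(64, 784)) / 28.0

def f(x):
    # Both branches call the SAME f with the SAME weights (shared extractor).
    return np.tanh(W_f @ x)

w, b = rng.normal(size=64), 0.0  # dense layer mapping z to a scalar

def similarity(x1, x2):
    z = np.abs(f(x1) - f(x2))                   # element-wise |h1 - h2|
    return 1.0 / (1.0 + np.exp(-(w @ z + b)))   # sigmoid -> score in (0, 1)

def bce_loss(score, target):
    # Cross-entropy between the predicted score and the 0/1 target label.
    eps = 1e-12
    return -(target * np.log(score + eps) + (1 - target) * np.log(1 - score + eps))
```

Note that for identical inputs z is the zero vector, so the score is exactly sigmoid(b); training would push scores toward 1 for positive pairs and 0 for negative pairs by backpropagating gradients of `bce_loss` through both the dense layer and the shared extractor.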
Triplet Loss for Siamese Network Training
The third paragraph introduces the triplet loss method for training Siamese networks, which involves selecting three images as a training sample: an anchor, a positive sample from the same class as the anchor, and a negative sample from a different class. The paragraph explains how the convolutional neural network extracts feature vectors from these images and calculates the squared L2 distance between the positive sample and the anchor (d_positive) and between the negative sample and the anchor (d_negative). The goal is to minimize d_positive while maximizing d_negative, with a margin (alpha) to ensure that the negative distance is significantly larger than the positive distance, indicating that the network can distinguish between different classes.
Applying Triplet Loss in Feature Space
This paragraph further discusses the application of triplet loss in the feature space. It explains the concept of encouraging the positive distance (between the anchor and positive sample) to be small and the negative distance (between the anchor and negative sample) to be large, with a margin (alpha) to ensure proper classification. The loss function is defined such that if the negative distance is not sufficiently larger than the positive distance plus the margin, a loss is incurred. This approach helps in updating the model parameters to better separate feature vectors of different classes in the feature space.
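The margin-based loss described in this and the previous paragraph can be written in a few lines on already-extracted feature vectors. `triplet_loss` is an illustrative name and `alpha=0.2` an illustrative default, not values fixed by the lecture:

```python
import numpy as np

def triplet_loss(h_anchor, h_pos, h_neg, alpha=0.2):
    """max(d_positive + alpha - d_negative, 0) on extracted feature vectors.

    d_positive and d_negative are squared L2 distances in the feature
    space, matching the lecture's definitions; alpha is the margin
    hyperparameter that d_negative must exceed d_positive by.
    """
    d_positive = np.sum((h_anchor - h_pos) ** 2)
    d_negative = np.sum((h_anchor - h_neg) ** 2)
    return max(d_positive + alpha - d_negative, 0.0)
```

When the negative is already farther from the anchor than the positive by more than alpha, the loss is zero and no gradient flows; otherwise minimizing the loss pulls the positive pair together and pushes the negative pair apart.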
One-Shot Prediction with Siamese Networks
The final paragraph discusses the application of trained Siamese networks for one-shot prediction. It explains how the network can be used to classify a query image based on a support set containing classes not present in the training set. The process involves comparing the query image with images in the support set to find similarity scores. The paragraph also summarizes the two methods of training Siamese networks: using pairwise similarity scores and triplet loss. It concludes by emphasizing the importance of the support set in providing additional information for classifying queries that do not appear in the training set.
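The nearest-support-class prediction step for the triplet-trained network might look like this sketch (function and variable names are hypothetical):

```python
import numpy as np

def predict_one_shot(query_feat, support_feats):
    """Return the support-set class whose feature vector is nearest
    (squared L2 distance) to the query's feature vector."""
    distances = {cls: float(np.sum((query_feat - h) ** 2))
                 for cls, h in support_feats.items()}
    return min(distances, key=distances.get)
```

With the pairwise-score variant, one would instead take the argmax of the similarity scores rather than the argmin of the distances; the comparison loop over the support set is otherwise the same.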
Keywords
Siamese Networks
Pairwise Similarity
Convolutional Neural Network (CNN)
Feature Vector
Backpropagation
Gradient Descent
Loss Function
Triplet Loss
One-Shot Learning
Support Set
Highlights
Introduction to Siamese networks, inspired by the physical connection of Siamese twins.
Two training methods for Siamese networks: learning pairwise similarity scores and triplet loss.
Explanation of positive and negative samples for training, with examples of tigers, cars, and elephants.
Building a convolutional neural network for feature extraction, including convolutional and pooling layers.
Training the neural network with prepared pairs and labels to output feature vectors.
Emphasis on the shared convolutional unit for feature extraction in Siamese networks.
Calculation of the difference vector (z) between feature vectors and its processing through dense layers.
Use of the sigmoid activation function to obtain a similarity score between 0 and 1.
Network structure analogy to Siamese twins, highlighting the connection and individuality.
Loss function definition using cross-entropy for training with positive and negative samples.
Backpropagation and gradient descent for updating model parameters in both convolutional and dense layers.
One-shot prediction using the trained Siamese network with a support set and query image.
Triplet loss method for training, involving an anchor, a positive sample, and a negative sample.
Definition of the triplet loss function with a margin (alpha) to encourage class separation.
Feature space explanation with examples of anchor, positive, and negative samples.
Training objective to minimize the positive distance and maximize the negative distance in feature space.
Application of the trained Siamese network for one-shot prediction with a new query and support set.
Summary of few-shot learning challenges and the role of the support set in prediction.
Conclusion and transition to the next lecture on pretraining and fine-tuning for few-shot learning.
Transcripts
In this lecture we study Siamese networks. The name "Siamese" is an analogy to Siamese twins: two babies born physically connected to each other. I will introduce two ways of training Siamese networks. The first method is learning pairwise similarity scores; you can read the two papers listed below for more details.

The Siamese network needs to be trained on a big dataset. The data are labeled, and each class contains many samples. Using the training set, we need to prepare positive samples and negative samples. Positive samples tell the model what kinds of things are the same; negative samples tell it what things are different.

Positive samples are obtained in this way. Randomly sample an image from the training set; for example, we get this tiger. Then sample another image from the same class; we get another tiger. This is a positive sample pair, and we label it as 1, meaning the two are of the same kind. We can do the same to get two cars and two elephants; the labels of these pairs are ones.

Negative samples are constructed in this way. Randomly sample an image from the entire training set; for example, we get this car. Then exclude the car class and randomly sample an image from the rest of the training set; for example, we get this elephant. Label the pair as 0; zero means the two images are different. Do the same to get more negative sample pairs, such as the husky and tiger pair and the elephant and cow pair, and label the negative pairs as zeros.
Let's build a convolutional neural network for feature extraction. The network can have convolutional layers, pooling layers, and a flatten layer. The input is an image; denote the input image by x. The output is a feature vector; denote it by f(x).

Let's train the neural network using the training data we have prepared. We have prepared many pairs, such as the two tigers, as well as their labels. The tigers are from the same class, so the label of this pair is 1. The two tigers are the input of the neural network. The convolutional neural network we built a minute ago is denoted by the function f. The network outputs two feature vectors extracted from the two input images, denoted by h1 and h2. I want to emphasize that there is only one convolutional network for feature extraction: the two f's are the same neural network we built.

Then calculate h1 minus h2. The result is a vector; take the absolute value of every entry, and let the result be the vector z. z is the difference between the two feature vectors. Then use several dense layers to process the vector z; the output of these layers is a scalar. Finally, apply the sigmoid activation function to the scalar and obtain a number between 0 and 1. The final output measures the similarity between the two inputs. If the two input images are from the same class, the output should be close to 1; if they are from different classes, the output should be close to 0.

By looking at the network structure, you can easily understand why it is called a Siamese network: Siamese twins are connected to each other. In the figure, the twins have their own bodies, but their heads are connected.
We have previously prepared the label: the two images are both tigers, so the label is 1. We set 1 as the target, and we hope the scalar output by the network is close to the target 1. We use a loss function to measure the difference between the target and the predicted scalar. The loss can be the cross-entropy of the target and the prediction, which measures the difference between the two. Having the loss, we can use backpropagation to calculate the gradients, then perform gradient descent to update the model parameters.

The model has two parts. One is the convolutional neural network, denoted by f, which extracts features from the input images; note that the two f's are exactly the same convolutional network with the same parameters. The other part is the fully connected layers, which map the vector z to a scalar between 0 and 1. During training, both parts are updated using backpropagation. The gradient flows from the loss function to the vector z and to the parameters of the dense layers. Knowing the gradient of the loss with respect to the dense layers' parameters, we can update those parameters by gradient descent. We then further propagate the gradient from the vector z to the convolutional network f and use it to update the parameters of the convolutional layers. To this end, we have performed one round of updates.

To train the model, we prepare the same number of positive samples and negative samples. A negative sample means the two images are different objects; a negative pair is labeled as 0. We hope the prediction by the network is close to 0, which means the network knows the two inputs are different. Then do the same as before: propagate the gradient from the loss to the dense layers and the convolutional layers to update the model parameters.
After training the model, we can use it for one-shot prediction. In this example, the support set is 6-way, 1-shot: there are six classes, and each class has one sample. Note that the six classes are not in the training set, which is why few-shot learning is difficult. Now we have a query. We know the query must belong to one of the six classes in the support set, so we need to choose one out of the six. We can compare the query image with the images in the support set one by one. Taking the query and the fox as input, the Siamese network predicts a score between 0 and 1; this tells us the similarity between the query and the fox is 0.2. Then compare the query image with the squirrel and get a similarity score of 0.9. Do the same to find all the similarity scores, then identify the largest among them. We find the query most similar to the squirrel, with a similarity score of 0.9, so we predict that the query is a squirrel.

We have now trained the Siamese network for computing pairwise similarity scores. Next, let's study another method for training the Siamese network: the triplet loss.
We prepare the training data in a different way. From such a training set, each time we select three images as one training sample. It works in three steps. First, randomly select an image from the entire training set as the anchor; for example, this tiger is selected and becomes the anchor. Record the anchor. Second, from the same class, randomly select an image as the positive sample; we get another tiger. Record the positive sample; the positive sample and the anchor are from the same class. Lastly, exclude the tiger class and randomly sample an image from the rest of the training set as the negative sample; we happen to get this elephant. Record the negative sample; the negative sample and the anchor are from different classes.

To this end, we have an anchor x_a, a positive sample x_pos, and a negative sample x_neg. Feed the three images to the convolutional neural network f. Although the name is Siamese network, there is only one convolutional neural network: the three f's are the same network. The convolutional neural network extracts three feature vectors from the three images.
Calculate the distance between the positive sample and the anchor in the feature space: let d_positive be the squared L2 norm of f(x_pos) - f(x_a). We do the same to find the distance between the negative sample and the anchor: let d_negative be the squared L2 norm of f(x_a) - f(x_neg).

We hope the learned network f has the following property: feature vectors from the same class are nearby, while feature vectors from different classes are well separated. Thus d_positive should be small, because the positive sample and the anchor belong to the same class, and d_negative should be large, because the negative sample and the anchor are from different classes.

I want to reiterate the relation among the three samples. This is the feature space; the convolutional neural network maps images to feature vectors. This is the anchor, a tiger image; its feature vector is the red dot. This is the positive sample, another tiger; its feature vector is the green dot. The squared distance between these two feature vectors is d_positive, and we hope it is as small as possible. The elephant is the negative sample; it is from a different class, and its feature vector is the blue dot, extracted by the same convolutional neural network. Let d_negative be the squared distance between the blue and red feature vectors; it measures how different the negative sample is from the anchor. We hope d_negative is as large as possible. d_negative should be much larger than d_positive; otherwise the model cannot distinguish between the tiger and the elephant.
Based on the idea we discussed, let's define the loss function. d_positive is the squared L2 distance between the positive sample and the anchor in the feature space. Intuitively, the feature vectors of two tigers should be close, so we encourage d_positive to be small. d_negative is the squared L2 distance between the negative sample and the anchor in the feature space. We hope the feature vectors of different classes are far apart; the feature vector of an elephant should be far from the tigers', so we encourage d_negative to be big.

We can define a margin alpha. Alpha is positive; it is a tuning hyperparameter. Ideally, d_negative is big and d_positive is small. In our example, d_negative is the distance between an elephant and a tiger, while d_positive is the distance between two tigers, so d_negative should be much larger than d_positive. If d_negative is greater than d_positive by a margin of alpha, then we believe the classification is correct, and the loss is zero. If the condition is not satisfied, which means d_negative is not sufficiently larger than d_positive, then we consider it a failure: the model cannot tell the difference between an elephant and a tiger, and there should be a loss. Let the loss be d_positive plus alpha minus d_negative. We encourage the loss to be small: a small loss means d_positive is small, so the two tigers are close in the feature space, and it also means d_negative is big, so the tiger and elephant can be well separated.

In sum, we define the loss function as follows. If d_positive + alpha - d_negative is greater than zero, then that quantity is the loss; it means we cannot distinguish between an elephant and a tiger. Otherwise, if d_positive + alpha - d_negative is less than zero, which means the classification is right, then there is no loss; the loss is simply zero. Such a loss function is called the triplet loss. It is based on a triplet of samples: the anchor, the positive sample, and the negative sample. With the loss function at hand, we can take the derivative of the loss with respect to the model parameters and then perform gradient descent to update them. After an update, the elephant and the tiger will be farther apart in the feature space, whereas the tigers will be closer.
After training the Siamese network, we can use the network for one-shot prediction. We are given a support set; the classes in the support set are not contained in the training set. We are given a query image; it belongs to one of the six classes in the support set, and we want to classify it. We compare the query image with the images in the support set in this way: use the convolutional neural network to extract features from all the images, then compute the distances between the feature vectors. For example, the query and the fox have a distance of 231 in the feature space, while the query and the squirrel have a distance of 19. Do the same to compute all the distances, then find the smallest. The distance between the query image and the squirrel class is 19, which is the smallest among all the distances, so the model believes the query image is most similar to the squirrel class and thereby predicts that the query is a squirrel.
In this lecture we learned Siamese networks for solving few-shot learning. Let me summarize. Here is the basic idea of solving few-shot learning: we first train a Siamese network on a large-scale training set; the Siamese network learns the similarity and difference between things. After training, we can use the Siamese network to make predictions. What makes few-shot learning different from standard supervised learning is that the query's class does not appear in the training set. For example, the query is a squirrel, but the training set does not have a squirrel class. Thus, to recognize the query, we must provide additional information: the support set.

The support set is called k-way, n-shot. K-way means the support set has k classes; the more classes there are, the harder the prediction. N-shot means each class has n samples; the fewer samples there are, the harder the prediction. The hardest problem is one-shot learning, that is, making a prediction based on only one sample per class. With the trained Siamese network at hand, we can compare the query with every sample in the support set to find similarity scores, then take the class of the support-set sample with the highest similarity score as the prediction.

I have elaborated on two ways of training the Siamese network. One is using the Siamese network to predict a pairwise similarity score. Each time, we select a pair of two images as the input of the Siamese network. The images are transformed by convolutional layers and dense layers, and the output is a similarity score between 0 and 1: 1 means the two inputs are from the same class, and 0 means the two inputs are different. The target is either 0 or 1; if the two inputs are from the same class, the target is 1, otherwise the target is 0. Define the loss function as the difference between the prediction and the target. The goal of training is to minimize the loss, equivalently making the prediction closer to the target. In this way, the learned neural network can predict the similarity between two inputs.

The other way of training the Siamese network is to use the triplet loss. Each time, select three images as inputs: the anchor x_a, the positive sample x_pos, and the negative sample x_neg. Then use the convolutional neural network to extract features from the inputs, obtaining three feature vectors. Let d_positive be the squared distance between the positive sample and the anchor in the feature space, and let d_negative be the squared distance between the negative sample and the anchor. The objective of training is to make d_positive as small as possible, that is, to make the two tigers close in the feature space, and also to make d_negative as big as possible, that is, to make the tiger far from the elephant in the feature space. With the trained network at hand, we can compare the query image and the labeled images in the feature space; the prediction is made based on the distances in the feature space.

I have finished teaching Siamese networks. Thank you for watching this video. The link to my slides can be found below the video. In the next lecture, I will introduce pretraining and fine-tuning for few-shot learning.