Self-supervised learning and pseudo-labelling

Samuel Albanie
27 Apr 2022 · 24:24

Summary

TL;DR: This video delves into self-supervised learning and pseudo-labeling, key concepts in machine perception. It highlights the limitations of supervised learning and explores how self-supervised learning, inspired by human multimodal learning, can improve model performance. The video discusses various pretext tasks and the instance discrimination approach, which leverages visual similarity for representation learning. Additionally, it covers pseudo-labeling in semi-supervised learning, demonstrating its effectiveness in tasks like word sense disambiguation and image classification, and showcasing its potential for handling large unannotated datasets.

Takeaways

  • 🤖 **Self-Supervised Learning Motivation**: The need for self-supervised learning arises from the limitations of supervised learning, where large annotated datasets are required and models still make mistakes that humans wouldn't.
  • 👶 **Inspiration from Human Development**: Human babies learn in an incremental, multi-modal, and exploratory manner, which inspires self-supervised learning methods that mimic these natural learning strategies.
  • 🔄 **Redundancy in Sensory Signals**: Self-supervised learning leverages redundancy in sensory signals: the redundant part of the signal, which can be predicted from the rest, serves as a label for training a predictive model.
  • 🧠 **Helmholtz's Insight**: The concept of self-supervision is rooted in Helmholtz's idea that our interactions with the world can be thought of as experiments to test our understanding of the invariant relations of phenomena.
  • 🐮 **Self-Supervised Learning Defined**: It involves creating supervision from the learner's own experience, such as predicting outcomes based on movements or changes in the environment.
  • 🔀 **Barlow's Coding**: Barlow's work suggests that learning pairwise associations can be simplified by representing events in a way that makes them statistically independent, reducing the storage needed for prior event probabilities.
  • 📈 **Pseudo-Labeling**: Pseudo-labeling is a semi-supervised learning technique where a model predicts labels for unlabeled data, which are then used to retrain the model, often leading to improved performance.
  • 🎲 **Pretext Tasks**: In computer vision, pretext tasks like image patch prediction or jigsaw puzzles are used to train models without explicit labeling, in the hope that they learn useful representations of the visual world.
  • 🔄 **Instance Discrimination**: A powerful self-supervised learning approach that trains models to distinguish between individual image instances, capturing fine-grained visual similarity without requiring semantic class labels.
  • 🔧 **Practical Challenges**: There are practical challenges in creating effective pretext tasks, such as the risk of models learning to 'cheat' by exploiting unintended cues rather than understanding the task's underlying concepts.

Q & A

  • What are the two main topics discussed in the video?

    -The two main topics discussed in the video are self-supervised learning and pseudo-labeling.

  • What is the motivation behind self-supervised learning?

    -The motivation behind self-supervised learning is to improve machine perception by taking inspiration from early stages of human development, particularly the way humans learn in a multi-modal and incremental manner.

  • How does self-supervised learning differ from supervised learning?

    -In self-supervised learning, the learner creates its own supervision by exploiting redundant signals from the environment, whereas in supervised learning, a model is trained using manually annotated data.

  • What is the role of redundancy in sensory signals in self-supervised learning?

    -Redundancy in sensory signals provides labels for training a predictive model by allowing the learner to predict one part of the signal from another, thus creating a learning target without external supervision.

  • What is a pretext task in the context of self-supervised learning?

    -A pretext task is a task that is not the final goal but is used to learn useful representations of the data. It is often a game-like challenge that the model must solve, which in turn helps it learn about the visual world.

  • What is pseudo-labeling and how does it work?

    -Pseudo-labeling is a semi-supervised learning algorithm where a classifier is first trained on labeled data, then used to predict labels for unlabeled data. These predicted labels, or pseudo-labels, are then used to retrain the classifier, often iteratively.

  • Why is pseudo-labeling effective when large quantities of data are available?

    -Pseudo-labeling is effective with large data sets because it leverages the unlabeled data to improve the classifier's predictions, leading to better performance, especially when combined with techniques like data augmentation.

  • What is the 'noisy student' approach mentioned in the video?

    -The 'noisy student' approach is a method where an initial model trained on labeled data infers pseudo-labels for unlabeled data, and then a higher capacity model is trained on these pseudo-labeled data, often with heavy use of data augmentation.

  • How does self-supervised learning relate to human perception development?

    -Self-supervised learning relates to human perception development by mimicking the way humans learn from their environment through exploration and interaction, without the need for explicit labeling.

  • What challenges are there in creating effective pretext tasks for self-supervised learning?

    -Creating effective pretext tasks can be challenging because they need to be designed carefully to ensure the model learns useful representations rather than exploiting unintended shortcuts or low-level signals.

Outlines

00:00

🤖 Introduction to Self-Supervised Learning and Pseudo-Labeling

This paragraph introduces the video's focus on self-supervised learning and pseudo-labeling, topics derived from lectures given at the University of Cambridge. The speaker outlines the structure of the video, starting with self-supervised learning and then moving to pseudo-labeling. The motivation for self-supervised learning is discussed in the context of machine perception, highlighting the limitations of supervised learning despite its successes. The speaker points out that even high-capacity models make mistakes and suggests looking to human developmental stages for inspiration. The learning strategies of human babies are examined, emphasizing incremental, social, physical, exploratory, language-based, and multi-modal learning. The challenges of implementing these strategies in machine learning are acknowledged, especially the practical barriers to embodied learning. The paragraph concludes by setting the stage for a discussion on multi-modal learning and self-supervised methods inspired by human learning.

05:01

๐Ÿ” The Concept of Self-Supervised Learning

The paragraph delves into the concept of self-supervised learning, which involves creating one's own supervision. It references historical insights by Helmholtz and Barlow on perception and redundancy in sensory signals. The importance of redundancy for learning new associations is emphasized, as it provides a predictable signal from which to learn. The discussion then turns to computational tricks, such as using minimum entropy coding to avoid a combinatorial explosion of storage for prior event probabilities. The paragraph also covers the evolution of self-supervised learning, from early work by de Sa to modern approaches like instance discrimination and momentum contrast. The latter, in particular, addresses the challenge of maintaining an up-to-date memory bank for effective learning. The paragraph illustrates how self-supervised learning leverages the redundancy in multimodal signals to train predictive models without external annotations.
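To make Barlow's storage argument concrete, here is the counting behind the trick in a short math sketch (the symbols $c$, $u$ and $n$ follow the usage above). With $n$ event types, testing pairwise associations directly requires storing the joint priors $P(c, u)$ for all pairs, on the order of $n^2$ values. If the representation makes events statistically independent, the chance co-occurrence rate follows from the marginals alone,

$$P(c, u) = P(c)\,P(u),$$

so only the $n$ marginal probabilities need to be stored, and a new association is flagged when the observed rate of $u$ following $c$ reliably exceeds the chance rate $P(c)P(u)$.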

10:02

🎲 Pretext Tasks in Self-Supervised Learning

This section explores pretext tasks used in self-supervised learning for computer vision. Pretext tasks are games or challenges that models must solve to learn about the visual world without explicit labeling. Examples include predicting the relative position of image patches, in-painting, solving jigsaw puzzles, colorization, and counting objects. The paragraph warns of the potential for models to 'cheat' by exploiting low-level signals rather than learning the intended high-level features, as seen in a study where models used chromatic aberration to solve tasks. The importance of carefully constructing pretext tasks is highlighted, as they can significantly influence the learning process. The paragraph also mentions various creative pretext tasks that have been developed for training deep neural networks effectively.
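As a concrete illustration of the relative-position pretext task mentioned above, here is a minimal sketch in PyTorch. The patch size, encoder architecture and eight-way label layout are illustrative assumptions, not the exact setup of Doersch et al.:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchPairNet(nn.Module):
    """Predicts which of 8 neighbouring positions patch B occupies relative to patch A."""
    def __init__(self, num_positions=8):
        super().__init__()
        self.encoder = nn.Sequential(              # shared across both patches
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(2 * 64, num_positions)

    def forward(self, a, b):
        return self.head(torch.cat([self.encoder(a), self.encoder(b)], dim=1))

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_pair(image, patch=32):
    """Crop the centre patch and a random neighbour; the neighbour index is the free label.
    Assumes `image` is a (3, H, W) tensor at least 3*patch pixels on each side."""
    _, h, w = image.shape
    cy, cx = h // 2, w // 2
    label = int(torch.randint(len(OFFSETS), ()))
    dy, dx = OFFSETS[label]
    def crop(y, x):
        return image[:, y - patch // 2:y + patch // 2, x - patch // 2:x + patch // 2]
    return crop(cy, cx), crop(cy + dy * patch, cx + dx * patch), label

# one training step: loss = F.cross_entropy(model(a[None], b[None]), torch.tensor([label]))

No labels appear anywhere: the supervisory signal (the neighbour index) is manufactured from the image itself.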

15:02

๐Ÿพ Advanced Techniques in Self-Supervised Learning

The paragraph discusses advanced techniques in self-supervised learning, focusing on instance discrimination and its extension, momentum contrast. Instance discrimination trains a model to uniquely encode each image instance, while momentum contrast addresses the issue of stale memory banks by using a queue of recent samples and a momentum encoder. The paragraph also touches on specialized downstream tasks like learning tracking models from colorization in videos and detecting object keypoints without supervision through viewpoint factorization. The innovative approaches demonstrate how self-supervised learning can be adapted to specific tasks beyond general image representation.
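A minimal sketch of the instance-discrimination objective with an InfoNCE-style loss, assuming PyTorch and that two augmented views of each image play the roles of query and key (the temperature value is an illustrative default):

import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    # queries, keys: (N, D) embeddings of two views of the same N images.
    # Row i of `keys` is the positive for row i of `queries`; every other
    # row serves as a negative, so the loss is a standard cross-entropy
    # over a matrix of scaled cosine similarities.
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(q.size(0))         # positives sit on the diagonal
    return F.cross_entropy(logits, targets)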

20:03

๐Ÿท๏ธ Pseudo-Labeling in Semi-Supervised Learning

The final paragraph shifts focus to pseudo-labeling within the realm of semi-supervised learning. Pseudo-labeling involves training a classifier on labeled data, using it to predict labels for unlabeled data (pseudo-labels), and then retraining the classifier on these pseudo-labels. The process can be iterated for improved performance. The paragraph provides an example of pseudo-labeling in word sense disambiguation and its effectiveness in large-scale image classification using ImageNet and JFT-300M datasets. The 'noisy student' technique is highlighted for its significant gains over traditional training methods. The paragraph concludes by emphasizing the growing value of pseudo-labeling as manual annotation struggles to keep pace with the vast amounts of sensory data being generated.
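A minimal sketch of the pseudo-labeling loop described above, assuming scikit-learn-style classifiers; the confidence threshold and number of rounds are illustrative choices:

import numpy as np
from sklearn.base import clone

def pseudo_label(model, X_lab, y_lab, X_unlab, threshold=0.9, rounds=3):
    # 1) fit on labeled data, 2) predict pseudo-labels for the unlabeled pool,
    # 3) keep only confident predictions and retrain; repeat.
    X, y = X_lab, y_lab
    for _ in range(rounds):
        clf = clone(model).fit(X, y)
        probs = clf.predict_proba(X_unlab)
        keep = probs.max(axis=1) >= threshold
        pseudo = clf.classes_[probs[keep].argmax(axis=1)]
        X = np.vstack([X_lab, X_unlab[keep]])
        y = np.concatenate([y_lab, pseudo])
    return clf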


Keywords

💡 Self-supervised learning

Self-supervised learning is a machine learning paradigm where the model generates its own labels from input data without human supervision. This concept is central to the video's theme, as it explores methods that allow machines to learn from data in a way that mimics human perception development. An example from the script highlights how self-supervised learning can be achieved by creating a plentiful supply of learning targets through movements and predictions about the environment.

💡 Pseudo-labeling

Pseudo-labeling, also known as self-training or self-labeling, is a semi-supervised learning technique where a model trained on labeled data is used to predict labels for unlabeled data, which are then treated as pseudo-labels for further training. The video discusses this method as a powerful tool, especially when large unannotated datasets are available, and illustrates its effectiveness in improving large-scale image classification performance.

💡 Multi-modal learning

Multi-modal learning refers to the process of learning from multiple types of sensory input data. In the context of the video, multi-modal learning is discussed as a human-inspired strategy that machines can adopt to improve their perception capabilities. The video suggests that by exploiting redundant signals across different modalities, machines can learn more robust representations of the world.

💡 Benchmarks

Benchmarks in the video refer to standardized tests or measurements used to evaluate the performance of machine learning models, particularly in computer vision. While benchmarks are valuable for objective comparison, the video points out that they may also lead to a focus on narrow research objectives and that real-world performance is the ultimate goal.

💡 Human perception system

The human perception system is the collective mechanisms by which humans interpret sensory information from the environment. The video discusses the limitations of current machine learning models in achieving human-like perception, suggesting that taking inspiration from early human developmental stages could improve machine perception.

💡 Redundancy in sensory signals

Redundancy in sensory signals is the presence of repeated or overlapping information in the data received by the senses. The video explains that redundancy is crucial for learning new associations, as it provides the necessary prior knowledge to detect patterns and make predictions about the environment, which is a key aspect of self-supervised learning.

💡 Pretext task

A pretext task is a secondary problem that a model is trained to solve, with the expectation that solving this task will lead to the learning of useful representations for the actual target task. The video provides several examples, such as predicting the relative position of image patches, which serve as a pretext for learning about objects and their relationships in the visual world.

💡 Instance discrimination

Instance discrimination is a method in self-supervised learning where models are trained to distinguish between individual instances of data, even within the same class. The video explains that this approach can lead to learning powerful image representations, as it encourages the model to capture fine-grained visual details that differentiate similar-looking objects.

💡 Momentum contrast (MoCo)

Momentum contrast is an extension of instance discrimination that addresses the challenges of maintaining an updated memory bank for all instances. The video describes how MoCo uses a queue of recently encoded samples and a momentum encoder to provide a more efficient and effective approach to instance discrimination, which in turn improves the model's ability to learn useful representations.

💡 Contextual redundancy

Contextual redundancy refers to the redundancy found in multimodal signals that comes from the context surrounding a piece of information. The video discusses how contextual redundancy can be leveraged for self-supervised learning, using examples from natural language processing where unlabeled text corpora provide low-level supervision through the context in which words appear.

Highlights

Self-supervised learning and pseudo-labeling are key areas of focus in advancing machine perception.

Supervised learning has made significant progress in machine perception but still has limitations.

Human perception development provides inspiration for improving machine learning.

Human babies learn incrementally in a dynamic environment with multi-modal experiences.

The challenge of embodied learning is highlighted by Alan Turing's observations from 1948.

Self-supervised learning aims to create its own supervision, inspired by human multi-modal learning.

Redundancy in sensory signals is crucial for learning, as noted by Barlow.

Predictive models can be trained using redundant signals to provide labels for learning.

Self-supervised learning can be operationalized by learning from redundant signals across modalities.

Modern self-supervised learning methods often involve pretext tasks that provide a context for learning.

Pseudo-labeling is a semi-supervised learning technique that uses predictions to train on unlabeled data.

Instance discrimination is a powerful mechanism for self-supervised learning.

Momentum Contrast (MoCo) is an extension to instance discrimination that addresses memory bank issues.

Self-supervised learning has been applied to specialized tasks like tracking and keypoint detection.

Pseudo-labeling has been successfully applied to large-scale image classification.

The effectiveness of pseudo-labeling is underscored by its ability to leverage unannotated data.

The future of pseudo-labeling is promising as manual annotation struggles to keep up with data creation.

Transcripts

00:00

Good day, everyone. In this video I will aim to provide a brief digest of material on self-supervised learning and pseudo-labeling. This material forms part of some lectures I gave in 2021 as part of the 4F12 lecture series at the University of Cambridge. To provide a brief outline for the video, we will start out with self-supervised learning before moving on to pseudo-labeling. We'll start out with the first topic.

00:35

Let's begin with the motivation for self-supervised learning, starting with a summary of the state of the nation for the world of machine perception. We have a number of reasons to be cheerful. Deep learning has given us remarkable progress with the supervised learning paradigm, in which we gather a large collection of data, manually annotate it, and supervise a model with the resulting annotated data. This has yielded truly major gains on vision benchmarks. Of course, benchmarks are not the goal in themselves, and they have drawbacks: they can trap you in a local minimum of research ideas. But they have clear value in allowing an objective comparison of different methods.

Still, we have some cause for concern, and somehow it seems that we might still have a long way to go. Even the highest-capacity models trained on the largest annotated datasets continue to make what we might call silly mistakes, mistakes that would never be made by a human. Perhaps more worryingly, it seems that we can just never get enough labelled data to get close to the human perception system. This state of affairs prompts a natural, age-old question: can we take inspiration from the early stages of development of human perception to improve things?

To answer this, it's worth considering the wealth of research that has studied human development in some detail. Human baby learning is incremental: a child learns in a continuously evolving environment rather than a stationary distribution. Social: babies learn from other humans around them, particularly caregivers. Physical: they offload knowledge to the physical world around them and store information with respect to their surroundings. Exploratory: once their curiosity is aroused, they try a lot of things to find something that works. Learning is language-based, allowing them not only to communicate but also to learn abstractions that support generalization. Finally, their learning is highly multi-modal, experiencing sensations from sight, sound, touch, taste, proprioception, balance and smell when these senses are available. Different modalities provide significant redundancy among their inputs to learn from.

03:02

It's also interesting that babies naturally build curricula for their learning. Careful analysis of the appearance of objects observed by babies shows that they follow a strong power law: a small number of objects are seen an incredibly large number of times, such that the child becomes an expert in recognizing and manipulating those objects.

Okay, this all sounds great, and many people have observed that our current machine perception systems are doing very little of these human-inspired learning strategies. Why is that?

03:38

I would say that the main barrier to implementing the kind of embodied perception learning exhibited in babies has been practical. In fact, this challenge was noticed by Alan Turing, who was interested in the development of such machines and observed in 1948 that "in order that the machine should have a chance of finding things out for itself it should be allowed to roam the countryside, and the danger to the ordinary citizen would be serious." The takeaway here is that there is some practical challenge to embodied learning, which is the general term given to an agent possessing a body that learns to interact with its environment. It's possible that recent developments in simulation may help us here, but we're still some way away from creating the world in sufficient fidelity to have fully addressed this problem.

For that reason, we will talk about a family of methods that tries to make progress principally on perhaps the safest of the human learning characteristics: multi-modal learning. In particular, we will discuss self-supervised methods that are partly inspired by human multimodal learning, in the sense of exploiting redundant signal. The essence of self-supervised learning, as you might infer from the name, is that the learner should create its own supervision.

05:00

This is an old idea that was nicely articulated by Helmholtz in his legendary 1878 speech on the facts in perception: "Each movement we make by which we alter the appearance of objects should be thought of as an experiment designed to test whether we have understood correctly the invariant relations of the phenomena before us, that is, their existence in definite spatial dimensions." So we can create a plentiful supply of learning targets by simply moving and checking whether our model of the world was able to predict what we will see.

05:36

The role of redundancy in sensory signals was then considered more explicitly by Barlow, who made the observation that learning requires previous knowledge. In particular, to detect a new association, such as event C preceding event U, we need to know the prior probabilities of C and U. If we have those, then we can learn a new association if we observe that C is followed by U more frequently than would be expected by chance. The key role of redundancy is that in order to know what usually happens, we need redundancy in the input signal, which could be, for example, sensory messages of the same event from different modalities. The redundant signal is, by definition, the part of the signal that can be predicted from the remaining signal. This redundant signal provides labels for training a predictive model.

One computational trick that is nicely illustrated in Barlow's work relates to the kinds of codes we can use to represent events in our environment. The observation is the following: to test whether two events are co-occurring more frequently than would happen by chance, we need to know their prior joint probabilities. So when learning pairwise associations between n events, we need to store n-squared co-occurrence probabilities. But if we've learned representations in which events C and U are statistically independent, we can compute the chance co-occurrence of C and U from the product of their marginals; that means we only need to store n event probabilities. This can help to avoid a combinatorial explosion of storage for prior event probabilities. Barlow himself suggested minimum entropy coding as a scheme to obtain factorial representations, but this idea applies more generally: it's always desirable to achieve this property where possible.

07:35

One of the first works to use the term "self-supervised" that operationalized this insight about learning from redundant signal was proposed by Virginia de Sa in the 1990s, who helpfully provided an intuitive bovine-based explanation. In supervised learning, each time we see an image of a cow as input, we are providing the corresponding cow label; but it is implausible to collect all the labels required for such a task. Unsupervised learning takes in the same cow image and seeks to learn a powerful representation of it without labels. In the self-supervised approach proposed here, labels are derived from co-occurring inputs across modalities, providing redundant signal. Learning then proceeds by minimizing the disagreement between class labels predicted from each modality, rather than matching predictions against an external annotation set. This is an idea that underlies a number of modern approaches. It's also worth noting that in the modern literature, the distinction between self-supervised and unsupervised learning as given here can become a little blurry.

08:50

A general way to think about the redundancy found in multimodal signals is that the redundancy comes from the context surrounding a piece of information. In the natural language processing community, unlabeled text corpora have been used for a long time to provide low-level supervision for neural networks, with the hope that the distributed representations learned by these models will enable them to generalize. One example of such models are autoregressive models, in which the joint probability of a sequence is factored into a product of conditional distributions, with each element in the sequence conditioned on the previous elements. A network can then be trained to maximize the likelihood of a text corpus under this factorization. For example, a network can be trained to predict the next character in text to enable compression, or to predict the next word to learn a language model.

A slightly more adventurous use of context was explored by the word2vec skip-gram model, which was trained to predict the collection of words surrounding any given word in the corpus. This is simple to train without labels: you simply pick a word in a sentence, mask its neighbours, and then try to predict those neighbours from the current word. This work was really a breakthrough in terms of language modelling performance, and highlighted the critical importance of having lots of training data in getting good word vectors. More recently, various multitask masking schemes have been considered. Perhaps the best known is BERT, which was trained to predict randomly masked words in a sequence, in addition to predicting the next sentence in a corpus. This work showed, amongst other things, the benefits of using a high-capacity transformer for language modeling.
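To make the skip-gram setup concrete, here is a small sketch in Python of how the (word, context) training pairs fall out of raw text with no annotation; the window size is an illustrative choice:

def skipgram_pairs(tokens, window=2):
    # each centre word must predict every neighbour inside the window;
    # the neighbours are the free supervisory signal
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

# skipgram_pairs("the cat sat on the mat".split())
# -> [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ...]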

10:48

It's perhaps a little less obvious how the same approach can be applied to computer vision. The strategy that has been developed is to train a neural network by tasking it with playing a game, which is often referred to as a pretext task. Now, typically we don't care about the performance of the model on the pretext task itself, but we hope that by solving it, a model learns good representations of the visual world. Work by Carl Doersch and colleagues illustrates this idea nicely. A network is shown two image patches; these image patches have been cropped from nearby regions in the same image. The pretext task for the model is to guess the relative position of the red dashed crop with respect to the blue crop. Here's an example: ask yourself where the red dashed crop belongs relative to the blue crop. Here's a second example; again, where does the red dashed crop belong relative to the blue crop? When you worked out that the answer to question one was probably bottom right and question two was probably top middle, you made use of your knowledge of buses and trains, despite almost certainly having never seen this particular bus or this particular train before. The key idea is that a model can only solve these questions once it learns about cats, buses and trains, and importantly, no labeling is required to construct this task.

This seems like a cute idea, and it is, but a warning is needed: sometimes the model won't solve the task in the way that you want it to. In this work, Doersch et al. found that the network can learn to cheat by exploiting a low-level signal: the chromatic aberration that results from a camera lens focusing different light wavelengths differently. One colour, typically green, is shrunk towards the image centre relative to the others. Once the network has figured this out, it can solve the problem trivially by determining the absolute locations of the patches relative to the lens, without learning anything at all about cats, buses, trains, or the host of other interesting objects we'd like it to know about. The authors solved this problem by randomly dropping colour channels from each patch, so that the network could not rely on this cue. But still, it provides a cautionary tale: constructing pretext tasks requires a great deal of care.

13:22

Partly because it works so well, and partly because it's a fun research problem, researchers have come up with a number of creative pretext tasks for training deep neural networks. Some examples include: training a network to in-paint, by removing patches of images and requiring the model to fill them in; requiring the network to solve jigsaw puzzles, by giving it a shuffled set of patches and asking it to rearrange them to match an image; and colorization, in which a model is given an image whose colour has been removed and then tasked with predicting the original colour version. There is also training a model to count. This is a slightly ingenious idea: the network is given image patches and has to count objects in those patches, in such a way that when applied to the full image, the total object count matches the sum of the object counts across each patch. Other work has trained a model to invert a GAN, by training a second GAN to generate the latent codes of the original GAN. One particularly appealing pretext task is to train a model to group pixels according to optical flow: the idea here is to train a model that learns pixel embeddings that are similar if and only if those pixels tend to move with the same velocity in videos, which is a way of encoding the prior knowledge that the pixels on a common object tend to move together. Clustering has also proven to be highly effective: here, a network alternates between clustering its own feature space and classifying the cluster membership of each image. Finally, rotation prediction can be used to exploit human photographer bias: we tend to photograph things the right way up, so we can rotate images and ask the model to predict the rotation that has been applied. This is also surprisingly powerful.
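As a sketch of the counting idea just described, assuming PyTorch: `net` maps an image batch to a vector of soft per-image counts and is assumed to accept both full images and half-size tiles (e.g. via global pooling); the hinge term is a simplified stand-in for the paper's contrastive term, not its exact form:

import torch
import torch.nn.functional as F

def counting_loss(net, image, other_image, margin=1.0):
    # split the image into four tiles; their predicted counts should sum
    # to the full-image count, while counts for an unrelated image should
    # stay far from that sum (otherwise "always predict zero" is optimal)
    h, w = image.shape[-2:]
    tiles = [image[..., y:y + h // 2, x:x + w // 2]
             for y in (0, h // 2) for x in (0, w // 2)]
    tile_sum = sum(net(t) for t in tiles)
    match = F.mse_loss(tile_sum, net(image))
    push = F.relu(margin - (tile_sum - net(other_image)).pow(2).sum(dim=1)).mean()
    return match + push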

15:20

I'd like to talk in a little more detail about a task formulation that has emerged as one of the most powerful mechanisms for self-supervised learning, namely instance discrimination. One motivation for considering this idea stems from the observation that, despite training with semantic labels, fully supervised convolutional neural networks also appear to capture visual similarity between instances. Given an input image of a leopard, the classification scores of a fully supervised model are strongest for the class of leopard, but also for highly visually similar classes like jaguar and cheetah, and much less so for classes like lifeboat, shopping cart and bookcase. An interesting question is then whether the same property would emerge if we trained a model to discriminate between individual instances rather than semantic classes. One work, from Wu et al., explored this idea. It trained a CNN to map all image instances of a dataset to 128-dimensional vectors, storing them in a data structure called a memory bank. The model was trained to encourage each instance to be mapped to a different location on a 128-dimensional unit sphere, such that each image, when encoded with the CNN, would uniquely retrieve its memory bank vector. This model can be trained without labels, but empirically it nevertheless learns strong image representations.

16:56

Momentum contrast considered an extension to this approach. The motivation behind this work was that instance discrimination works well, but memory banks have an issue. On the one hand, recomputing the features stored in the bank, i.e. one feature per image in the dataset, for every update to the CNN parameters would be prohibitively expensive. On the other, if memory bank instances are not regularly updated, they grow increasingly stale with every optimization step, which is sub-optimal for instance discrimination. MoCo, or momentum contrast, aims to avoid this staleness issue by first replacing the memory bank with a queue of recently encoded samples, fewer than the full dataset, and encoding queue samples with a momentum encoder, which is formed from a slow-moving average of the query encoder weights. MoCo uses some additional terminology: keys refer to instances encoded in the queue with the momentum encoder; queries are instances to be compared against keys; positive pairs are queries and keys originating from the same image, and by extension, negative pairs are queries and keys originating from different images. The instance discrimination task is then to match up queries against the keys that represent their positive pairs, using what's known as an InfoNCE loss, which is essentially a standard cross-entropy softmax. Finally, the resulting query encoder then provides a useful representation for downstream tasks.
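A minimal sketch of the two mechanisms just described, assuming PyTorch: the momentum (slow-moving-average) update of the key encoder, and the fixed-size queue that stands in for the full memory bank; the queue size and momentum value are illustrative defaults:

import torch

@torch.no_grad()
def momentum_update(query_enc, key_enc, m=0.999):
    # the key encoder trails the query encoder as a slow moving average
    # and receives no gradient updates of its own
    for q, k in zip(query_enc.parameters(), key_enc.parameters()):
        k.mul_(m).add_(q, alpha=1.0 - m)

class KeyQueue:
    # FIFO of recently encoded keys: far smaller than a full memory bank,
    # and always encoded by a recent version of the key encoder
    def __init__(self, dim, size=4096):
        self.keys = torch.randn(size, dim)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, new_keys):          # new_keys: (B, dim), B divides size
        b = new_keys.shape[0]
        self.keys[self.ptr:self.ptr + b] = new_keys
        self.ptr = (self.ptr + b) % self.keys.shape[0]

Queries are then scored against the current queue contents with the InfoNCE loss sketched earlier.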

18:33

The methods we've discussed so far have focused principally on learning general image representations. However, the ideas behind self-supervised learning have also been applied to tackling more specialized downstream tasks. One lovely piece of work by Vondrick and collaborators showed how to learn a tracking model by performing colorization. The key idea is to use colours across unlabeled videos as a source of supervision. The model is given a black-and-white frame that it needs to colorize, and a reference black-and-white frame. At each location in the input frame, it is tasked with pointing to the location in the reference frame that contains the same colour as the current location. This colour is copied across to the input frame from the colour version of the reference frame, so that a loss can be applied at the same place in the true colours of the current frame. In practice, this can be implemented by training a CNN that produces a low-dimensional embedding at each location of an image, then performing pointing from the target frame to the reference frame by simply comparing the similarities of the embeddings at each location in each frame. Given enough data, the model indeed learns to solve this task, and the authors show that without any labels it gains the ability to track objects across frames.
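A sketch of the pointing mechanism described above, assuming PyTorch; embeddings are taken as flattened per-pixel vectors and colours as per-pixel RGB, so treat this as the shape of the idea rather than the published model's exact loss:

import torch

def copy_colours(target_emb, ref_emb, ref_colours, temperature=0.5):
    # target_emb, ref_emb: (N, D) per-pixel embeddings of the grayscale
    # target and reference frames (spatial grid flattened to N pixels);
    # ref_colours: (N, 3). Each target pixel points at reference pixels
    # via a softmax over embedding similarities and copies a weighted
    # average of their colours; the same pointer weights act as a tracker.
    sims = target_emb @ ref_emb.t() / temperature   # (N, N)
    attn = sims.softmax(dim=1)
    return attn @ ref_colours                       # predicted (N, 3) colours

Training then penalizes the difference between the predicted colours and the true colours of the target frame.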

19:53

Another direction has considered learning object keypoints without supervision. The idea, shown here for cat faces, was to learn a model that would detect consistent locations on an object without any labels. The approach here was to use a concept called viewpoint factorization. The essence of this idea is that if you had a good keypoint detector, it should fire on the same point of the cat's face even as the cat's face moves around. In practice, that can be encouraged by enforcing what's known as equivariance: as the image is translated, the keypoint is also translated by the same amount. Since we have no annotations, just images of cat faces, these translations and other geometric transformations are generated via synthetic image warps, and a loss is used to ensure that the keypoints move consistently with the warps.
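A sketch of the equivariance constraint just described, assuming PyTorch; `detect`, `warp_image` and `warp_points` are hypothetical stand-ins for the keypoint network and a synthetic warp applied to pixels and to coordinates respectively:

import torch.nn.functional as F

def equivariance_loss(detect, image, warp_image, warp_points):
    # a good detector commutes with the warp: detecting keypoints on the
    # warped image should agree with warping the keypoints detected on
    # the original image
    kp = detect(image)                    # (K, 2) keypoint coordinates
    kp_warped = detect(warp_image(image))
    return F.mse_loss(kp_warped, warp_points(kp))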

20:48

Another direction has sought to learn powerful video representations with a simple idea. A model takes in several video clips, one of which has had its frames shuffled. Training a model to predict which clip was shuffled learns representations that are particularly useful for action recognition.

21:09

We now turn to pseudo-labeling. I'd like to talk briefly about semi-supervised learning and the pseudo-labeling algorithm. Semi-supervised learning considers the setting in which a learner assumes access to both labeled and unlabeled data during training. Typically, to provide benefit, the unlabeled data is assumed to be significantly larger than the labeled data. Pseudo-labeling, which is sometimes also referred to as self-training or self-labeling, is a term that refers to some variation of the following algorithm. First, a classifier is trained on the labeled data. Next, the classifier is used to predict the labels of the unlabeled data; these labels are referred to as pseudo-labels. The classifier is then retrained on the pseudo-labels. Often this process is iterated by regenerating a new set of pseudo-labels, retraining, generating new pseudo-labels, and so on.

A nice illustration of this algorithm is provided by the work of Yarowsky, who focused on the task of word sense disambiguation across a full corpus. Here the task is to determine the sense in which a word is meant: for example, the word "plant" may refer to a manufacturing plant or a live plant. The algorithm proceeds by obtaining an initial small collection of labelled samples and then using them to train a classifier. Then it predicts labels for unlabeled sequences, keeping those that have high confidence and optionally filtering and expanding the labeled sets via some NLP heuristics. This stage is then repeated until convergence to a final state. The reason for mentioning this algorithm is twofold: one, it's a pleasingly simple algorithm; two, when large quantities of data are available, it works remarkably well in a wide range of settings. David Yarowsky's work noted that it thrives on unannotated monolingual corpora: the more the merrier.

23:24

As an example application in computer vision, pseudo-labeling was applied to improving large-scale image classification performance by taking ImageNet as a source of labelled images and JFT-300M as a source of unlabeled images. The method, referred to as noisy student, trained an initial model on the labelled data, inferred pseudo-labels, and then retrained a higher-capacity model on the pseudo-labeled data, making heavy use of data augmentation, before repeating. Noisy student led to significant gains over ImageNet-only training, highlighting the effectiveness of this technique. More broadly, I think that pseudo-labeling algorithms are likely to prove increasingly valuable in future, as manual annotation simply cannot keep up with the scale of sensory datasets that are now being created.

We've reached the end. Thank you for your attention.


Related Tags

Self-Supervised Learning · Pseudo-Labeling · AI Perception · Machine Learning · Deep Learning · Human Development · Multimodal Learning · Pretext Tasks · Instance Discrimination · Semi-Supervised Learning