MIT 6.S191 (2018): Issues in Image Classification

Alexander Amini
8 Feb 2018 · 17:18

Summary

TL;DR: The speaker discusses advances in deep learning for image classification, highlighting the impressive error-rate reductions of recent years. However, they point out the limitations of current models, using examples in which a model trained on the Open Images dataset misclassifies images because of biases in the training data. The talk emphasizes the importance of considering differences between training and inference distributions and the potential societal impact of machine learning models. The speaker concludes by advocating awareness of these issues and provides resources on machine learning fairness.

Takeaways

  • 📈 The error rate in image classification on ImageNet has significantly decreased over the years, with contemporary results showing impressive accuracy.
  • 🌐 The Open Images dataset, with 9 million images and 6,000 labels, is a more complex and diverse dataset compared to ImageNet.
  • 🤖 Deep learning models may sometimes fail to recognize human presence in images, indicating that image classification is not entirely solved.
  • 🔍 In machine learning terms, a stereotype can be framed as a label learned from correlations in the training set that are not causally related to the outcome.
  • 🌍 Geo-diversity in datasets is crucial; Open Images dataset is predominantly from North America and Europe, lacking representation from other regions.
  • 🔄 The assumption in supervised learning that training and test distributions are identical is often not true in real-world applications, which can lead to biased models.
  • 📊 It's important to consider the societal impact of machine learning models, especially when they are based on societally correlated features.
  • 📚 Understanding and addressing machine learning fairness issues is becoming increasingly important as these models become more prevalent in everyday life.
  • 🔗 Additional resources on machine learning fairness are available, including papers and interactive exercises to explore the topic further.
  • 💡 The speaker emphasizes the importance of not just focusing on improving accuracy but also on the societal implications of machine learning models.

Q & A

  • What is the main topic of the talk?

    -The main topic of the talk is the issues with image classification using deep learning, particularly focusing on the challenges and biases in datasets like ImageNet and Open Images.

  • How has the error rate in image classification changed over the years?

    -The error rate in image classification has significantly reduced over the years, with contemporary results showing an error rate of around 2.2%, which is considered astonishing compared to the 25% error rate in 2011.

  • What is the difference between ImageNet and Open Images datasets?

    -ImageNet has around 1 million images with 1,000 labels, while Open Images has about 9 million images with 6,000 labels, and it supports multi-label classification.

  • Why did the deep learning model fail to recognize a bride in one of the images discussed in the talk?

    -The model failed because it was trained on a dataset that did not represent global diversity well, particularly lacking data from regions like India, leading to a biased model that missed the presence of a human in the image.

  • What is a stereotype in the context of machine learning?

    -In the context of machine learning, a stereotype is a statistical confounder that has a societal basis, which may lead to biased models picking up on correlations that are not causally related.

  • What is the importance of considering the training and inference distributions in machine learning models?

    -Considering the training and inference distributions is crucial to ensure that the model's performance is not just limited to the training data but also generalizes well to new, unseen data in the real world.

  • How does the geolocation diversity in the Open Images dataset compare to global diversity?

    -The Open Images dataset is not representative of global diversity, as the majority of the data comes from North America and a few European countries, with very little data from places like India or China.

  • What is the potential issue with using a feature like shoe type in a machine learning model?

    -Using a feature like shoe type can lead to biased predictions if it is strongly correlated with the target variable in the training data but not necessarily causally related, which may not generalize well to the real world.

  • What is the speaker's advice for individuals interested in machine learning fairness?

    -The speaker advises being aware of differences between training and inference distributions, asking questions about confounders in the data, and not just focusing on improving accuracy but also considering the societal impact of the models.

  • What resources does the speaker recommend for further understanding of machine learning fairness?

    -The speaker recommends a website with a collection of papers on machine learning fairness, as well as interactive exercises on adversarial debiasing to explore the topic further.

Outlines

00:00

🌟 Introduction to Deep Learning and Image Classification

The speaker begins by introducing themselves as being based in the Cambridge office and discusses their work with deep learning at Google Brain. They mention the significant progress in image classification accuracy over the years, particularly on the ImageNet dataset, and the transition from human-level performance to surpassing it. The speaker also introduces the topic of the talk, the challenges and stereotypes in image datasets, and previews the handoff to their colleague Shanqing Cai, who will discuss the TensorFlow debugger and eager mode.

05:02

🤔 Stereotypes in Machine Learning

The speaker delves into the concept of stereotypes in machine learning, using an interactive exercise to define stereotypes as labels based on experiences within a training set. They discuss the potential issues with relying on features that are correlated but not causal, using a hypothetical dataset about running shoe types and race completion risk as an example. The speaker emphasizes the importance of considering whether features in the training data are truly predictive or just correlated, and the societal implications of stereotypes in machine learning models.

10:02

🌐 Global Diversity in Image Data Sets

The speaker addresses the issue of global diversity in image datasets, specifically the Open Images dataset, which is found to be heavily skewed towards North America and certain European countries. They highlight the importance of the training distribution matching the inference distribution to ensure fair and accurate predictions. The speaker also discusses the societal factors that can lead to confounding in data, such as internet connectivity, and the need for machine learning practitioners to be aware of these issues to avoid perpetuating stereotypes.
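
To make this concrete, here is a minimal sketch (an assumption; the talk shows plots, not code) of how one might quantify that geographic skew, given hypothetical per-image country metadata, and compare it against where the model will actually be used:

```python
# Minimal sketch: estimate the geographic skew of an image dataset from
# hypothetical per-image country codes and compare it with a reference
# distribution for where the model is expected to run at inference time.
from collections import Counter

def region_shares(country_codes):
    """Return {country: fraction of images} for a list of country codes."""
    counts = Counter(country_codes)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Hypothetical inputs: country of origin for each training image, and an
# estimate of where inference traffic will come from.
train_countries = ["US", "US", "GB", "US", "DE", "FR", "US", "CA", "IN"]
inference_countries = ["IN", "US", "CN", "BR", "NG", "IN", "US", "ID", "CN"]

train_share = region_shares(train_countries)
inference_share = region_shares(inference_countries)

# Flag countries that are common at inference time but rare in training.
for country, share in sorted(inference_share.items(), key=lambda kv: -kv[1]):
    gap = share - train_share.get(country, 0.0)
    if gap > 0.05:  # arbitrary threshold, for illustration only
        print(f"{country}: {share:.0%} of expected traffic vs "
              f"{train_share.get(country, 0.0):.0%} of training data")
```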

15:05

📚 Resources on Machine Learning Fairness

The speaker concludes by providing resources on machine learning fairness, including a newly launched website with papers and exercises on the topic. They mention an adversarial debiasing exercise that aims to prevent a network from picking up unwanted correlations. The speaker encourages the audience to consider the broader impact of their work in machine learning and to be mindful of fairness issues, especially as machine learning models become more integrated into everyday life.

Keywords

💡Deep Learning

Deep learning is a subset of machine learning that involves neural networks with many layers, allowing the computer to learn and make decisions or predictions based on large amounts of data. In the video, the speaker discusses the application of deep learning in image classification tasks, such as those found in the ImageNet dataset.

💡Image Classification

Image classification is the process of assigning labels to images based on their content. The video script mentions the ImageNet challenge, where deep learning models are trained to classify millions of images into thousands of categories. The speaker also discusses the limitations of image classification models when applied to more diverse datasets like Open Images.
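
To reproduce the flavor of this experiment, a hedged sketch follows. It uses the stock ImageNet-pretrained InceptionV3 from tf.keras rather than the Open-Images-trained model discussed in the talk, and the image file name is a placeholder:

```python
# Classify an arbitrary web image with an ImageNet-pretrained Inception model
# and print the top labels it returns (illustration only; not the talk's model).
import numpy as np
import tensorflow as tf

model = tf.keras.applications.InceptionV3(weights="imagenet")

# "wedding_photo.jpg" is a hypothetical local file.
img = tf.keras.preprocessing.image.load_img("wedding_photo.jpg", target_size=(299, 299))
x = tf.keras.preprocessing.image.img_to_array(img)
x = tf.keras.applications.inception_v3.preprocess_input(x[np.newaxis, ...])

preds = model.predict(x)
for _, label, score in tf.keras.applications.inception_v3.decode_predictions(preds, top=5)[0]:
    print(f"{label}: {score:.2f}")
```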

💡Stereotypes

Stereotypes, in the context of the video, refer to generalizations or assumptions made based on the data present in the training set of a machine learning model. The speaker argues that stereotypes can lead to models making incorrect assumptions, such as misclassifying images based on irrelevant features that are statistically correlated but not causally related to the classification task.

💡Open Images

Open Images is a large-scale annotated image dataset that contains millions of images with multiple labels, designed to be more diverse than ImageNet. The video script highlights the dataset's limitations due to its lack of geographical diversity, which can lead to biased models that perform poorly on images from underrepresented regions.
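
The talk does not show how such a model is built, but a common way to handle Open Images' multi-label setup is one sigmoid output per label trained with binary cross-entropy, so several labels can fire on the same image; a rough sketch, with the backbone and label count as assumptions:

```python
# Multi-label head sketch: independent sigmoids let "person" and "rugby ball"
# both be predicted for one image, unlike a softmax that picks a single class.
import tensorflow as tf

NUM_LABELS = 6000  # the Open Images base label count mentioned in the talk

backbone = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                             weights=None)  # weights omitted for brevity
outputs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid")(backbone.output)
model = tf.keras.Model(backbone.input, outputs)

# Binary cross-entropy treats each label as its own yes/no question.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])
```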

💡TensorFlow Debugger

TensorFlow Debugger is a tool used for debugging TensorFlow programs, which are often used for machine learning tasks. The speaker mentions that a colleague will discuss using TensorFlow Debugger and eager mode to make working with TensorFlow easier, although the specifics are not detailed in the script.

💡Inference Time Performance

Inference time performance refers to how well a machine learning model performs when it is used to make predictions on new, unseen data. The speaker emphasizes the importance of ensuring that the training data distribution matches the inference distribution to avoid biases and ensure fair and accurate predictions.
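
One practical way to act on this, not something shown in the talk, is to monitor whether inference-time inputs drift away from the statistics of the training set; a rough sketch with invented numbers:

```python
# Compare summary statistics of a recent batch of inference inputs against
# per-feature (mean, std) values saved from the training set, and flag
# features whose live mean has moved far away.
import numpy as np

def drift_report(train_stats, live_batch, z_threshold=3.0):
    """Return (feature index, train mean, live mean, z-score) for drifted features."""
    flagged = []
    for i, (mu, sigma) in enumerate(train_stats):
        live_mu = float(np.mean(live_batch[:, i]))
        z = abs(live_mu - mu) / (sigma + 1e-8)
        if z > z_threshold:
            flagged.append((i, mu, live_mu, z))
    return flagged

# Hypothetical numbers: two features, with the second drifting upward.
train_stats = [(0.0, 1.0), (5.0, 2.0)]
live_batch = np.array([[0.1, 11.0], [-0.2, 10.5], [0.0, 12.0]])

for idx, train_mu, live_mu, z in drift_report(train_stats, live_batch):
    print(f"feature {idx}: train mean {train_mu}, live mean {live_mu:.1f} (z={z:.1f})")
```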

💡Statistical Confounder

A statistical confounder is a variable that is correlated with both the independent variable (features) and the dependent variable (outcome) in a study, potentially leading to incorrect conclusions about causality. In the video, the speaker uses the term to describe features in machine learning models that may be correlated with the desired outcome but are not causally related, which can lead to stereotypes.
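
A synthetic sketch of this effect, mirroring the talk's running-shoe example with invented data and a standard scikit-learn classifier, is shown below: the confounded feature looks highly predictive on the biased training sample and stops helping once that correlation is absent.

```python
# Toy confounder demo: "expensive shoes" perfectly tracks the label in the
# biased training sample but is unrelated to it in the wider population.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

def make_data(shoe_is_confounded):
    race_km = rng.uniform(5, 42, n)
    weekly_km = rng.uniform(10, 100, n)
    # True (noisy) risk depends only on race length vs weekly training volume.
    high_risk = (2 * race_km + rng.normal(0, 15, n) > weekly_km).astype(int)
    if shoe_is_confounded:
        shoes = 1 - high_risk              # perfect proxy in the biased sample
    else:
        shoes = rng.integers(0, 2, n)      # unrelated in the wider population
    return np.column_stack([race_km, weekly_km, shoes]), high_risk

X_train, y_train = make_data(shoe_is_confounded=True)
X_infer, y_infer = make_data(shoe_is_confounded=False)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("accuracy on the biased training sample:", clf.score(X_train, y_train))
print("accuracy when the correlation is gone: ", clf.score(X_infer, y_infer))
```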

💡Fairness in Machine Learning

Fairness in machine learning refers to the development of algorithms and models that do not discriminate against certain groups or individuals. The speaker discusses the importance of being aware of potential biases in datasets and models, and provides resources for further exploration of machine learning fairness.

💡Adversarial Debiasing

Adversarial debiasing is a technique used to reduce bias in machine learning models by training an additional output head to predict a characteristic that should not influence the model's predictions. The main network is penalized when this extra head predicts the characteristic well, encouraging its internal representation to ignore unwanted correlations and biases.
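
The talk only describes the idea; the TensorFlow sketch below is one plausible way to wire it up (a shared encoder, a task head, an adversary head, and a tunable penalty weight), and it is not the Colab exercise the speaker refers to.

```python
# Adversarial debiasing sketch: the encoder and task head are trained to do
# well on the task while making the adversary's job harder; the adversary is
# trained separately so it keeps probing the shared representation.
import tensorflow as tf

encoder = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu")])
task_head = tf.keras.layers.Dense(1, activation="sigmoid")       # label we care about
adversary_head = tf.keras.layers.Dense(1, activation="sigmoid")  # sensitive attribute

main_opt = tf.keras.optimizers.Adam(1e-3)
adv_opt = tf.keras.optimizers.Adam(1e-3)
bce = tf.keras.losses.BinaryCrossentropy()
LAMBDA = 1.0  # strength of the debiasing penalty (a tunable assumption)

def train_step(x, y_task, y_sensitive):
    # 1) Update encoder + task head: low task loss, high adversary loss.
    with tf.GradientTape() as tape:
        z = encoder(x, training=True)
        task_loss = bce(y_task, task_head(z))
        adv_loss = bce(y_sensitive, adversary_head(z))
        main_loss = task_loss - LAMBDA * adv_loss
    main_vars = encoder.trainable_variables + task_head.trainable_variables
    main_opt.apply_gradients(zip(tape.gradient(main_loss, main_vars), main_vars))

    # 2) Update the adversary alone so it keeps trying to recover the attribute.
    with tf.GradientTape() as tape:
        adv_loss = bce(y_sensitive, adversary_head(encoder(x, training=True)))
    adv_vars = adversary_head.trainable_variables
    adv_opt.apply_gradients(zip(tape.gradient(adv_loss, adv_vars), adv_vars))
    return task_loss, adv_loss
```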

💡Training and Test Distributions

In machine learning, the training distribution refers to the data used to train the model, while the test distribution is the data used to evaluate the model's performance. The speaker warns against the assumption that these distributions are identical, as real-world applications may require the model to perform well on data that is different from the training set.

Highlights

Deep learning advancements have significantly reduced error rates in image classification.

ImageNet dataset has seen a remarkable improvement in classification accuracy over the years.

Open Images dataset, with 9 million images and 6,000 labels, presents a more complex challenge than ImageNet.

Inception-based models may not always capture human elements in images, raising questions about the effectiveness of image classification.

Stereotypes in machine learning can be based on features that are correlated but not causal.

The assumption in supervised machine learning is that training and test distributions are identical, which may not hold in real-world applications.

The importance of matching the training distribution to the inference distribution to ensure fair and effective model performance.

Open Images dataset's geolocation data reveals a lack of global diversity, with a heavy bias towards North America and certain European countries.

The potential societal impact of machine learning models based on societally correlated features.

The concept of stereotypes as statistical confounders with a societal basis.

The need for awareness of differences between training and inference distributions in machine learning applications.

The importance of not just optimizing for accuracy but also considering the societal implications of machine learning models.

Resources for further exploration of machine learning fairness, including papers and interactive exercises.

The use of adversarial techniques to reduce unwanted correlations and biases in deep learning models.

The speaker's emphasis on the need for a balanced approach to machine learning that includes fairness and societal impact considerations.

The speaker's call to action for the audience to engage with resources on machine learning fairness.

Transcripts

00:02

Thanks for having me here. I'm based in the Cambridge office, which is about a hundred meters that way, and we do a lot of work with deep learning; we've got a large group in Google Brain and related fields, so hopefully that's interesting to some of you at some point. I'm going to talk for about 20 minutes on this theme of issues in image classification, and then I'll hand it over to my excellent colleague Shanqing Cai, who will go through an entirely different subject: using the TensorFlow debugger and eager mode to make working in TensorFlow easier.

00:44

So let's take a step back. If you've seen happy graphs like this before, go ahead and smile and nod. This is a happy graph of ImageNet-based image classification. ImageNet is a dataset of a million-some-odd images; for this challenge there were a thousand classes. In 2011, back in the dark ages when nobody knew how to do anything, the state of the art was something like a 25% error rate. In the last six or seven years, the reduction in error rate has been so astounding that it has been talked about so much it's no longer even surprising. Human error rate is somewhere between five and ten percent on this task, so the contemporary results of around 2.2% error are really kind of astonishing. You can look at a graph like this and make the reasonable claim that machines using deep learning are better than humans at image classification on this task. That's kind of weird and kind of amazing, and maybe we can declare victory and fill audiences with people clamoring to learn about deep learning.

02:06

So I'm going to talk not about ImageNet itself but about a slightly different image dataset. People said, okay, ImageNet is obviously too easy, let's make a larger, more interesting dataset, and so Open Images was released, I think a year or two ago. It has about 9 million images as opposed to 1 million, and the base dataset has 6,000 labels as opposed to 1,000. It's also multi-label, so if there's a person holding a rugby ball, you get both "person" and "rugby ball" in the dataset. It has all kinds of classes, including stairs, lovingly illustrated here, and you can find it on GitHub; it's a nice dataset to play around with.

02:48

Some colleagues and I did some work asking what happens if we apply a straightforward Inception-based model to this data. We trained it up and then looked at how it classifies some images that we found on the web. Here's one such image; all the images here are Creative Commons, so it's fine for us to look at them. When we apply an ImageNet-style model to it, the classifications we get back are roughly what I personally would expect: bride, dress, ceremony, woman, wedding, all things that, as an American in this country at this time, I think make sense for this image. Cool, maybe we did solve image classification. Then we applied it to another image, also of a bride, and the model we had trained on this open-source image dataset returned the following classifications: clothing, event, costume, red, and performance art. No mention of bride, and no mention of person at all, regardless of gender. In a sense, this model missed the fact that there's a human in the picture, which is maybe not awesome, and not really what I would think of as great success if we're claiming that image classification is solved.

04:26

Okay, so what's going on here? I'm going to argue that what's going on is based, to some degree, on the idea of stereotypes. If you have your laptop open, I'd like you to close it for a second; this is the interactive portion, where you interact by closing your laptop. Find somebody sitting next to you and exercise your human conversation skills for about one minute to come up, between the two of you, with a definition of what a stereotype is, keeping in mind that we're in a statistical setting. If there's no one sitting next to you, you may move. Ready, set, go.

05:20

Three, two, one, and thank you for having that interesting conversation, which easily could have lasted much more than a minute, but such is life. Let's hear from one or two folks who came up with something interesting. One suggestion is that a stereotype is a generalization that you find from a large group of people and then apply to more people; interesting, and I certainly agree with large parts of that. Another claim is that it's a label based on experience from within your training set, the probability of a label given your training data; super interesting. Maybe one more: the claim here is that a stereotype has something to do with unrelated features that happen to be correlated. I think that's interesting, and no, Constantine was not a plant, but I do want to look at this idea in a little more detail.

06:53

Here's a dataset that I'm going to claim is based on running data; in the early mornings I pretend that I'm an athlete and go for a run. It's based on the risk that someone might not finish a race they've entered: high-risk people are in yellow and lower-risk people are in red. Looking at this data, it has a couple of dimensions, and I might fit a linear classifier; it's not quite perfect. If I look a little more closely, I actually have some more information here: I don't just have x and y, I also have the color of the outline around each point. So I might have a rule that if a data point has a blue outline I predict low risk, and otherwise I predict high risk. Fair enough.

07:48

Now the big reveal: you'll never guess what the outline feature is based on. Shoe type. The x and y are based on how long the race is and what the person's weekly training volume is, but whether you're foolish enough to buy expensive running shoes because you think they'll make you faster, that's what's in the data. In traditional supervised machine learning we might say, wait a minute, I'm not sure shoe type is actually going to be predictive. On the other hand, it's in our training data, and it does seem to be awfully predictive on this dataset. We have a really simple, highly regularized model, and it still gives near-perfect accuracy; maybe it's fine. The only way we can find out that it's not, I would argue, is by gathering more data. I'll point out that this dataset has been diabolically constructed so that some points in the data space are not particularly well represented; you can tell yourself a story that the data was collected after some corporate 5K or something like that. If we collect more data, we may find that there are actually people wearing all kinds of shoes on both sides of our imaginary classifier, and that the shoe-type feature is really not predictive at all. This gets back to Constantine's point: relying on features that are strongly correlated but not necessarily causal may be exactly the point at which we're dealing with a stereotype. Given this data and what we know now, I would go back and suggest that a linear classifier based on the features of race length and weekly training volume is potentially a better model.

09:51

So how does this happen? What's the issue at play? One of the issues is that in supervised machine learning we often assume that our training distribution and our test distribution are identical. We make this assumption for a really good reason: if we make it, then we can pretend there's no difference between correlation and causation, and we can use all of our features, whether or not they're what Constantine would call meaningful or causal, throw them in, and so long as the test and training distributions are the same, we're probably okay to within some degree. But in the real world we don't just apply models to a training or test set; we also use them to make predictions that may influence the world in some way. There, I think the right phrase isn't so much "test set" as "inference-time performance," because at inference time, when we apply our model to some new instance in the world, we may never actually know what the true label is. We still care very much about having good performance, so making sure that our training set matches our inference distribution to some degree is absolutely critical.

11:17

So let's go back to Open Images and what was happening there. You'll recall that it did quite badly, at least anecdotally, on that image of a bride who appeared to be from India. If we look at the geo-diversity of Open Images (we did our best to track down the geolocation of each image in the dataset), we found that an overwhelming proportion of the data came from North America and six countries in Europe. Vanishingly small amounts came from countries such as India or China, places where, I've heard, there are actually quite a lot of people. So this is clearly not representative, in any meaningful way, of the global diversity of the world. How does this happen? It's not as if the researchers who put together Open Images were in any way ill-intentioned; they worked really hard to put together what they believed was a more representative dataset than ImageNet. At the very least, it doesn't have a hundred categories of dogs. So what happened? You could argue that the distribution of Open Images is strongly correlated with the distribution of countries that have high-bandwidth, low-cost internet access. It's not a perfect correlation, but it's pretty close. And if one bases an image classifier on data drawn from areas with high-bandwidth, low-cost internet access, that may induce differences between the training distribution and the inference-time distribution. None of this is something you couldn't figure out if you sat down for five minutes; it's all super basic statistics, the kind of thing statisticians have been railing at the machine learning community about for decades. But as machine learning models become more ubiquitous in everyday life, I think paying attention to these kinds of issues becomes ever more important.

13:37

So let's go back to what a stereotype is. I think I agree with Constantine's idea, and I'm going to add one more tweak to it: I'm going to say that a stereotype is a statistical confounder (using Constantine's language almost exactly) that has a societal basis. When I think about issues of fairness: if rainy weather is correlated with people using umbrellas, then yes, that's a confounder, and the umbrellas did not cause the rain, but as an individual human I'm not that worried about the societal impact of models based on it. Modulo some crazy, scary scenario you could imagine, I don't think that's as large an issue. But when we think of things like internet connectivity, or other societally based factors, I think paying attention to the questions "do we have confounders in our data?" and "are they being picked up by our models?" is incredibly important.

14:42

If you take away nothing else from this short talk, I hope you take away a caution to be aware of differences between your training and inference distributions. Ask the question; statistically, this is not a particularly difficult thing to uncover if you take the time to look. In a world of Kaggle competitions and people trying to get high marks in deep learning classes, I think it's all too easy to take datasets as given, not think about them too much, and just try to push our accuracy from 99.1 to 99.2. As someone who is interested in people coming out of programs like this being ready to do work in the real world, I would caution that we can't only be training ourselves to do that.

15:30

So with that, I'm going to leave you with a set of additional resources around machine learning fairness. These are hot off the presses, in the sense that this particular little website launched at about 8:30 this morning, so you've got it first; MIT leading the way. You can open your laptops again now. On this page there are a number of papers that read like a greatest hits of the machine learning fairness literature from the last couple of years. They're really interesting papers; I don't think any of them is the one final solution to machine learning fairness, but they're super interesting reads and I think they help paint the space and the landscape really usefully. There are also a couple of interesting exercises that you can access via Colab and play with if you're interested in this space. I think they include one on adversarial debiasing, where, because you all love deep learning, you can use a network to try to become unbiased: you add an extra output head that predicts a characteristic you wish to be unbiased on, and then you penalize the model if it's good at predicting that characteristic. This adversarially ensures that the internal representation in the deep network is not picking up unwanted correlations or unwanted biases. I hope that's interesting. I'll be around afterwards to take questions, but at this point I'd like to make sure Shanqing has plenty of time, so thank you very much.

[Applause]


Related Tags
Deep Learning, Image Classification, Stereotypes, Data Bias, Inference Distribution, AI Fairness, Machine Learning, Google Brain, Open Images, Statistical Confounding