MIT 6.S191 (2018): Issues in Image Classification
Summary
TLDR: The speaker discusses the advancements in deep learning for image classification, highlighting the impressive error rate reductions in recent years. However, they point out the limitations of current models, using examples from the Open Images dataset, which may not accurately classify images due to biases in the training data. The talk emphasizes the importance of considering the differences between training and inference distributions and the potential societal impact of machine learning models. The speaker concludes by advocating for awareness of these issues and provides resources on machine learning fairness.
Takeaways
- 📈 The error rate in image classification on ImageNet has significantly decreased over the years, with contemporary results showing impressive accuracy.
- 🌐 The Open Images dataset, with 9 million images and 6,000 labels, is a more complex and diverse dataset compared to ImageNet.
- 🤖 Deep learning models may sometimes fail to recognize human presence in images, indicating that image classification is not entirely solved.
- 🔍 In machine learning, a stereotype can be understood as a label learned from training-set correlations that are not causally related to the outcome.
- 🌍 Geo-diversity in datasets is crucial; Open Images dataset is predominantly from North America and Europe, lacking representation from other regions.
- 🔄 The assumption in supervised learning that training and test distributions are identical is often not true in real-world applications, which can lead to biased models.
- 📊 It's important to consider the societal impact of machine learning models, especially when they are based on societally correlated features.
- 📚 Understanding and addressing machine learning fairness issues is becoming increasingly important as these models become more prevalent in everyday life.
- 🔗 Additional resources on machine learning fairness are available, including papers and interactive exercises to explore the topic further.
- 💡 The speaker emphasizes the importance of not just focusing on improving accuracy but also on the societal implications of machine learning models.
Q & A
What is the main topic of the talk?
-The main topic of the talk is the issues with image classification using deep learning, particularly focusing on the challenges and biases in datasets like ImageNet and Open Images.
How has the error rate in image classification changed over the years?
-The error rate in image classification has significantly reduced over the years, with contemporary results showing an error rate of around 2.2%, which is considered astonishing compared to the 25% error rate in 2011.
What is the difference between ImageNet and Open Images datasets?
-ImageNet has around 1 million images with 1,000 labels, while Open Images has about 9 million images with 6,000 labels, and it supports multi-label classification.
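As a rough illustration of what that multi-label setup looks like in code (a sketch, not the speaker's actual model; the pooling choice and 6,000-label head are assumptions based on the numbers in the talk), a Keras model would replace the usual softmax with independent sigmoids:

```python
import tensorflow as tf

NUM_LABELS = 6000  # Open Images base label count, per the talk

# Pretrained Inception backbone; the ImageNet classification head is
# dropped because multi-label output needs one sigmoid per label,
# not a single softmax over mutually exclusive classes.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid"),
])

# Binary cross-entropy scores each label as its own yes/no decision.
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Because each label is scored independently, a single image can come back tagged with both "person" and "rugby ball", as described in the talk.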
Why did the deep learning model fail to recognize a bride in one of the images discussed in the talk?
-The model failed because it was trained on a dataset that did not represent global diversity well, particularly lacking data from regions like India, leading to a biased model that missed the presence of a human in the image.
What is a stereotype in the context of machine learning?
-In the context of machine learning, a stereotype is a statistical confounder that has a societal basis, which may lead to biased models picking up on correlations that are not causally related.
What is the importance of considering the training and inference distributions in machine learning models?
-Considering the training and inference distributions is crucial to ensure that the model's performance is not just limited to the training data but also generalizes well to new, unseen data in the real world.
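One cheap way to probe this assumption (my illustration, not something prescribed in the talk) is a "domain classifier": train a model to distinguish training examples from inference-time examples, and if it does much better than chance, the two distributions differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def distribution_shift_score(train_X, inference_X):
    """AUC of a classifier separating train vs. inference samples.
    ~0.5 means the distributions look alike; near 1.0 means they are
    easily distinguishable, i.e. strong covariate shift."""
    X = np.vstack([train_X, inference_X])
    y = np.concatenate([np.zeros(len(train_X)), np.ones(len(inference_X))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Toy check: inference data drawn from a shifted distribution.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 4))
shifted = rng.normal(1.5, 1.0, size=(500, 4))
print(distribution_shift_score(train, shifted))  # well above 0.5
```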
How does the geolocation diversity in the Open Images dataset compare to global diversity?
-The Open Images dataset is not representative of global diversity, as the majority of the data comes from North America and a few European countries, with very little data from places like India or China.
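A minimal sketch of the kind of geolocation audit the speaker describes, assuming image metadata has already been resolved to a country (the DataFrame and `country` column here are hypothetical):

```python
import pandas as pd

# Hypothetical metadata: one row per image with its inferred country.
images = pd.DataFrame({"country": ["US", "US", "GB", "US", "IN", "DE", "US"]})

# Share of the dataset contributed by each country; a handful of
# countries dominating this table is the geo-diversity skew at issue.
print(images["country"].value_counts(normalize=True))
```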
What is the potential issue with using a feature like shoe type in a machine learning model?
-Using a feature like shoe type can lead to biased predictions if it is strongly correlated with the target variable in the training data but not necessarily causally related, which may not generalize well to the real world.
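The running-shoe pitfall can be made concrete with a few lines of synthetic data (a hypothetical reconstruction; the dataset in the talk was itself invented for illustration). The spurious shoe feature perfectly tracks the label in training but carries no signal at inference time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def make_data(n, shoe_tracks_label):
    race_km = rng.uniform(5, 42, n)        # race length
    weekly_km = rng.uniform(10, 100, n)    # weekly training volume
    base = (race_km > weekly_km / 2).astype(int)  # underlying risk rule
    flip = rng.random(n) < 0.2                    # 20% label noise
    label = np.where(flip, 1 - base, base)
    # In the confounded training set the shoe feature matches the label
    # exactly; in the "real world" it carries no information.
    shoe = label.copy() if shoe_tracks_label else rng.integers(0, 2, n)
    return np.column_stack([race_km, weekly_km, shoe]), label

X_train, y_train = make_data(1000, shoe_tracks_label=True)
X_infer, y_infer = make_data(1000, shoe_tracks_label=False)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:    ", clf.score(X_train, y_train))   # near perfect
print("inference accuracy:", clf.score(X_infer, y_infer))   # degrades
```

Training accuracy looks excellent because the model leans on the confounded feature; at inference, where shoe type is uninformative, performance drops toward what the genuinely predictive features alone can support.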
What is the speaker's advice for individuals interested in machine learning fairness?
-The speaker advises being aware of differences between training and inference distributions, asking questions about confounders in the data, and not just focusing on improving accuracy but also considering the societal impact of the models.
What resources does the speaker recommend for further understanding of machine learning fairness?
-The speaker recommends a website with a collection of papers on machine learning fairness, as well as interactive exercises on adversarial debiasing to explore the topic further.
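The Colab exercise itself is not reproduced here, but one standard way to implement the adversarial debiasing idea the speaker sketches, assuming a Keras setup, is a gradient-reversal layer feeding an extra head that predicts the sensitive attribute (the names and layer sizes below are illustrative):

```python
import tensorflow as tf

@tf.custom_gradient
def flip_gradient(x):
    # Identity on the forward pass; negated gradient on the backward
    # pass, so the shared trunk is pushed to *remove* information the
    # adversary head could use.
    def grad(dy):
        return -dy
    return tf.identity(x), grad

class GradientReversal(tf.keras.layers.Layer):
    def call(self, x):
        return flip_gradient(x)

inputs = tf.keras.Input(shape=(32,))
trunk = tf.keras.layers.Dense(64, activation="relu")(inputs)

# Main head: the task we actually care about.
task_out = tf.keras.layers.Dense(1, activation="sigmoid", name="task")(trunk)

# Adversary head: tries to predict the sensitive attribute from the
# shared representation; the reversed gradient penalizes the trunk
# whenever the adversary succeeds.
adv_in = GradientReversal()(trunk)
adv_out = tf.keras.layers.Dense(1, activation="sigmoid",
                                name="sensitive")(adv_in)

model = tf.keras.Model(inputs, [task_out, adv_out])
model.compile(optimizer="adam",
              loss={"task": "binary_crossentropy",
                    "sensitive": "binary_crossentropy"})
```

Because the reversed gradient punishes the shared representation whenever the adversary head predicts the sensitive attribute well, training pushes the trunk to drop that information while still serving the main task.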
Outlines
🌟 Introduction to Deep Learning and Image Classification
The speaker begins by introducing themselves as being based in Google's Cambridge office and discusses their work with deep learning at Google Brain. They mention the significant progress in image classification accuracy over the years, particularly on the ImageNet dataset, and the transition from matching human-level performance to surpassing it. The speaker then introduces the talk's topic, the challenges and stereotypes in image datasets, and notes that their colleague Shanqing Cai will follow with a discussion of the TensorFlow debugger and eager mode.
🤔 Stereotypes in Machine Learning
The speaker delves into the concept of stereotypes in machine learning, using an interactive exercise to define stereotypes as labels based on experiences within a training set. They discuss the potential issues with relying on features that are correlated but not causal, using a hypothetical dataset about running shoe types and race completion risk as an example. The speaker emphasizes the importance of considering whether features in the training data are truly predictive or just correlated, and the societal implications of stereotypes in machine learning models.
🌐 Global Diversity in Image Data Sets
The speaker addresses the issue of global diversity in image datasets, specifically the Open Images dataset, which is found to be heavily skewed towards North America and certain European countries. They highlight the importance of the training distribution matching the inference distribution to ensure fair and accurate predictions. The speaker also discusses the societal factors that can lead to confounding in data, such as internet connectivity, and the need for machine learning practitioners to be aware of these issues to avoid perpetuating stereotypes.
📚 Resources on Machine Learning Fairness
The speaker concludes by providing resources on machine learning fairness, including a newly launched website with papers and exercises on the topic. They mention an adversarial debiasing exercise that aims to prevent a network from picking up unwanted correlations. The speaker encourages the audience to consider the broader impact of their work in machine learning and to be mindful of fairness issues, especially as machine learning models become more integrated into everyday life.
Keywords
💡Deep Learning
💡Image Classification
💡Stereotypes
💡Open Images
💡TensorFlow Debugger
💡Inference Time Performance
💡Statistical Confounder
💡Fairness in Machine Learning
💡Adversarial Debiasing
💡Training and Test Distributions
Highlights
Deep learning advancements have significantly reduced error rates in image classification.
ImageNet dataset has seen a remarkable improvement in classification accuracy over the years.
Open Images dataset, with 9 million images and 6,000 labels, presents a more complex challenge than ImageNet.
Inception-based models may not always capture human elements in images, raising questions about the effectiveness of image classification.
Stereotypes in machine learning can be based on features that are correlated but not causal.
The assumption in supervised machine learning is that training and test distributions are identical, which may not hold in real-world applications.
The importance of matching training set and inference distribution to ensure fair and effective model performance.
Open Images dataset's geolocation data reveals a lack of global diversity, with a heavy bias towards North America and certain European countries.
The potential societal impact of machine learning models based on societally correlated features.
The concept of stereotypes as statistical confounders with a societal basis.
The need for awareness of differences between training and inference distributions in machine learning applications.
The importance of not just optimizing for accuracy but also considering the societal implications of machine learning models.
Resources for further exploration of machine learning fairness, including papers and interactive exercises.
The use of adversarial techniques to reduce unwanted correlations and biases in deep learning models.
The speaker's emphasis on the need for a balanced approach to machine learning that includes fairness and societal impact considerations.
The speaker's call to action for the audience to engage with resources on machine learning fairness.
Transcripts
thanks for having me here yeah so I'm
I'm based in the Cambridge office which
is like a hundred meters that way um and
we do a lot of stuff with deep learning
we've got a large group in Google brain
and other related fields so hopefully
that's interesting to some of you at
some point so I'm gonna talk for about
20 minutes or so
um on this sort of issues in image
classification theme then I'm gonna hand it
over to my excellent colleague Shanqing
Cai who's going to go through an
entirely different subject using the
TensorFlow debugger and eager mode to
make working in TensorFlow easier
maybe that would be good okay so
let's take a step back so if you guys
seen happy graphs like this before go
ahead and smile and nod if you've seen
stuff like this yeah okay so this is a
happy graph on ImageNet-based image
classification so ImageNet is a dataset
of a million-some-odd images for this
challenge there were a thousand classes
and in 2011 back in the dark ages when
nobody knew how to do anything the state
of the art was something like 25% error
rate on this stuff and in the last call
it six seven years the reduction in
error rate has been kind of astounding
to the point where it's now been talked
about so much it's like no longer even
surprising and it was like yeah yeah we
see this human error rate is somewhere
between five and ten percent on this
task so the contemporary results of you
know 2.2 or whatever it is percent error
rate are really kind of astonishing and
you can look at a graph like this and
make reasonable claims that well
machines using deep learning are better
than humans at image classification on
this task that's kind of weird and kind
of amazing and maybe we can declare
victory and fill audiences full of
people clamoring to learn about deep
learning that's cool okay so um I'm
gonna talk not about image data itself
but about a slightly different image
data set basically people were like okay
obviously image net is too easy let's
make a large
more interesting data set so Open Images
was released I think a year or two ago
it's got about 9 million as opposed to 1
million images the base dataset has
6,000 labels as opposed to 1,000 labels
this is also multi-label so you get you
know if there's a person holding a rugby
ball you get both person and rugby ball
in the dataset it's got all kinds of
classes including stairs here which are
lovelily illustrated and you can find
this on github it's a nice data set to
play around with so some colleagues and
I did some work of saying ok what
happens if we apply just a straight-up
Inception-based model to this data
we trained it up and then we look at
how it classifies some images that we
found on the web so here's one such
image image that we found on the web all
the images here are Creative Commons and
stuff like that so it's it's OK for us
to look at these and when we apply an
ImageNet-y kind of model to this the
classifications we get back are kind
of what I personally would expect I'm
seeing things like bride dress ceremony
woman wedding all things that as an
American in this country at this time
I'm thinking or make makes sense for
this image cool maybe we did solve it
image classification so then we applied
it to another image also of a bride and
the model that we had trained up on this
open source image dataset returned the
following classifications clothing event
costume red and performance art no
mention of bride also no mention of
person regardless of gender so in a
sense this this model is sort of like
missed the fact that there's a human in
the picture which is maybe not awesome
and not really what I would think of as
great success if we're claiming that
image classification is solved
ok so what's going on here
I'm gonna argue a little bit that what's
going on is is based in to some degree
on the idea of stereotypes and if you're
if you have your laptop up I'd like you
to close your laptop for a second this
is the interactive portion where you can
interact by closing your laptop and I'd
like you to find somebody sitting next
to you and exercise your human
conversation skills for about one minute
to come up with a definition between the
two of you of what is a stereotype
keeping in mind that we're in sort of a
statistical setting okay so have a quick
one-minute conversation with the person
sitting next to you if there's no one
sitting next to you you may move ready
set go
three two one and thank you now for
having that interesting conversation
that easily could have lasted for much
more than one minute but such is life
let's hear from one or two folks who
had something that they came up with that
was interesting
oh yeah go ahead your name is adept yeah
okay what did you okay so Dickie is
saying that a stereotype is a
generalization that you find from a
large group of people and you apply it
to more people okay interesting
certainly agree with large parts of that
yeah okay so here the claim is
that it's a label that's based on
experience from within your training set
yeah super interesting and the
probability of a label based on what's in
your training set cool maybe one more oh
yeah good okay so that there's claim
here that stereotype has something to do
with unrelated features that happen to
be correlated I think that's interesting
let me see if I can this was not a plant
sorry your name was Constantine Constantine
is not a plant but I do want to look at
this a little bit more in detail so
here's here's a data set that I'm going
to claim is is based on running data so
in the early mornings I pretend that I'm
an athlete and go for a run and this is
a data set that sort of based on risk
that someone might not finish a race
that they enter in so we've got high
risk people or you know they are in
yellow and lower risk people are in red
you look in this data it's got a couple
dimensions
I might fit a linear classifier it's not
quite perfect if I look a little more
closely if I've actually got some more
information here
I don't just have x and y I also have this
sort of color of outline so I might have
a rule that if this data point has a
blue outline I'm gonna predict low-risk
otherwise I'm gonna predict high-risk
fair enough
now the big reveal you'll never guess
what the the outline feature is based on
shoe type the other x and y are based on
how long the race is and sort of what a
person's weekly training volume is but
whether you're foolish enough to buy
expensive running shoes because you
think they're going to make you faster
or whatever this is what's in the data
and in traditional machine learning
supervised machine learning we might say
well wait a minute
I'm not sure that shoe type is going to
be actually predictive on the other hand
it's in our training data and it does
seem to be awfully predictive on this
data set we have a really simple model
it's highly regularized it still gives
you know perfect or near-perfect accuracy
maybe it's fine and the only way we can
find out if it's not I would argue is by
gathering some more data and I'll point
out that this data set has been
diabolically constructed so that there
are some points in the data space that
are not particularly well represented
and you can maybe tell yourself a story
about maybe this data was collected
after some corporate 5k or something
like that so if we can collect some more
data maybe we find that actually there's
people wearing all kinds of shoes on
both sides of our imaginary classifier
but that this shoe type feature is
really not predictive at all and this
gets back to Constantine's point that
perhaps relying on features that are
strongly correlated but not necessarily
causal may be a point at which we're
thinking about a stereotype in some way
so obviously given this data and what we
know now I would probably go back and
suggest a linear classifier based on
these these features of length of race
and weekly training volumes potentially
a better model so how does this happen
what's what's the issue here that's at
play one of the issues that's at play is
that in supervised machine learning we
often make the assumption that our
training distribution and our test
distribution are identical right and we
make this assumption for a really good
reason which is that if we make that
assumption then we can pretend that
there's no difference between
correlation and causation and we can use
all of our features whether they're what
Constantine would call you know
meaningful or causal or not we can throw
them in there and so long as the test
and training distributions are the same
we're probably okay to within some some
degree but in the real world we don't
just apply models to a training or test
set we also use them to make predictions
that may influence the world in some way
and there I think that the right sort of
phrase to use isn't so much test set
it's more inference time performance
okay because that at inference time when
we're going and applying our model to
some new instance in the world we may
not actually know what the true label
is ever or things like that but
we still care very much about having
good performance and making sure that
our training set matches
our inference distribution to some
degree is is like super critical so
let's go back to open images and what
was happening there
you'll recall that it did quite badly on
at least anecdotally on that image of a
bride who appeared to be from India if
we look at the geo diversity of open
images this is something where we we did
our best to sort of track down the
geolocation of each of the images in the
open image data set what we found was
that an overwhelming proportion of the
data in open images was from North
America and six countries in Europe
vanishingly small amounts of that data
were from countries such as India or
China or other places where I've heard
there's actually a large number of
people
so this is clearly not representative in
a meaningful way of sort of the global
diversity of the world how does this
happen it's not like the researchers who
put together the Open Images dataset were in
any way ill-intentioned they were working
really hard to put together what they
believe was a more representative data
set than ImageNet at the very
least they don't have a hundred
categories of dogs in this one so what
happens well you could make an argument
that there's some strong correlation
between the distribution of Open Images
and the distribution of countries with
high-bandwidth low-cost
internet access it's not a perfect
correlation but it's it's pretty close
and if one might do
things like base an image classifier on
data drawn from a distribution of areas
that have high bandwidth low cost
internet access that may induce
differences between the training
distribution and the inference time
distribution none of this is like
something you couldn't figure out
if you sat down for
five minutes right this is all super
basic statistics it is in fact stuff
that statisticians have been
sort of railing at the machine learning
community about for the last several
become sort of more ubiquitous in
everyday life I think that paying
attention to these kinds of issues
becomes ever more important so let's go
back to what is a stereotype and I
think I agree with Constantine's idea
and I'm gonna add one more tweak to it
so I'm gonna say that a stereotype is a
statistical confounder
I think it's using Constantine's
language almost exactly that has a
societal basis so when I think about
issues of fairness if it's the case that
you know rainy weather is correlated
with people using umbrellas like yes
that's a confounder the umbrellas did
not cause the rain but I'm not as
worried
as an individual human about the societal
impact of models that are based on that
you know modulo I'm sure you could
imagine some crazy scary scenario where
that was the case but in general I don't
think that's as large an issue but when
we think of things like internet
connectivity or other societally based
factors I think that paying attention to
questions of do we have confounders in
our data are they being picked up by our
models is as incredibly important so if
you take away nothing else from this
short talk I hope that you take away a
caution to be aware of differences
between your training and inference
distributions ask the question because
statistically this is not a particularly
difficult thing to uncover if you take
the time to look in a world of Kaggle
competitions and people trying to get
high marks on deep learning classes and
things like that I think it's all too
easy for us to just take datasets as
given not think about them too much and
just try and get our accuracy from 99.1
to 99.2 and as someone who is
interested in people coming out of
programs like this being ready to do
work in the real world I would caution
that we can't only be training ourselves
to do that so with that I'm gonna leave
you with a set of additional resources
around machine learning fairness
these are super hot off the presses in
the sense that this particular little
website was launched at I think 8:30
this morning something like that so
you've you've got it first MIT leading
the way on this page there are yeah
you can open your laptops again now
there are a number of papers that go
through sort of a greatest hits
of the machine learning fairness
literature from the last couple years
really interesting papers I don't think
any of them are like the one final
solution to machine learning fairness
issues but they're super interesting
reads and I think help sort of paint the
the space and the landscape really
usefully there are also a couple of
interesting exercises there that you can
access via Colab and
if you're interested in this space
they're things that you can play with
I think they include one on adversarial
debiasing where because you guys all
love deep learning you can use a network
to try and become unbiased by making
sure that by having an extra output head
that predicts a characteristic that you
wish to be unbiased on and then
penalizing that model if it's good at
predicting that that characteristic and
so this is trying to adversarially make
sure that our internal representation in
a deep network is not picking up
unwanted correlations or unwanted
biases so I hope that that's interesting
and I'll be around afterwards to take
questions but at this point I'd like to
make sure that Shanqing has plenty of
time so thank you very much
[Applause]