Week 3 -- Capsule 1 -- Nearest Neighbor
Summary
TLDR: This script introduces supervised machine learning, in particular the 'nearest neighbor' model. The week focuses on mostly linear models, mainly for classification, though their extension to regression and density estimation is mentioned. The models are divided into non-probabilistic models (such as 'nearest neighbor' and 'support vector machines') and probabilistic models (such as 'naive Bayes'). The 'nearest neighbor' approach is explained, including how to select the number of neighbors (the hyperparameter k) using a validation set to optimize performance. The advantages and challenges of 'nearest neighbor', notably the need to store the entire dataset and its robustness in the presence of outliers, are also discussed.
Takeaways
- 📚 This week introduces supervised learning and the first supervised learning model of the semester, called 'nearest neighbor'.
- 🔍 We will explore different supervised learning models, mostly linear ones, with a focus on classification, although many of these models can be extended to regression or even density estimation.
- 🏫 We will cover two types of models: non-probabilistic models (such as nearest neighbor and support vector machines) and probabilistic models (notably the naive Bayes model).
- 🏠 In supervised learning, we have a set of characteristics or features and a target, called 'y', which here is a classification (for example, how quickly a house will sell).
- 🔍 The nearest neighbor (NN) model is a simple yet powerful non-parametric model with no training phase: at test time, it uses the training instances to predict according to the neighbors of the test instance.
- 🔢 NN can be extended to 'k nearest neighbors' (kNN), where a new instance is classified by the majority vote of its 'k' nearest neighbors; a minimal from-scratch sketch follows this list.
- 🔧 The choice of the hyperparameter 'k' is crucial and can be optimized with a validation set to find the value that yields the best performance.
- 📈 A higher 'k' can make the decision boundary smoother, while a lower 'k' can be less robust to noise or outliers.
- 💾 kNN is a non-parametric approach that requires storing the entire training set, and searching for the nearest neighbors can be expensive.
- 🔄 Libraries such as scikit-learn enable faster nearest-neighbor searches, sometimes using approximations or by preprocessing the data into a structure that allows more efficient retrieval.
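To make the majority-vote idea above concrete, here is a minimal sketch of a k-nearest-neighbor classifier written from scratch with NumPy; the toy arrays and the function name `knn_predict` are illustrative choices, not from the capsule.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # euclidean distance from x_new to every training instance
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# toy 2-d data: class 0 around the origin, class 1 around (3, 3)
X_train = np.array([[0.0, 0.2], [0.3, 0.1], [0.1, 0.4],
                    [3.0, 3.1], [2.8, 3.3], [3.2, 2.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.9, 3.0]), k=3))  # -> 1
```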
Q & A
What is the first supervised learning model introduced in the script?
-The first supervised learning model introduced is the 'nearest neighbor' model.
What are the two types of supervised learning models covered in the script?
-The two types covered are non-probabilistic models, including nearest neighbors and support vector machines, and probabilistic models, in particular naive Bayes models.
Why is the 'nearest neighbor' model chosen as the first model in the script?
-The 'nearest neighbor' model is chosen first because it is conceptually very simple yet quite powerful.
How does the 'nearest neighbor' model work for classification?
-The model classifies an instance according to its single closest neighbor for '1 nearest neighbor', or according to the majority vote of its 'k' closest neighbors for 'k nearest neighbors'.
What is the difference between '1 nearest neighbor' and 'k nearest neighbors'?
-'1 nearest neighbor' classifies an instance according to its single closest neighbor, while 'k nearest neighbors' classifies an instance according to the majority vote of its 'k' closest neighbors.
How does the choice of 'k' affect the performance of the 'nearest neighbor' model?
-'k' is a hyperparameter that strongly affects the model's performance. A 'k' that is too small can be less robust to noise and outliers, while a 'k' that is too large oversmooths; in the extreme, k = n always predicts the majority class of the training set.
What are the challenges associated with using the 'nearest neighbor' model?
-The challenges include the need to store the entire dataset, the potentially expensive search for the nearest neighbors, and reduced performance on high-dimensional data due to the 'curse of dimensionality'.
How can the 'nearest neighbor' model be extended beyond classification?
-The 'nearest neighbor' model can be extended to regression problems, for example by taking the average of the nearest neighbors' values to predict a continuous target.
What theoretical guarantee does the script give about the 'nearest neighbor' model?
-If the training set tends to infinity and 'k' equals one, the test error is bounded by twice the optimal error.
How does the script suggest choosing the best 'k' for the 'nearest neighbor' model?
-The script suggests using a validation set to try different values of 'k' and picking the one that gives the best results on that validation set, as in the sketch below.
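As a hedged illustration of that suggestion, the sketch below tries several values of k with scikit-learn and keeps the one with the best validation accuracy; the synthetic dataset and the candidate list of k values are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic 3-class data standing in for the house-sellability features
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 11, 15]:           # candidate hyperparameter values
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)        # accuracy on the validation set
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"best k = {best_k} (validation accuracy {best_acc:.3f})")
```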
Outlines
📚 Introduction to Supervised Machine Learning
In this paragraph, the instructor gives an overview of the week devoted to supervised machine learning, notably introducing the nearest neighbor model. He mentions that different models will be explored, mostly linear ones, with a focus on classification, although these models can be extended to regression or density estimation. Two types of models will be covered: non-probabilistic models (such as nearest neighbor and support vector machines) and probabilistic models (such as the naive Bayes model).
🔍 The Nearest Neighbor Model
This paragraph explains the nearest neighbor (NN) model, a simple yet powerful non-parametric model. No training phase is needed; at test time, the training instances are used to make predictions based on the neighbors of the test instance. The example illustrates how to classify a point according to its proximity to other points of the same class. The idea is then extended to the k nearest neighbors (kNN) model, where a new instance is classified by the majority vote of its k nearest neighbors.
🔄 Impact of the Parameter K on Classification
This paragraph discusses how the parameter K of kNN affects classification. A higher K gives a smoother decision boundary, while K=1 is less robust to outliers. The instructor suggests choosing K based on performance measured on a validation set. It is also mentioned that using all neighbors would call for weighting each neighbor inversely proportionally to its distance, so that the closest neighbors play the biggest role (see the sketch just below).
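The inverse-distance weighting mentioned in this paragraph is exposed directly by scikit-learn through the `weights='distance'` option of `KNeighborsClassifier` (uniform majority vote being the default); a minimal sketch on made-up points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y = np.array([0, 0, 1, 1, 1])

# with all 5 neighbors, a uniform vote is 3-2 for class 1, but weighting
# each vote by 1/distance lets the two very close class-0 points win
weighted = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X, y)
uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform').fit(X, y)
print(weighted.predict([[0.1, 0.0]]), uniform.predict([[0.1, 0.0]]))  # [0] [1]
```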
📉 Properties and Limitations of the Nearest Neighbor Model
The properties of the nearest neighbor model are covered, including its non-parametric nature and the need to store the entire training set. The instructor mentions that searching for the nearest neighbors can be time-consuming, especially for large datasets. The possibility of using distance functions other than the Euclidean distance for non-continuous data is also raised, as are limitations such as degraded performance in very high-dimensional spaces due to the 'curse of dimensionality' (a sketch of the faster-search and custom-metric options follows).
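Both remedies mentioned here can be sketched with scikit-learn's `NearestNeighbors` helper: a fixed preprocessing cost builds a KD-tree for fast retrieval, and the distance metric is a single argument; the random data below is filler for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))    # stand-in training set

# pay a fixed cost up front: preprocess the data into a kd-tree structure,
# after which each neighbor query is much faster than a naive O(n) scan
index = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)
dist, idx = index.kneighbors(rng.normal(size=(1, 3)))

# a non-euclidean distance, e.g. manhattan, is a one-argument change
index_l1 = NearestNeighbors(n_neighbors=5, metric='manhattan').fit(X)
```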
Keywords
💡Supervised learning
💡Nearest neighbor model
💡Linear models
💡Classification
💡Non-probabilistic models
💡Probabilistic models
💡Hyperparameters
💡Validation set
💡Euclidean distance
💡K nearest neighbors (k-NN)
Highlights
Introduction to supervised learning and the nearest neighbor model.
Focus on linear models for classification with potential extension to regression or density estimation.
Discussion of non-probabilistic models such as nearest neighbor and support vector machines.
Introduction of probabilistic models, specifically naive Bayes.
Supervised learning involves learning a function from features to a target classification.
The nearest neighbor approach as a non-parametric model without a training phase.
Explanation of how the nearest neighbor model predicts based on the closest training instance.
Concept of extending the nearest neighbor approach to k-nearest neighbors for classification.
Importance of selecting the right value for k, the number of neighbors, using a validation set.
Demonstration of how k-nearest neighbors can be more robust than 1-nearest neighbor.
Visualization of decision boundaries with different values of k using scikit-learn.
Properties of nearest neighbors, including its non-parametric nature and requirement to store the entire dataset.
Discussion on the computational expense of searching for nearest neighbors and ways to optimize it.
Adaptability of nearest neighbors to different types of data and distance functions.
Theoretical guarantee that nearest neighbors' test error is bounded for large training sets.
Challenges of using nearest neighbors in high-dimensional spaces due to the curse of dimensionality.
Summary of nearest neighbors as a simple yet powerful model for both classification and regression tasks.
Transcripts
in this capsule we're going to introduce
the outline of what we're going to do
this week
which is supervised learning in addition
i'll also introduce the first supervised
learning model
this semester called the nearest
neighbor model
so we're going to introduce different
models for supervised learning the
models are mostly going to be linear
models
and we're going to focus on
classification but as you'll see
many of these models can actually be
extended to regression or even density
estimation so we're going to focus on
two types of models
one which you may already be more
familiar with is called
non-probabilistic models including
things like nearest neighbor
and support vector machines the second
is
probabilistic models in particular we're
going to introduce the naive bayes
models to talk about
lots of the probabilistic modeling
concepts
okay so recall that in supervised
learning you have a set of
characteristics
or features and you have a target which
we called
y here we're in
a classification case and so our
y are not real valued like it's not the
price of a house
and here it's actually whether or not
your house will sell quickly
and there are three possibilities either
your house will not sell quickly
either your house will sell
semi-quickly or either your
house will sell very quickly
so zero one and two
the task that we have is we want to
learn
a function or a model f
that goes from these characteristics to
a particular classification zero one
or two of course what we want to do with
this
task once we've learned it using
training data is
then to be able to apply it onto test
data
that is to predict the sellability of
new houses that we've never seen before
okay so this is a quick recall of the
supervised learning setting
um and in this week
we're mostly going to focus on actual
models so that
these functions f that allow us to
predict the sellability of
the house
given the characteristics of a house
the first model we're going to introduce
is the
nearest neighbor approach or nn
i chose it first because it's
conceptually very simple
yet it's quite a powerful model
the nearest neighbor approach belongs to the
class of non-parametric models
in short and this may make more sense in
a couple weeks
there is no training phase for the
nearest neighbor model
and so what you do at test time is that
you actually remember
the training instances and you predict
according to the neighbors
of the test instance okay so let's have
a
look at what that means okay here i have
data that belongs to either the green
class
or the blue class okay so i have a
binary classification problem
in addition each datum lives in two
dimension
okay so it has an x1 component and an x2
component
okay so now if i were to ask
um what does this what which class does
this point correspond to
okay well it's likely that our human
intuition would tell us that it belongs
to
the blue class okay and that
probably is because it's very close to
other blue instances and it's also
further from the green instances or the
green data
okay now imagine that i show you a point
that's
closer to the middle of what you
perceive as being the boundary
between these two classes then how would
you classify this point
so one thing you could do is you could
look at its closest neighbors
okay so the point that it's the closest
to
this is effectively what um
nearest neighbor does or what nearest
neighbor does okay so
we call this point x j and then we say
that we're gonna look at
the closest point to xj this close point
here i've called
x i prime okay and then we're going to
predict
the class of xj to be the same as a
class of its
nearest neighbor x i prime so in this
case we would predict
that xj belongs to the green class okay
this is called
one nearest neighbor we simply classify
an instance
according to its first nearest neighbor
okay mathematically you can think of it
as follows
first you find the index of the point
which i call i prime
that is the closest to your query point
xj
okay so you look over all points and you
pick the one
you keep the index of the one that is
the closest it's this one here in this
picture
and then you say that the class of your
test instance xj so its class yj is
simply going to be equal to
the class of its closest point so here
it's
y i prime and it's the green class
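In symbols, the rule just described can be written as follows (a reconstruction from the words above, with d the distance function and n the number of training points):

```latex
i' = \arg\min_{i \in \{1,\dots,n\}} d(x_j, x_i), \qquad \hat{y}_j = y_{i'}
```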
okay now perhaps intuition says that
looking at one neighbor is a good idea
but you could look you should look at
more neighbors
okay and so we can actually extend this
idea of one nearest neighbor
classification
to k nearest neighbor classification
in this case a new instance xj for
example
will be classified according to the
majority vote of its
k nearest neighbors okay so let's look
at this example
imagine that we assume that k is equal
to five okay and in general k is a hyper
parameter
that you will need to set of course the
performance of your algorithm
can strongly depend on this hyperparameter
okay so here we say that we're looking
at the five closest neighbors i've
identified them as being these five
points
we know that three of them are blue and
two of them are green
so if we take a majority vote it's three
against two in favor of the blue class
okay so according to uh k-nn
when k is equal to five so 5-nn
this point x j
should belong to the blue class so notice
that
the decision the prediction rather i
should say that
5-nn would be different than the
prediction of 1-nn
in this particular case okay also here
on the left hand side
i've written down in more
formal mathematics
what i've just said in words
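A plausible reconstruction of that formula: writing N_k(x_j) for the set of the k nearest training points to x_j, the majority vote reads

```latex
\hat{y}_j = \arg\max_{c} \sum_{i \in N_k(x_j)} \mathbb{1}[y_i = c]
```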
okay so perhaps um now you say
that okay this sounds reasonable right
um if i have a new point
i'm going to look at its neighbors and
i'm going to base my classification
on that but really how many neighbors
should i be looking at
okay how do i set k in practice so
in practice what you could do is you
could
select or pick a validation set
okay this is something i've talked about
in the capsules from the last week okay
and then you could try um you could
evaluate the performance
of different values of k and simply pick
k that had that yields the best
performance on your
validation set alternatively you could
say something like
i'm going to use all neighbors
okay and of course
if that's not enough basically right
so if you use all neighbors then you
probably would want to weight the
neighbors
by how close they are to your new
instance okay so in this particular case
maybe i'm looking at all these neighbors
but of course the neighbors that are the
closest
right intuitively they should play the
biggest role and so then their weight
is going to be the largest so you could
think that the weight could be inversely
proportional to the distance
the points that are the the closest
right the smallest distance
have the higher weight
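One common way to write that weighting (the exact form is an assumption, since the lecture only says "inversely proportional to the distance"):

```latex
w_i = \frac{1}{d(x_j, x_i)}, \qquad \hat{y}_j = \arg\max_{c} \sum_{i=1}^{n} w_i \, \mathbb{1}[y_i = c]
```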
okay so now i want to build some of your
intuitions about what would happen
with k in different cases
and so here's the first example so i
still have my data set of green points
and blue points in addition you'll
notice here that
i've colored the background of these
plots
and when the background is green it
means that if a new instance
appears in this space which
is green then it would be labeled as
green
and of course if it if a new instance
appears in
a place where the background is blue
then it would be classified
or predicted as being from the blue
class okay
what i'm comparing now on the left and
on the right
is a nearest neighbor approach where
we're using
here the three nearest neighbor so k is
equal to three versus
a nearest neighbor approach where we're
using only a single nearest neighbor so
k
is equal to one one thing i want you to
notice
is that the decision boundary
is much smoother when k is higher
okay so here you can see that decision
boundary
at certain places takes very sharp
angles
okay so it's unclear just looking at
this which one you should
pick but at least intuitively it seems
like
there's no absolute given the data
and the geometry of the data
we have access to
then there's no absolute reason or no
clear reason i should say
uh why there should be such a sharp
angle
at this point and at this point okay so
at least intuitively speaking it seems
like
having a higher number of neighbors may
make more sense
okay before i move on to the next slide
let me just mention that
i obtained this plot using the
scikit-learn library so scikit-learn is a
python toolbox for machine learning
it's great we'll be using it fairly
thoroughly in the class
and actually we'll spend a full week
uh in a couple of weeks
to explore the various possibilities and
various models that exist
in scikit-learn
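A minimal sketch of how such a decision-boundary plot can be produced with scikit-learn, assuming a toy two-class 2-d dataset; the grid resolution and cluster centers are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.8, (40, 2)),   # "green" class
               rng.normal([2, 2], 0.8, (40, 2))])  # "blue" class
y = np.array([0] * 40 + [1] * 40)

# a grid over the feature space; each grid point is colored by its prediction
xx, yy = np.meshgrid(np.linspace(-3, 5, 200), np.linspace(-3, 5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [3, 1]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    zz = clf.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)  # background = predicted class
    ax.scatter(X[:, 0], X[:, 1], c=y)
    ax.set_title(f"k = {k}")
plt.show()
```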
okay here's another example and now
we're going to look at the robustness of
the classifier so in this particular
case i reuse my data set from the
previous slide but i also added
what i would call the noisy instance or
an outlier
okay and what i compare again is on the
left k is equal to 3
to on the right k is equal to 1. it seems
like k is equal to 1 is a lot less
robust to this particular outlier
in particular it now would assume that
there's a large
region here which i would say clearly
belongs to the
green class that would actually be
classified as blue
we see that the effect is much smaller
for the
um for the case when k is equal to three
and
actually this zone stays green and
there's just a small zone here that
would become blue
and again this i think it would be
difficult to explain exactly
what why that should be okay
so um based on these two slides
one of the interesting intuitions you can have is
that in general
you know larger k's may be helpful
okay but of course it's not uh
it's not that larger is always better
okay in fact if k was equal to n then we
would just predict according to
the the number uh we would sorry always
predict the same thing
and in this case since there are more
green points than blue points we would
always predict
the the class should be green okay
so again k is a hyper parameter which
you should
set um using a validation set
and you know if k is equal to one
doesn't
quite work as well as you think we also
know that k is equal to n
will have issues and so in practice
you're going to want to try different
k's and keep the one that does the best
again on this validation set
okay so a few properties of nearest
neighbors uh first
as we said before it's a non-parametric
approach
and as we've made it very clear from the
last couple of slides
it will require storing the whole data
set okay so that means that
if your data set for example is too big
it does not fit on a single
computer
then using nearest neighbor will be very
difficult
uh further searching for nearest
neighbor of a new datum can be expensive
okay particularly you could imagine that
a naive search would mean that you're
going to compare
you're going to look at the distance
between the new instance and all of your
other instances okay now and so the
order of that operation is o of n
okay where n is the size of your
training data and so if you have a large
test set
then that search can become very
expensive
what scikit-learn and other libraries
allow you to do
is to have faster searches and
some of these searches are approximate
um in general the the intuition behind
some of these faster searches
is that you're going to pay fixed costs
at the beginning
to pre-process your training data okay
and you're going to pre-process it in a
particular structure
that allows for much faster retrieval of
nearest neighbors
so that's one thing and that's an option
that exists in scikit-learn
the second thing is that in my example
so far
i've imagined that my data was
continuous okay
and so but nearest neighbor can also be
used
when you're using
non-continuous data in particular when
you're using continuous data
you're probably going to want to use the
euclidean distance
function okay and when you're using
other types of data then what you would
want to do
is to change the distance function okay so
that's what allows you to
calculate the distance between different
neighbors to a non-euclidean distance
and again scikit-learn and other libraries
have various
standard options for doing that
okay a few more properties first
we know that nearest neighbor
if your training set tends to infinity
and k is equal to one then you know that
your
test error will be bounded by twice the
optimal error okay
so um that's that could be reassuring in
the sense right
that you know that for large very large
data set
then effectively your test error is
going to be
close to the optimal error that
a machine learning model could obtain
okay
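Written out, this is the classic Cover–Hart asymptotic guarantee (the symbolic form below is a standard statement of it, not shown verbatim in the capsule): with err* the optimal achievable error,

```latex
\lim_{n \to \infty} \operatorname{err}_{1\text{-NN}} \;\le\; 2 \, \operatorname{err}^{*}
```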
now of course um it's interesting and
worthwhile to know and to remember
but
it does not tell us very much about what
happens when
n does not tend to infinity and
especially when
n is small right so if you have a
small data set then
this of course would
not apply
um and it's not even clear that you
could derive some intuition
for small data sets by looking at this
particular theoretical guarantee
the second thing i'd like to mention is
that nearest neighbors may not perform
very well
with high dimensional inputs due to the
curse of dimensionality
okay the intuition here is that
in very high dimensional spaces
you will need lots and lots of data okay
if you don't have lots lots of data
then your neighbors or the neighbors of
a new instance may actually be very
far from it okay and that of course
could be an issue right
because the intuition behind nearest
neighbors
is that we're going to look at a local
neighborhood right because we imagine
that locally things are similar
okay now if we need to look at houses
that
in my high dimensional space are very
far
right then you could imagine that the
performance of nearest neighbor
would degrade tremendously
okay this of course this argument relies
on the fact that you have uniform data
and so again this is interesting
but in practice you may still want to
try it out even with your high
dimensional data
you may still want to try it out and and
see how well your nearest neighbor
will work on your particular data set
okay
okay so nearest neighbor summary again
nearest neighbor is a non-parametric approach
it does not require fitting parameters
using training data
instead it uses the training data to
directly to be able to predict
the classes of test instances or test
data
at least in its vanilla formulation the
only
hyper parameter that you have to set is
the number of neighbors
again you can set that using a
validation set
and now you may say yeah but in
scikit-learn there are lots of
other parameters that i can set such as
uh which distance function am i going to
use
right and it's absolutely true
and so i think in practice you should
look at these various parameters and
again you can set them all
using a validation
set
but at least um conceptually speaking
the only hyper parameter
we really have to set is the number of
neighbors
finally uh i'll mention something i
mentioned in one of the
initial slides but here i'm in a
classification setting and so i
introduced nearest neighbor for
classification
but the nearest neighbors can be
extended to regression problems and
even to density estimation for
regression problems
we could imagine that again you would
look for the nearest neighbors
of a particular test datum
but then instead of doing a majority
vote you would
for example take the average of the
nearest neighbors so for example if
you're trying to predict the price of
houses
with the nearest neighbor approach you
would look at your test house
find its closest neighbors according to
the particular k value you chose
and then you would simply take the
average price
of its neighboring houses and that would
give you a prediction for the price of
this new
test house
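The regression extension just described can be sketched in a few lines with scikit-learn's KNeighborsRegressor, which averages the neighbors' targets; the house features and prices below are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# invented data: [surface in m^2, number of rooms] -> sale price
X = np.array([[50, 2], [60, 3], [80, 3], [100, 4], [120, 5]])
prices = np.array([200_000, 240_000, 300_000, 380_000, 450_000])

# the prediction is the average price of the k nearest houses
# instead of a majority vote over classes
reg = KNeighborsRegressor(n_neighbors=3).fit(X, prices)
print(reg.predict([[90, 4]]))  # roughly the mean of the 3 closest prices
```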