All Machine Learning algorithms explained in 17 min

Infinite Codes
17 Sept 2024 · 16:29

Summary

TL;DR: In this video, Tim, a seasoned data scientist, offers a comprehensive overview of essential machine learning algorithms and guidance on selecting the right algorithm for a given data problem. The video covers supervised and unsupervised learning, detailing algorithms like linear regression, logistic regression, KNN, SVM, and neural networks. It also touches on clustering with K-means and dimensionality reduction with PCA, providing a foundational understanding of how these algorithms can be applied to real-world problems.

Takeaways

  • πŸ‘¨β€πŸ« The speaker, Tim, is an experienced data scientist who has taught various machine learning algorithms to students.
  • ⏱️ The presentation aims to provide an overview of crucial machine learning algorithms within 17 minutes.
  • πŸ€– Machine learning is a subset of AI that involves algorithms capable of learning from data and making predictions on new, unseen data.
  • πŸ“Š Machine learning is primarily divided into supervised and unsupervised learning, each with its own set of algorithms and applications.
  • 🏠 In supervised learning, algorithms are trained on labeled data to predict outcomes, such as house prices or image classifications.
  • πŸ” Unsupervised learning involves finding patterns in data without any pre-existing labels, like grouping emails into categories.
  • πŸ“ˆ Linear regression is a foundational algorithm in supervised learning, used to model linear relationships between variables.
  • πŸ“Š Logistic regression is used for classification tasks, predicting the probability of an outcome based on input variables.
  • πŸ‘« The K-nearest neighbors (KNN) algorithm makes predictions based on the 'K' closest data points in the feature space.
  • πŸ›‘ Support Vector Machines (SVM) are used for classification and regression by finding the optimal boundary that separates different classes.
  • 🌳 Decision trees and their ensemble methods like Random Forests and boosting are powerful for handling complex decision-making processes.
  • 🧠 Neural networks, including deep learning, are capable of automatically learning complex features from data, making them versatile for a wide range of tasks.

Q & A

  • What is the main goal of the video presented by Tim, the data scientist?

    -The main goal of the video is to provide an intuitive understanding of major machine learning algorithms, helping viewers decide which algorithm is suitable for their specific problem and to stop feeling overwhelmed by the field.

  • How does Tim define machine learning according to the script?

    -Tim defines machine learning as a field of study in artificial intelligence that involves the development and study of statistical algorithms capable of learning from data and generalizing to unseen data, thus performing tasks without explicit instructions.

  • What are the two main subfields of machine learning mentioned in the script?

    -The two main subfields of machine learning mentioned are supervised learning and unsupervised learning.

  • What is the difference between supervised and unsupervised learning as described in the script?

    -Supervised learning involves a dataset with independent variables and a dependent variable that is supposed to be predicted, using known output values or labels for training. Unsupervised learning, on the other hand, involves no known truth about the data, and the algorithm groups data points by similarity without any further instructions.

  • What are the two broad categories within supervised learning according to the script?

    -The two broad categories within supervised learning are regression and classification. Regression predicts a continuous numeric target variable, while classification assigns a discrete categorical label to data points.

  • Can you explain the concept of linear regression as presented in the script?

    -Linear regression is a supervised learning algorithm that attempts to determine a linear relationship between an input variable and an output variable. It fits a linear equation to the data by minimizing the sum of the squares of the distances between data points and the regression line, aiming to minimize prediction errors for new data points.
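
To make this concrete, here is a minimal sketch of a linear regression fit using scikit-learn (the video shows no code, and the shoe-size/height numbers below are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: shoe size (input) vs. height in inches (output).
X = np.array([[7], [8], [9], [10], [11]])
y = np.array([64, 66, 68, 70, 72])

model = LinearRegression()
model.fit(X, y)  # minimizes the sum of squared residuals

print(model.coef_[0])          # ~2.0 inches of height per unit of shoe size
print(model.predict([[9.5]]))  # predicted height for an unseen shoe size
```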

  • What is logistic regression and how does it differ from linear regression?

    -Logistic regression is a variant of linear regression used for classification tasks. Unlike linear regression, which predicts a continuous output, logistic regression predicts the probability of a categorical output variable using input variables. It fits a sigmoid function to the data to estimate probabilities.
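
A minimal sketch of the same idea in code; the height/weight values and binary labels below are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [height_cm, weight_kg] -> class 0 or 1.
X = np.array([[160, 55], [165, 60], [170, 75], [180, 80], [185, 90], [175, 70]])
y = np.array([0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# The fitted sigmoid yields class probabilities rather than a numeric value.
print(clf.predict_proba([[180, 78]]))
```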

  • How does the K-Nearest Neighbors (KNN) algorithm work, as described in the script?

    -The KNN algorithm is a non-parametric algorithm used for both regression and classification. For a new data point, it predicts the target to be the average of its K nearest neighbors in the feature space. The choice of K, a hyperparameter, affects the model's performance, with smaller values leading to overfitting and larger values leading to underfitting.
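
A short KNN sketch in scikit-learn; the data points and the choice of K=3 are illustrative assumptions (in practice K would be tuned, for example with cross-validation):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: [height_cm, weight_kg] -> class 0 or 1.
X = [[160, 55], [165, 60], [170, 75], [180, 80], [185, 90], [175, 70]]
y = [0, 0, 1, 1, 1, 1]

# n_neighbors is the hyperparameter K: too small overfits, too large underfits.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

print(clf.predict([[172, 68]]))  # majority vote among the 3 nearest points
```

A KNeighborsRegressor with the same interface would handle the regression case.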

  • What is the core concept of the Support Vector Machine (SVM) algorithm?

    -The core concept of the SVM algorithm is to draw a decision boundary that separates data points of the training set as distinctly as possible. It maximizes the margin between different classes, making the decision boundary generalize well and be less sensitive to noise and outliers.
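
A hedged SVM sketch on invented cat/elephant measurements; a linear kernel suffices for this toy case, while kernel="rbf" or "poly" would allow nonlinear boundaries via the kernel trick:

```python
from sklearn.svm import SVC

# Hypothetical data: [weight_kg, nose_length_cm], 0 = cat, 1 = elephant.
X = [[4, 5], [5, 6], [3, 4], [4000, 150], [5000, 180], [4500, 160]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)        # the points that define the margin
print(clf.predict([[4200, 155]]))  # classified by its side of the boundary
```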

  • What is the difference between ensemble methods like Random Forests and Boosting as described in the script?

    -Random Forests use an ensemble method called bagging, where multiple decision trees are trained on different subsets of the training data and vote on the classification by majority. Boosting, on the other hand, trains models sequentially, with each model focusing on correcting the errors of the previous model. Random Forests are less prone to overfitting and faster to train, while Boosting can achieve higher accuracies but is slower and more prone to overfitting.
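
The contrast can be sketched with scikit-learn's two ensemble classes; the synthetic dataset and hyperparameters below are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: trees trained in parallel on bootstrap samples, majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: trees trained sequentially, each correcting its predecessor.
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(rf.score(X_te, y_te), gb.score(X_te, y_te))
```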

  • How does the concept of neural networks relate to the idea of feature engineering as presented in the script?

    -Neural networks extend the idea of feature engineering by implicitly and automatically designing complex features for the model without human guidance. By adding layers of unknown variables, or hidden layers, the network learns to represent complex patterns in the data, such as recognizing shapes or features in images, without explicitly defining these features.
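
As a rough illustration of the digit-recognition example, here is a small multilayer perceptron in scikit-learn; the layer sizes are arbitrary, and the library's bundled 8x8 digits stand in for the handwritten digits discussed in the video:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8-pixel handwritten digits; the features are raw pixel intensities.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two hidden layers learn intermediate features (strokes, shapes) implicitly.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)

print(mlp.score(X_te, y_te))
```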

  • What is the primary goal of unsupervised learning algorithms like K-means clustering?

    -The primary goal of unsupervised learning algorithms like K-means clustering is to find underlying structures in the data without any prior knowledge of the data's labels. K-means aims to partition the data into K distinct, non-overlapping subgroups or clusters based on the similarity of data points.
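
A minimal K-means sketch on synthetic two-dimensional blobs; the three blob centers are invented so the result is easy to verify:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three blobs; the algorithm never sees which blob a point came from.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)  # alternates assign/recenter until stable

print(km.cluster_centers_)  # ends up near the three true blob centers
```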

  • How does Principal Component Analysis (PCA) contribute to dimensionality reduction as described in the script?

    -PCA contributes to dimensionality reduction by identifying the directions (principal components) along which the data varies the most and retaining these directions while discarding others that contribute less to the variance. This process reduces the number of features in the dataset, potentially removing redundancy and noise, and can improve the efficiency of machine learning models.
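
A short PCA sketch on made-up, strongly correlated "length" and "height" features, mirroring the fish example from the video:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical fish data: height tracks length almost exactly.
length = rng.normal(30, 5, size=100)
height = 0.4 * length + rng.normal(0, 0.5, size=100)
X = np.column_stack([length, height])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # e.g. [0.99, 0.01]: PC1 is the "shape"

X_reduced = PCA(n_components=1).fit_transform(X)  # keep only PC1
```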

Outlines

🎯 Introduction to Machine Learning Algorithms

In this introductory paragraph, Tim, a data scientist with over 10 years of experience, presents an overview of machine learning (ML) algorithms, aiming to help viewers select the right algorithm for their problem. He highlights the importance of understanding key ML algorithms and distinguishes between supervised and unsupervised learning. Tim also sets the goal of providing an intuitive grasp of the major algorithms, focusing on how to apply them effectively.

πŸ” Overview of Supervised and Unsupervised Learning

This section explains the two main types of machine learning: supervised and unsupervised. In supervised learning, a known dataset with input (features) and output (labels) is used to train models for prediction. Examples include house price prediction and object classification (e.g., cat or dog). In unsupervised learning, there are no labels, and the goal is to find patterns or groupings in the data, like clustering emails into categories without prior labels.

πŸ“Š Supervised Learning: Regression and Classification

Supervised learning is broken down into two subfields: regression and classification. Regression predicts a continuous target variable, like a house price, while classification assigns a categorical label, such as marking an email as spam or non-spam. Tim discusses how regression finds relationships between inputs and a numeric output, while classification assigns discrete categories (possibly more than two, like Gmail's junk, primary, social, and promotions) based on input variables.

πŸ“‰ Linear Regression: The Foundation of Machine Learning

Linear regression is introduced as the simplest and most fundamental supervised learning algorithm. It fits a linear equation to data points to predict an output based on input features, minimizing prediction errors. Tim explains the concept of fitting data to a regression line to find relationships, like predicting height based on shoe size. He also touches on how this basic idea extends to more complex models, such as neural networks.

πŸ“ˆ Logistic Regression and K-Nearest Neighbors (KNN)

Logistic regression is introduced as a basic classification algorithm used to predict categorical outputs. The example given is predicting gender based on height and weight, using a sigmoid function to estimate probabilities. Tim then explains K-Nearest Neighbors (KNN), a non-parametric algorithm that classifies data based on the 'K' nearest neighbors. KNN is useful for both regression and classification, with examples of predicting weight or gender based on neighboring data points.

πŸšͺ Support Vector Machines (SVM): Decision Boundaries

Support Vector Machines (SVM) are introduced as supervised algorithms primarily used for classification. The core idea is to create a decision boundary that separates data points into different classes. Tim explains how SVMs work by maximizing the margin between classes, making them effective in high-dimensional spaces. The use of kernel functions, which allow for complex nonlinear decision boundaries, is also highlighted as a powerful feature of SVMs.

πŸ“§ Naive Bayes Classifier and Decision Trees

Tim discusses the Naive Bayes classifier, which uses Bayes' theorem to classify data based on word occurrences in tasks like spam detection. Although the algorithm naively assumes that features are independent of each other, it is computationally efficient and a good approximation for text classification. Tim then introduces decision trees, which use a series of binary decisions to classify data; the goal is to create 'pure' leaf nodes by choosing splits that minimize misclassifications.
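
As a rough sketch of the word-count approach described here, assuming a toy corpus with invented spam labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy emails with hypothetical labels: 1 = spam, 0 = not spam.
emails = ["win money now", "meeting at noon", "cheap money win", "lunch tomorrow"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()  # per-email word-occurrence counts
X = vec.fit_transform(emails)

nb = MultinomialNB()     # combines per-word probabilities via Bayes' theorem
nb.fit(X, labels)

print(nb.predict(vec.transform(["win cheap money"])))  # -> [1]
```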

🌲 Random Forests and Ensemble Learning

Tim explains how decision trees are combined into ensemble algorithms like Random Forests and boosting. In Random Forests, multiple decision trees are trained on random subsets of data, and their predictions are combined to improve accuracy. Boosting, on the other hand, trains models sequentially, with each new model trying to correct the errors of the previous ones. Boosted trees can achieve higher accuracy but are more prone to overfitting and require more training time.

πŸ€– Neural Networks: The Power of Deep Learning

Neural networks are introduced, starting with the concept of logistic regression, then extending to multilayer perceptrons. Neural networks use hidden layers of variables to extract features and make predictions, automating feature engineering. As more layers are added, this becomes deep learning, allowing the network to learn complex patterns and relationships, such as recognizing digits from pixel data in images. Tim emphasizes how neural networks excel at tasks where manual feature engineering is difficult.

πŸ” Clustering: Unsupervised Learning and K-Means

Clustering is discussed as an unsupervised learning method that aims to find natural groupings in data without predefined labels. Tim uses examples to explain the difference between clustering and classification. The K-Means algorithm, which groups data by minimizing the distance to randomly chosen cluster centers, is introduced. The challenge of choosing the right number of clusters is highlighted, with references to other clustering methods like hierarchical clustering and DBSCAN.

πŸ“‰ Dimensionality Reduction and Principal Component Analysis (PCA)

Dimensionality reduction is introduced as a way to simplify datasets by reducing the number of features while retaining important information. Tim explains Principal Component Analysis (PCA), which identifies directions (principal components) that capture the most variance in the data. This can help reduce redundancy in large datasets and improve the efficiency and robustness of machine learning models. He uses the example of combining height and length into a single 'shape' feature to illustrate the process.

πŸŽ“ Conclusion and Machine Learning Cheat Sheet

In the final section, Tim wraps up the video by providing a summary of the algorithms discussed and a helpful cheat sheet from Scikit-learn to guide viewers in choosing the right algorithm for their specific problem. He also mentions a roadmap for learning machine learning, directing viewers to another video for further learning resources.

Keywords

πŸ’‘Machine Learning

Machine learning is a subset of artificial intelligence that focuses on the development of algorithms capable of learning from data and improving from experience. In the video, machine learning is the central theme, with the speaker aiming to provide an intuitive understanding of various algorithms and their applications. The script mentions that machine learning algorithms can learn from data and generalize to unseen data, which is crucial for tasks such as image recognition or predicting house prices.

πŸ’‘Supervised Learning

Supervised learning is a type of machine learning where the model is trained on labeled data. The video explains that in supervised learning, there is a dataset with features (independent variables) and a target variable (dependent variable) that the model aims to predict. The script uses examples like predicting house prices based on features like square footage and location, illustrating how supervised learning can be used for regression or classification tasks.

πŸ’‘Unsupervised Learning

Unsupervised learning is another branch of machine learning where the model tries to find patterns in data without any pre-existing labels. The video script contrasts unsupervised learning with supervised learning, explaining that in unsupervised learning, the algorithm is given no truth about the data and must discover patterns on its own, such as grouping similar items into clusters.

πŸ’‘Linear Regression

Linear regression is a fundamental algorithm in machine learning used for predicting a continuous outcome based on one or more features. The video script describes linear regression as trying to find a linear relationship between input and output variables, fitting a line that minimizes the sum of squared distances from the data points to the line. It's used for tasks like predicting house prices based on various features.

πŸ’‘Logistic Regression

Logistic regression is a classification algorithm used to predict categorical outcomes. Unlike linear regression, which predicts a continuous variable, logistic regression predicts the probability of an outcome belonging to a particular category. The video script uses the example of predicting the gender of a person based on height and weight, showcasing how logistic regression can be used to model binary outcomes.

πŸ’‘K Nearest Neighbors (KNN)

KNN is a simple, non-parametric algorithm used for both classification and regression tasks. The video script explains KNN by stating that it predicts the target for a new data point based on the average of its K nearest neighbors. The choice of K is a hyperparameter that can greatly affect the model's performance, with the script noting that too small a value can lead to overfitting, while too large a value can lead to underfitting.

πŸ’‘Support Vector Machine (SVM)

SVM is a supervised learning algorithm used for classification and regression. The video script describes SVM as drawing a decision boundary to separate data points as distinctly as possible. It uses the concept of 'support vectors', which are the data points closest to the decision boundary, to classify new data points. SVMs can handle high-dimensional data and utilize kernel functions to create complex, nonlinear decision boundaries.

πŸ’‘Neural Networks

Neural networks are a set of algorithms modeled loosely after the human brain that are designed to recognize patterns. The video script explains that neural networks can automatically learn and represent complex features without explicit programming. They are particularly powerful for tasks like image recognition, where the script uses the example of classifying handwritten digits, highlighting how neural networks can implicitly learn to recognize complex features like lines and shapes.

πŸ’‘Ensemble Methods

Ensemble methods in machine learning involve combining multiple models to improve predictive performance. The video script discusses two types of ensemble methods: bagging, where multiple models are trained on different subsets of the data, and boosting, where models are trained sequentially to correct the errors of previous models. A notable example from the script is the random forest, which is an ensemble of decision trees that vote on the classification of data points.

πŸ’‘Dimensionality Reduction

Dimensionality reduction is a technique used to reduce the number of features in a dataset, often to improve the performance of machine learning algorithms. The video script explains that algorithms like Principal Component Analysis (PCA) find directions that retain most of the data's variance, effectively reducing the number of features while preserving as much information as possible. This can help in making the model more efficient and reducing overfitting.

πŸ’‘K Means Clustering

K means clustering is a popular unsupervised learning algorithm used for clustering data into a predetermined number of groups. The video script describes the process of K means clustering, where initial cluster centers are chosen, and data points are assigned to the nearest center. The script explains that the centers are then recalculated based on the assigned data points, and this process is repeated until the cluster centers stabilize, resulting in distinct clusters of data points.

Highlights

Overview of the most important machine learning algorithms.

Simple strategy for picking the right algorithm for your problem.

Machine learning is divided into supervised and unsupervised learning.

Supervised learning involves predicting a target variable based on features.

Unsupervised learning involves finding patterns without known labels.

Regression predicts a continuous numeric target variable.

Classification assigns a discrete categorical label to data points.

Linear regression determines a linear relationship between input and output variables.

Logistic regression is used for predicting categorical output variables.

K-Nearest Neighbors (KNN) algorithm predicts based on the average of nearest neighbors.

Support Vector Machine (SVM) creates a decision boundary to classify data points.

Naive Bayes classifier is based on Bayes' theorem and is used for text classification.

Decision trees create a series of yes/no questions to partition data.

Ensemble methods like Random Forest and Boosting combine multiple models for better predictions.

Neural networks implicitly create complex features for better predictions.

Deep learning involves multiple layers of hidden features for complex data representation.

K-means clustering is a method for finding groups in unlabeled data.

Dimensionality reduction techniques like PCA help in reducing data complexity.

Practical advice on choosing the right algorithm for different types of machine learning problems.

Transcript

[00:00] In the next 17 minutes I will give you an overview of the most important machine learning algorithms to help you decide which one is right for your problem. My name is Tim, and I have been a data scientist for over 10 years and have taught all of these algorithms to hundreds of students in real-life machine learning boot camps. There is a simple strategy for picking the right algorithm for your problem. In 17 minutes you will know how to pick the right one for any problem and get a basic intuition of each algorithm and how they relate to each other. My goal is to give as many of you as possible an intuitive understanding of the major machine learning algorithms, to make you stop feeling overwhelmed.

[00:32] According to Wikipedia, machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Much of the recent advancement in AI is driven by neural networks, which I hope to give you an intuitive understanding of by the end of this video.

[00:49] Let's divide machine learning into its subfields. Generally, machine learning is divided into two areas: supervised learning and unsupervised learning. Supervised learning is when we have a data set with any number of independent variables (also called features or input variables) and a dependent variable (also called target or output variable) that is supposed to be predicted. We have a so-called training data set, where we know the true values for the output variable (also called labels), that we can train our algorithm on to later predict the output variable for new, unknown data. Examples could be predicting the price of a house (the output variable) based on features of the house, say square footage, location, year of construction, etc., or categorizing an object as a cat or a dog (the output variable, or label) based on features of the object, say height, weight, size of the ears, color of the eyes, etc. Unsupervised learning is basically any learning problem that is not supervised, so where no truth about the data is known. Where a supervised algorithm would be like showing a little kid what a typical cat looks like and what a typical dog looks like, and then giving it a new picture and asking it what animal it sees, an unsupervised algorithm would be giving a kid with no idea of what cats and dogs are a pile of pictures of animals and asking it to group them by similarity, without any further instructions. An example of an unsupervised problem might be to sort all of your emails into three unspecified categories, which you can then later inspect and name as you wish. The algorithm will decide on its own how it will create those categories, also called clusters.

[02:14] Let's start with supervised learning, arguably the bigger and more important branch of machine learning. There are broadly two subcategories. In regression, we want to predict a continuous numeric target variable for a given input variable. Using the example from before, it could be predicting the price of a house given any number of features of a house and determining their relationship to the final price of the house. We might, for example, find out that square footage is directly proportional to the price (linear dependence) but that the age of the house has no influence on the price of the house. In classification, we try to assign a discrete categorical label, also called a class, to a data point. For example, we may want to assign the label spam or no spam to an email based on its content, sender, and so on. But we could also have more than two classes, for example junk, primary, social, promotions, and updates, as Gmail does by default.

[02:59] Now let's dive into the actual algorithms, starting with the mother of all machine learning algorithms: linear regression. In general, supervised learning algorithms try to determine the relationship between two variables; we try to find the function that maps one to the other. Linear regression in its simplest form is trying to determine a linear relationship between two variables, namely the input and the output. We want to fit a linear equation to the data by minimizing the sum of squares of the distances between data points and the regression line. This simply minimizes the average distance of the real data to our predictive model (in this case the regression line) and should therefore minimize prediction errors for new data points. A simple example of a linear relationship might be the height and shoe size of a person, where the regression fit might tell us that for every one unit of shoe size increase, the person will be on average 2 inches taller. You can make your model more complex and fit multi-dimensional data to an output variable. In the example of the shoe size, you might, for example, want to include the gender, age, and ethnicity of the person to get an even better model. Many of the very fancy machine learning algorithms, including neural networks, are just extensions of this very simple idea, as I will show you later in the video.

[04:04] Logistic regression is a variant of linear regression and probably the most basic classification algorithm. Instead of fitting a line to two numerical variables with a presumably linear relationship, you now try to predict a categorical output variable using categorical or numerical input variables. Let's look at an example. We now want to predict one of two classes, for example the gender of a person based on height and weight, so a linear regression wouldn't make much sense anymore. Instead of fitting a line to the data, we now fit a so-called sigmoid function to the data, which looks like this. The equation will now not tell us about a linear relationship between two variables, but will conveniently tell us the probability of a data point falling into a certain class given the value of the input variable. So, for example, the likelihood of an adult person with a height of 180 cm being a man would be 80% (this is completely made up, of course).

[04:53] The K-nearest neighbors algorithm, or KNN, is a very simple and intuitive algorithm that can be used for both regression and classification. It is a so-called non-parametric algorithm, which means that we don't try to fit any equations and thus find any parameters of a model, so no true model fitting is necessary. The idea of KNN is simply that for any given new data point, we will predict the target to be the average of its K nearest neighbors. While this might seem very simple, this is actually a very powerful predictive algorithm, especially when relationships are more complicated than a simple linear relationship. In a classification example, we might say that the gender of a person will be the same as the majority of the five people closest in weight and height to the person in question. In a regression example, we might say that the weight of a person is the average weight of the three people closest in height and chest circumference. This makes a ton of intuitive sense. You might realize that the number three seems a bit arbitrary, and it is. K is called a hyperparameter of the algorithm, and choosing the right K is an art. Choosing a very small K, say one or two, will lead to your model predicting your training data set very well but not generalizing well to unseen data; this is called overfitting. Choosing a very large number, say 1,000, will lead to a worse fit overall; this is called underfitting. The best number is somewhere in between and depends a lot on the problem at hand. Methods for finding the right hyperparameters include cross-validation, but are beyond the scope of this video.

[06:10] Support vector machine (SVM) is a supervised machine learning algorithm originally designed for classification tasks, but it can also be used for regression tasks. The core concept of the algorithm is to draw a decision boundary between data points that separates the data points of the training set as well as possible. As the name suggests, a new, unseen data point will be classified according to where it falls with respect to the decision boundary. Let's take this arbitrary example of trying to classify animals by their weight and the length of their nose. In this simple case of trying to classify cats and elephants, the decision boundary is a straight line. The SVM algorithm tries to find the line that separates the classes with the largest margin possible, that is, maximizing the space between the different classes. This makes the decision boundary generalize well and be less sensitive to noise and outliers in the training data. The so-called support vectors are the data points that sit on the edge of the margin. Knowing the support vectors is enough to classify new data points, which often makes the algorithm very memory efficient. One of the benefits of SVM is that it is very powerful in high dimensions, that is, if the number of features is large compared to the size of the data. In those higher-dimensional cases, the decision boundary is called a hyperplane. Another feature that makes SVMs extremely powerful is the use of so-called kernel functions, which allow for the identification of highly complex nonlinear decision boundaries. Kernel functions are an implicit way to turn your original features into new, more complex features using the so-called kernel trick, which is beyond the scope of this video. This allows for efficient creation of nonlinear decision boundaries by creating complex new features, such as weight divided by height squared (also called the BMI). This is called implicit feature engineering. Neural networks take the idea of implicit feature engineering to the next level, as I will explain later. Possible kernel functions for SVMs are the linear, the polynomial, the RBF, and the sigmoid kernel.

[07:49] Another fairly simple classifier is the naive Bayes classifier, which gets its name from Bayes' theorem, which looks like this. I believe it's easiest to understand naive Bayes with an example use case that it is often used for: spam filters. We can train our algorithm with a number of spam and non-spam emails, count the occurrences of different words in each class, and thereby calculate the probability of certain words appearing in spam emails and non-spam emails. We can then quickly classify a new email based on the words it contains by using Bayes' theorem: we simply multiply the different probabilities of all words in the email together. This algorithm makes the false assumption that the probabilities of the different words appearing are independent of each other, which is why we call this classifier naive. This makes it very computationally efficient while still being a good approximation for many use cases, such as spam classification and other text-based classification tasks.

[08:35] Decision trees are the basis of a number of more complex supervised learning algorithms. In its simplest form, a decision tree looks somewhat like this. The decision tree is basically a series of yes/no questions that allow us to partition a data set in several dimensions. Here is an example decision tree for classifying people into high- and low-risk patients for heart attacks. The goal of the decision tree algorithm is to create so-called leaf nodes at the bottom of the tree that are as pure as possible, meaning that instead of randomly splitting the data, we try to find splits that lead to the resulting groups, or leaves, being as pure as possible, which is to say that as few data points as possible are misclassified. While this might seem like a very basic and simple algorithm, which it is, we can turn it into a very powerful algorithm by combining many decision trees together. Combining many simple models into a more powerful, complex model is called an ensemble algorithm. One form of ensembling is bagging, where we train multiple models on different subsets of the training data using a method called bootstrapping. A famous version of this idea is called a random forest, where many decision trees vote on the classification of your data by majority vote of the different trees in the random forest. Random forests are very powerful estimators that can be used both for classification and regression. The randomness comes from randomly excluding features for different trees in the forest, which prevents overfitting and makes it much more robust, because it removes correlation between the trees. Another type of ensemble method is called boosting, where instead of running many decision trees in parallel, as for random forests, we train models in sequence, where each model focuses on fixing the errors made by the previous model. We combine a series of weak models in sequence into a strong model, because each sequential model tries to fix the errors of the previous model. Boosted trees often get to higher accuracies than random forests but are also more prone to overfitting, and their sequential nature makes them slower to train than random forests. Famous examples of boosted trees are AdaBoost, gradient boosting, and XGBoost, the details of which are beyond the scope of this video.

[10:25] Now let's get to the reigning king of AI: neural networks. To understand neural networks, let's look at logistic regression again. Say we have a number of features and are trying to predict a target class. The features might be pixel intensities of a digital image, and the target might be classifying the image as one of the digits from 0 to 9. Now, for this particular case, you might see why this might be difficult to do with logistic regression, because, say, the number one doesn't look the same when different people write it, and even if the same person writes it several times, it will look slightly different each time, and it won't be the exact same pixels illuminated for every instance of the number one. All of the instances of the number one have commonality, however: they all have a dominating vertical line, usually no crossing lines as other digits might have, and usually no circular shapes as there would be in the number eight or nine. However, the computer doesn't initially know about these more complex features, only the pixel intensities. We could manually engineer these features by measuring some of these things and explicitly adding them as new features. But artificial neural networks, similarly to using a kernel function with a support vector machine, are designed to implicitly and automatically design these features for us, without any guidance from humans. We do this by adding additional layers of unknown variables between the input and output variables. In its simplest form, this is called a single-layer perceptron, which is basically just a multi-feature regression task. Now, if we add a hidden layer, the hidden variables in the middle layer represent some hidden, unknown features, and instead of predicting the target variable directly, we try to predict these hidden features with our input features, and then try to predict the target variables with our new hidden features. In our specific example, we might be able to say that every time several pixels are illuminated next to each other, they represent a horizontal line, which can be a new feature to try to predict the digit in question, even though we never explicitly defined a feature called "horizontal line." This is a much simplified view of what is actually going on, but hopefully it gets the point across. We don't usually know what the hidden features represent; we just train the neural network to predict the final target as well as possible. The hidden features we can design this way are limited in the case of a single hidden layer, but what if we add a layer and have the hidden layer predict another hidden layer? What if we now had even more layers? This is called deep learning and can result in very complex hidden features that might represent all kinds of complex information in the pictures, like the fact that there is a face in the picture. However, we will usually not know what the hidden features mean; we just know that they result in good predictions.

[12:43] All we have talked about so far is supervised learning, where we wanted to predict a specific target variable using some input variables. However, sometimes we don't have anything specific to predict and just want to find some underlying structure in our data. That's where unsupervised learning comes in. A very common unsupervised problem is clustering. It's easy to confuse clustering with classification, but they are conceptually very different. Classification is when we know the classes we want to predict and have training data with true labels available (shown as colors here), like pictures of cats and dogs. Clustering is when we don't have any labels and want to find unknown clusters just by looking at the overall structure of the data. For example, we might look at a two-dimensional data set that looks like this. Any human will probably easily see three clusters here, but it's not always as straightforward, as your data might also look like this. We don't know how many clusters there are, because the problem is unsupervised. The most famous clustering algorithm is called K-means clustering. Just like for KNN, K is a hyperparameter, and it stands for the number of clusters you are looking for. Finding the right number of clusters, again, is an art and has a lot to do with your specific problem; some trial and error and domain knowledge might be required. This is beyond the scope of this video. K-means is very simple: you start by randomly selecting centers for your K clusters and assigning all data points to the cluster center closest to them (the clusters here are shown in blue and green). You then recalculate the cluster centers based on the data points now assigned to them; you can see the centers moving closer to the actual clusters. You then assign the data points again to the new cluster centers, followed by recalculating the cluster centers. You repeat this process until the centers of the clusters have stabilized. While K-means is the most famous and most common clustering algorithm, other algorithms exist, including some where you don't need to specify the number of clusters, like hierarchical clustering and DBSCAN, which can find clusters of arbitrary shape, but I won't discuss them here.

[14:33] The last type of algorithm I will leave you with is dimensionality reduction. The idea of dimensionality reduction is to reduce the number of features, or dimensions, of your data set while keeping as much information as possible. Usually this group of algorithms does this by finding correlations between existing features and removing potentially redundant dimensions without losing much information. For example, do you really need a picture in high resolution to recognize the airplane in it, or can you reduce the number of pixels in the image? As such, dimensionality reduction will give you information about the relationships within your existing features, and it can also be used as a pre-processing step for your supervised learning algorithm, to reduce the number of features in your data set and make the algorithm more efficient and robust. An example algorithm is principal component analysis, or PCA. Let's say we are trying to predict types of fish based on several features, like length, height, color, and number of teeth. When looking at the correlations of the different features, we might find that height and length are strongly correlated, and including both won't help the algorithm much and might in fact hurt it by introducing noise. We can simply include a shape feature that is a combination of the two. This is actually extremely common in large data sets and allows us to reduce the number of features dramatically and still get good results. PCA does this by finding the directions in which most variance in the data set is retained. In this example, the direction of most variance is a diagonal. This is called the first principal component, or PC, and it can become our new shape feature. The second principal component is orthogonal to the first and only explains a small fraction of the variance of the data set, and it can thus be excluded from our data set in this case. In large data sets we can do this for all features, rank them by explained variance, and exclude any principal components that don't contribute much to the variance and thus wouldn't help much in our ML model.

[16:12] This was all common machine learning algorithms explained. If you are overwhelmed and don't know which algorithm you need, here is a great cheat sheet by scikit-learn that will help you decide which algorithm is right for which type of problem. If you want a roadmap on how to learn machine learning, check out my video on that.


Related Tags
Machine Learning, Algorithms, Data Science, Supervised Learning, Unsupervised Learning, Neural Networks, Classification, Regression, Data Analysis, AI