All Learning Algorithms Explained in 14 Minutes

CinemaGuess
26 Feb 2024 · 14:09

Summary

TLDR: This script offers a comprehensive overview of common machine learning algorithms, including linear regression, support vector machines, naive Bayes, logistic regression, K-nearest neighbors, decision trees, random forests, gradient boosted decision trees, the clustering techniques K-means and DBSCAN, and the dimensionality reduction method PCA. It explains the purpose, methodology, and applications of each algorithm, highlighting their strengths and limitations in classification, regression, clustering, and dimensionality reduction tasks.

Takeaways

  • 📘 An algorithm is a set of instructions for a computer to perform calculations or problem-solving operations, not an entire program or code.
  • 📊 Linear regression is a supervised learning algorithm used to model the relationship between a continuous target variable and one or more independent variables by fitting a linear equation to the data.
  • 🛰 Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks, distinguishing classes by drawing a decision boundary that maximizes the margin from support vectors.
  • 🤖 Naive Bayes is a supervised learning algorithm for classification that assumes features are independent and calculates the probability of a class given a set of feature values.
  • 📈 Logistic regression is a supervised learning algorithm for binary classification problems, using the logistic function to map input values to a probability between 0 and 1.
  • 👫 K-Nearest Neighbors (KNN) is a supervised learning algorithm for both classification and regression that predicts the class or value of a data point based on the majority vote or mean of the closest points.
  • 🌳 Decision trees work by iteratively asking questions to partition data, choosing the splits that most increase node purity; left to grow until every node is pure, they become overly specific and overfit.
  • 🌲 Random Forest is an ensemble of decision trees that use bagging to reduce the risk of overfitting and improve accuracy through the majority vote or mean values of multiple trees.
  • 🌿 Gradient Boosted Decision Trees (GBDT) is an ensemble algorithm that combines individual decision trees in series, with each tree focusing on the errors of the previous one, to achieve high efficiency and accuracy.
  • 🔑 K-means clustering is an unsupervised learning method that partitions data into K clusters based on the similarity of data points, using an iterative process to find centroids and assign points to clusters.
  • 🏞️ DBSCAN is a density-based clustering algorithm that can find arbitrary shaped clusters and detect outliers without requiring a predetermined number of clusters, using neighborhood distance and minimum points parameters.

Q & A

  • What is an algorithm in the context of computer science?

    -An algorithm is a set of commands that a computer must follow to perform calculations or problem-solving operations. It is a finite set of instructions carried out in a specific order to perform a particular task and is not an entire program or code but rather a simple logic to a problem.

  • How does linear regression model the relationship between variables?

    -Linear regression models the relationship between a continuous target variable and one or more independent variables by fitting a linear equation to the data. It finds the best regression line by minimizing the sum of squares of the distances between the data points and the line.

  • What is the primary task of a Support Vector Machine (SVM)?

    -A Support Vector Machine (SVM) is a supervised learning algorithm primarily used for classification tasks. It distinguishes classes by drawing a decision boundary in multidimensional space, aiming to maximize the distance to support vectors to ensure good generalization.

  • How does the Naive Bayes algorithm make classification decisions?

    -The Naive Bayes algorithm, a supervised learning algorithm for classification, assumes that features are independent of each other. It calculates the probability of a class given a set of feature values using Bayes' theorem, relying on the independence assumption to make predictions quickly.

  • What is the logistic function used for in logistic regression?

    -The logistic function, also known as the sigmoid function, is used in logistic regression to map any real-valued number to a value between 0 and 1. It is used to perform binary classification tasks by calculating probabilities that can be thresholded to classify data points.

  • How does the K-Nearest Neighbors (KNN) algorithm determine the class of a data point?

    -The K-Nearest Neighbors (KNN) algorithm determines the class of a data point based on the majority voting principle of the K closest points. For regression, it takes the mean value of the K closest points, emphasizing the importance of choosing an optimal K value to avoid overfitting or underfitting.

  • What is the main advantage of decision trees in handling data?

    -Decision trees have the advantage of being easy to interpret and visualize. They work by iteratively asking questions to partition data, aiming to increase the purity of nodes with each split, which makes them suitable for both classification and regression tasks without the need for feature normalization or scaling.

  • How does Random Forest differ from a single decision tree?

    -Random Forest is an ensemble of many decision trees built using bagging, where each tree operates as a parallel estimator. It reduces the risk of overfitting and generally provides higher accuracy than a single decision tree, as it aggregates the results from multiple uncorrelated trees.

  • What is the boosting method used in Gradient Boosted Decision Trees (GBDT)?

    -Gradient Boosted Decision Trees (GBDT) use a boosting method that combines individual decision trees sequentially to achieve a strong learner. Each tree focuses on the errors of the previous one, making GBDT highly efficient and accurate for both classification and regression tasks.

  • How does K-means clustering determine the number of clusters?

    -K-means clustering does not automatically determine the number of clusters; it requires the number of clusters (K) to be predetermined by the user. The algorithm iteratively assigns data points to clusters and updates centroids until convergence is reached, aiming to group similar data points together.

  • What are the two key parameters of DBSCAN clustering, and how do they work?

    -DBSCAN clustering has two key parameters: EPS, which defines the neighborhood distance, and MinPts, the minimum number of points required to form a cluster. A point is a core point if at least MinPts points (including itself) fall within its EPS radius, a border point if it has fewer than MinPts points in its own neighborhood but lies within the EPS radius of a core point, and an outlier if it is not reachable from any core point.

  • What is the main goal of Principal Component Analysis (PCA)?

    -The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset by deriving new features, called principal components, that explain as much variance within the original data as possible while using fewer features than the original dataset.

Outlines

00:00

🤖 Overview of Machine Learning Algorithms

An algorithm is a set of commands that a computer follows to perform tasks. Linear regression models the relationship between a continuous target variable and independent variables by fitting a linear equation to data, minimizing the sum of squares of distances between data points and the regression line. Support Vector Machines (SVM) draw decision boundaries to classify data, maximizing the distance to support vectors. Naive Bayes, based on Bayes' theorem, assumes feature independence and is fast but less accurate. Logistic regression is used for binary classification, employing a logistic function to map values between 0 and 1. K-Nearest Neighbors (KNN) uses nearby data points to classify or predict values, with optimal K value crucial for avoiding overfitting or underfitting.

05:01

🔍 KNN and Decision Trees

KNN determines the class of a data point based on majority voting of its neighbors, with optimal K value essential to balance specificity and generalization. Decision trees partition data by iteratively asking questions, aiming to increase node purity and predictiveness. Overfitting occurs if the tree becomes too specific, requiring ensemble methods like Random Forests, which use multiple decision trees for improved accuracy and reduced overfitting. Random Forests employ bagging, bootstrapping, and feature randomness to achieve uncorrelated trees and higher accuracy.

10:02

🌲 Boosting and Clustering

Gradient Boosted Decision Trees (GBDT) use boosting to combine weak learners into a strong model, with each tree correcting errors of the previous ones. K-Means clustering partitions data into clusters by iteratively adjusting centroids based on data point distances. DBSCAN clustering finds arbitrary shaped clusters and detects outliers based on density. Principal Component Analysis (PCA) reduces dimensionality by deriving new features that retain most of the original data's variance, often used as a pre-processing step for supervised learning algorithms.

Keywords

💡Algorithm

An algorithm is a set of commands that a computer follows to perform calculations or solve problems. In the context of the video, it refers to various machine learning algorithms used for different tasks like regression, classification, and clustering.

💡Linear Regression

Linear regression is a supervised learning algorithm that models the relationship between a continuous target variable and one or more independent variables by fitting a linear equation to the data. It is illustrated in the video with a chart of data points and a regression line that minimizes the sum of squares of the distances between the points and the line.
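
The video illustrates this with a chart rather than code. Purely as a hedged sketch of the same idea (not from the video), the least-squares fit can be reproduced with scikit-learn on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one independent variable x and a noisy continuous target y.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 2.5 * x[:, 0] + 1.0 + rng.normal(scale=1.0, size=50)

# Fitting minimizes the sum of squared distances between the points and the line.
model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction at x=4:", model.predict([[4.0]])[0])
```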

💡Support Vector Machine (SVM)

SVM is a supervised learning algorithm used mainly for classification but also for regression tasks. It works by drawing a decision boundary in an N-dimensional space that maximizes the distance to the nearest data points (support vectors). The video explains how SVM handles high-dimensional data efficiently.
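
As an illustration not shown in the video, a linear-kernel SVM in scikit-learn exposes the fitted support vectors directly; the 2D data below (length/width style features) are invented:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data with two classes.
X = np.array([[1.0, 1.2], [1.5, 1.8], [1.1, 0.9],   # class 0
              [3.0, 3.2], [3.5, 2.9], [3.2, 3.6]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel draws a straight decision boundary that maximizes the
# margin to the support vectors; C controls how hard that margin is.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:\n", clf.support_vectors_)
print("predicted class for [2.0, 2.0]:", clf.predict([[2.0, 2.0]])[0])
```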

💡Naive Bayes

Naive Bayes is a supervised learning algorithm used for classification tasks. It assumes that features are independent, which simplifies computation but can reduce accuracy. The video describes how it uses Bayes' theorem to calculate the probability of a class given a set of features.
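
A minimal sketch of the same idea (mine, not the video's) using scikit-learn's Gaussian naive Bayes on toy data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two features assumed (naively) to be independent given the class.
X = np.array([[1.0, 0.5], [1.2, 0.7], [0.9, 0.6],
              [3.0, 2.5], [3.2, 2.8], [2.9, 2.6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
# predict_proba applies Bayes' theorem under the independence assumption:
# P(class | features) is proportional to P(features | class) * P(class).
print(clf.predict_proba([[2.0, 1.5]]))
print(clf.predict([[2.0, 1.5]]))
```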

💡Logistic Regression

Logistic regression is a supervised learning algorithm used for binary classification problems. It uses the logistic function to map input values to probabilities between 0 and 1, making it suitable for tasks like spam detection and customer churn prediction, as explained in the video.
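
The sigmoid mapping described here can be written in a few lines of NumPy; the weights and feature values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Suppose a fitted linear equation z = w*x + b (weights here are invented).
w, b = 1.3, -4.0
x = np.array([1.0, 3.0, 5.0])                    # some spam-related feature
probabilities = sigmoid(w * x + b)               # probability of positive class
predictions = (probabilities > 0.5).astype(int)  # threshold at 50%
print(probabilities)  # roughly [0.06 0.48 0.92]
print(predictions)    # [0 0 1]
```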

💡K-Nearest Neighbors (KNN)

KNN is a supervised learning algorithm used for both classification and regression. It determines the class or value of a data point based on the majority class or mean value of its nearest neighbors. The video highlights the importance of choosing an optimal K value to balance specificity and generalization.
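
A small scikit-learn sketch, not from the video, showing how the prediction for one point comes from a majority vote among the K nearest training points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1],      # class 0
              [6, 6], [6, 7], [7, 6]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# Too small a K overfits (sensitive to noise), too large a K underfits.
for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print("k =", k, "-> predicted class for [3, 3]:", knn.predict([[3, 3]])[0])
```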

💡Decision Tree

A decision tree is a model that uses a tree-like structure to make decisions by asking a series of questions to partition data. The video explains how decision trees can overfit data if they become too specific, and how they are used in ensemble methods like random forests to improve accuracy.
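
As an illustrative sketch (the churn feature names and values are invented, not taken from the video), scikit-learn can fit and print a small tree of yes/no questions:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy churn-like data: [monthly_charges, tenure_months] -> churned (1) or not (0).
X = [[80, 2], [75, 3], [90, 1], [20, 30], [25, 40], [30, 36]]
y = [1, 1, 1, 0, 0, 0]

# Limiting depth keeps the tree from asking questions until every node is
# pure, which is the overfitting scenario described above.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["monthly_charges", "tenure_months"]))
```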

💡Random Forest

Random forest is an ensemble method that combines multiple decision trees using bagging to improve accuracy and reduce overfitting. Each tree is trained on a random subset of the data, and the final prediction is based on the majority vote or average of the trees' predictions. The video describes its advantages over single decision trees.
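
A minimal scikit-learn example, assuming synthetic data in place of a real problem, showing the bagging-related knobs mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of bagged trees; max_features controls the feature
# randomness that keeps the trees uncorrelated.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```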

💡Gradient Boosted Decision Trees (GBDT)

GBDT is an ensemble method that combines decision trees in a sequential manner to correct errors made by previous trees. It is more accurate than random forests but requires careful tuning to avoid overfitting. The video discusses its efficiency in both classification and regression tasks.
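
A hedged scikit-learn sketch of sequential boosting on synthetic data (the hyperparameter values are arbitrary illustrations, not recommendations from the video):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Trees are added one after another, each fitting the errors of the ensemble
# so far; learning_rate and n_estimators are the main knobs to tune against
# overfitting.
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=1)
gbdt.fit(X_train, y_train)
print("test accuracy:", gbdt.score(X_test, y_test))
```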

💡K-Means Clustering

K-means clustering is an unsupervised learning algorithm that partitions data into K clusters by minimizing the distance between data points and the cluster centroids. The video explains the iterative process of selecting centroids, assigning points to clusters, and recalculating centroids until convergence.
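
A short scikit-learn sketch of the iterative procedure described above; the two blobs of unlabeled points are generated for illustration, not taken from the video:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points forming two loose groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

# K must be chosen up front; k-means++ initialization places the starting
# centroids in a "smart" way, then assignment and centroid updates repeat
# until the cluster centers stop moving.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print("centroids:\n", kmeans.cluster_centers_)
print("first five labels:", kmeans.labels_[:5])
```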

Highlights

An algorithm is a finite set of instructions for a computer to perform calculations or problem-solving operations.

Linear regression models the relationship between a continuous target variable and one or more independent variables by fitting a linear equation to the data.

Support Vector Machine (SVM) is used for classification and regression tasks by drawing a decision boundary to distinguish classes.

SVM maximizes the distance to support vectors when drawing the decision boundary to avoid sensitivity to noise and improve generalization.

Naive Bayes classifier assumes feature independence and uses Bayes' theorem for classification tasks.

Logistic regression is used for binary classification problems and is based on the logistic function to map real values to a probability between 0 and 1.

K-Nearest Neighbors (KNN) algorithm determines the class of a data point based on the majority voting principle of the closest points.

Choosing an optimal K value in KNN is crucial to avoid overfitting or underfitting.

Decision trees partition data by asking iterative questions and are prone to overfitting without proper ensemble techniques.

Random Forest is an ensemble of decision trees that reduces the risk of overfitting and improves accuracy.

Gradient Boosted Decision Trees (GBDT) combines individual decision trees using boosting methods for efficient and accurate predictions.

K-means clustering is an unsupervised learning method that groups data points based on similarities and requires the number of clusters to be predetermined.

DBSCAN is a density-based clustering algorithm that can find arbitrary shaped clusters and is robust to outliers.
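
As an illustration not taken from the video, scikit-learn's DBSCAN labels noise points with -1; the eps and min_samples values below are arbitrary choices for this toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a far-away point that should be flagged as noise.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
              [5.0, 5.0], [5.1, 5.0], [4.9, 5.1], [5.0, 4.9],
              [9.0, 0.0]])

# eps is the neighborhood radius, min_samples the MinPts threshold for a
# core point; no number of clusters is specified in advance.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # points labeled -1 are outliers (noise)
```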

Principal Component Analysis (PCA) is a dimensionality reduction technique that derives new features to retain significant variance with fewer dimensions.

PCA is widely used as a preprocessing step for supervised learning algorithms to enhance performance.

The order of principal components in PCA is determined by the fraction of variance they explain.
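
A minimal scikit-learn sketch (using the classic iris data set as a stand-in, which the video does not mention) showing the explained-variance ordering of the components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Four original features reduced to two principal components.
X = load_iris().data
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Components are ordered by the fraction of the original variance they
# explain; for iris the first two retain well over 90% of it.
print(pca.explained_variance_ratio_)
print(X_reduced.shape)  # (150, 2)
```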

Each machine learning algorithm has unique characteristics and applications, making them suitable for different types of problems and data sets.

Transcripts

[00:00] Every single machine learning algorithm, explained. In case you don't know, an algorithm is a set of commands that must be followed for a computer to perform calculations or other problem-solving operations. According to its formal definition, an algorithm is a finite set of instructions carried out in a specific order to perform a particular task. It is not an entire program or code; it is the simple logic for a problem.

[00:21] Linear regression is a supervised learning algorithm that tries to model the relationship between a continuous target variable and one or more independent variables by fitting a linear equation to the data. Take this chart of dots, for example: a linear regression model tries to fit a regression line to the data points that best represents the relations or correlations. With this method, the best regression line is found by minimizing the sum of squares of the distances between the data points and the regression line. So, for these data points, the regression line looks like this.

[00:51] Support vector machine, or SVM for short, is a supervised learning algorithm that is mostly used for classification tasks but is also suitable for regression tasks. SVM distinguishes classes by drawing a decision boundary. How to draw or determine the decision boundary is the most critical part of SVM algorithms. Before creating the decision boundary, each observation or data point is plotted in N-dimensional space, with N being the number of features used. For example, if we use length and width to classify different cells, observations are plotted in a two-dimensional space and the decision boundary is a line. If we use three features, the decision boundary is a plane in three-dimensional space; if we use more than three features, the decision boundary becomes a hyperplane, which is really hard to visualize. The decision boundary is drawn in a way that the distance to the support vectors is maximized. If the decision boundary is too close to a support vector, it will be highly sensitive to noise and will not generalize well; even very small changes to the independent variables may cause a misclassification. SVM is especially effective in cases where the number of dimensions is greater than the number of samples. When finding the decision boundary, SVM uses a subset of training points rather than all points, which makes it memory efficient. On the other hand, training time increases for large data sets, which negatively affects performance.
[02:12] Naive Bayes is a supervised learning algorithm used for classification tasks; hence it is also called the naive Bayes classifier. Naive Bayes assumes that features are independent of each other and that there is no correlation between features. However, this is not the case in real life; this naive assumption of features being uncorrelated is the reason the algorithm is called naive. The intuition behind the naive Bayes algorithm is Bayes' theorem: P(A|B) is the probability of event A given that event B has already occurred, P(B|A) is the probability of event B given that event A has already occurred, P(A) is the probability of event A, and P(B) is the probability of event B. The naive Bayes classifier calculates the probability of a class given a set of feature values. The assumption that all features are independent makes naive Bayes very fast compared to more complicated algorithms, and in some cases speed is preferred over higher accuracy. On the other hand, the same assumption makes naive Bayes less accurate than more complicated algorithms.

[03:14] Logistic regression is a supervised learning algorithm which is mostly used for binary classification problems. Logistic regression is a simple yet very effective classification algorithm, so it is commonly used for many binary classification tasks. Customer churn, spam email, and website or ad click predictions are some examples of the areas where logistic regression offers a powerful solution. The basis of logistic regression is the logistic function, also called the sigmoid function, which takes any real-valued number and maps it to a value between 0 and 1. Let's consider we have the following linear equation to solve: the logistic regression model takes a linear equation as input and uses the logistic function and log odds to perform a binary classification task, and then we get the famous S-shaped graph of logistic regression. We can use the calculated probability as is; for example, the output can be "the probability that this email is spam is 95%" or "the probability that the customer will click on this ad is 70%". However, in most cases the probabilities are used to classify data points: for example, if the probability is greater than 50%, the prediction is the positive class (1); otherwise the prediction is the negative class (0).
[04:28] K-nearest neighbors, or KNN for short, is a supervised learning algorithm that can be used to solve both classification and regression tasks. The main idea behind KNN is that the class or value of a data point is determined by the data points around it. The KNN classifier determines the class of a data point by the majority voting principle: for instance, if K is set to five, the classes of the five closest points are checked and the prediction is made according to the majority class. Similarly, KNN regression takes the mean value of the five closest points. Let's go over an example: consider the following data points that belong to four different classes, and let's see how the predicted classes change according to the K value. It is very important to determine an optimal K value. If K is too low, the model is too specific and does not generalize well; it also tends to be too sensitive to noise. The model achieves high accuracy on the training set but will be a poor predictor on new, previously unseen data points, so we are likely to end up with an overfit model. On the other hand, if K is too large, the model is too generalized and is not a good predictor on either the training or the test set. This situation is known as underfitting. KNN is simple and easy to interpret, and it does not make any assumptions, so it can be used for nonlinear tasks. However, KNN becomes very slow as the number of data points increases, because the model needs to store all data points, so it is not memory efficient. Another downside of KNN is that it is sensitive to outliers.

[05:55] Decision trees work by iteratively asking questions to partition data. It is easier to conceptualize the partitioning with a visual representation of a decision tree. This one represents a decision tree to predict customer churn: the first split is based on the monthly charges amount, and then the algorithm keeps asking questions to separate the class labels. The questions get more specific as the tree gets deeper. The aim is to increase the predictiveness as much as possible at each partitioning so that the model keeps gaining information about the data set. Randomly splitting on a feature does not usually give us valuable insight into the data set; it is the splits that increase the purity of the nodes that are most informative. The purity of a node is inversely proportional to the distribution of different classes in that node, and the questions to ask are chosen in a way that increases purity or decreases impurity. But how many questions do we ask? When do we stop? When is our tree sufficient to solve our classification problem? The answer to all of these questions leads us to one of the most important concepts in machine learning: overfitting. The model can keep asking questions until all nodes are pure; however, this would be a too-specific model that would not generalize well. It achieves high accuracy on the training set but performs poorly on new, previously unseen data points, which indicates overfitting. The decision tree algorithm usually does not require normalizing or scaling features, and it is also suitable for working on a mixture of feature data types. On the negative side, it is prone to overfitting and needs to be ensembled in order to generalize well.
[07:21] Random forest is an ensemble of many decision trees. Random forests are built using a method called bagging, in which decision trees are used as parallel estimators. If used for a classification problem, the result is based on the majority vote of the results received from each decision tree. For regression, the prediction of a leaf node is the mean value of the target values in that leaf, and random forest regression takes the mean of the values returned by the individual decision trees. Random forests reduce the risk of overfitting, and their accuracy is much higher than that of a single decision tree. Furthermore, the decision trees in a random forest run in parallel, so that time does not become a bottleneck. The success of a random forest highly depends on using uncorrelated decision trees: if we use the same or very similar trees, the overall result will not be much different from the result of a single decision tree. Random forests achieve uncorrelated decision trees through bootstrapping and feature randomness. Bootstrapping is randomly selecting samples from the training data with replacement; these are called bootstrap samples. Feature randomness is achieved by selecting features randomly for each decision tree in the random forest; the number of features used for each tree can be controlled with the max_features parameter. Random forest is a highly accurate model on many different problems and does not require normalization or scaling. However, it is not a good choice for high-dimensional data sets compared to fast linear models.

[08:43] Gradient boosted decision trees, or GBDT for short, is an ensemble algorithm which uses boosting to combine individual decision trees. Boosting means combining a learning algorithm in series to achieve a strong learner from many sequentially connected weak learners. In the case of GBDT, the weak learners are decision trees, and each tree attempts to minimize the errors of the previous tree. The trees in boosting are weak learners, but adding many trees in series, each focusing on the errors of the previous one, makes boosting a highly efficient and accurate model. Unlike bagging, boosting does not involve bootstrap sampling: every time a new tree is added, it fits on a modified version of the initial data set. Since trees are added sequentially, boosting algorithms learn slowly, and in statistical learning, models that learn slowly tend to perform better. GBDT is very efficient on both classification and regression tasks and provides more accurate predictions compared to random forests. It can handle mixed types of features, and no pre-processing is needed. GBDT does require careful tuning of hyperparameters in order to prevent the model from overfitting.
[09:48] K-means clustering. Clustering is a way to group a set of data points so that similar data points are grouped together; therefore, clustering algorithms look for similarities or dissimilarities among data points. Clustering is an unsupervised learning method, so there is no label associated with the data points, and clustering algorithms try to find the underlying structure of the data. By contrast, observations or data points in a classification task have labels: each observation is classified according to some measurements, classification algorithms try to model the relationship between the measurements on the observations and their assigned class, and the model then predicts the class of new observations. K-means clustering aims to partition data into K clusters in a way that data points in the same cluster are similar and data points in different clusters are farther apart; thus it is a partition-based clustering technique. The similarity of two points is determined by the distance between them. Consider the following 2D visualization of a data set: it can be partitioned into four different clusters. Real-life data sets are much more complex, and their clusters are not clearly separated, but the algorithm works in the same way. K-means is an iterative process built on the expectation-maximization algorithm. After the number of clusters is determined, it works by executing the following steps: (1) it randomly selects the centroid, or center, of each cluster; (2) it calculates the distance of all data points to the centroids; (3) it assigns each data point to the closest cluster; (4) it finds the new centroid of each cluster by taking the mean of all data points in the cluster; and it repeats steps 2, 3, and 4 until all points converge and the cluster centers stop moving. K-means clustering is relatively fast and easy to interpret, and it is able to choose the positions of the initial centroids in a smart way that speeds up convergence. The one challenge with K-means is that the number of clusters must be predetermined; the K-means algorithm is not able to guess how many clusters exist in the data. If there is a nonlinear structure separating the groups in the data, K-means will not be a good choice.

[11:46] DBSCAN clustering. Partition-based and hierarchical clustering techniques are highly efficient with normally shaped clusters; however, when it comes to arbitrarily shaped clusters or detecting outliers, density-based techniques are more efficient. DBSCAN stands for density-based spatial clustering of applications with noise, and it is able to find arbitrarily shaped clusters and clusters with noise. The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster. There are two key parameters of DBSCAN: eps, the distance that specifies the neighborhood (two points are considered neighbors if the distance between them is less than or equal to eps), and minPts, the minimum number of data points required to define a cluster. Based on these two parameters, points are classified as core points, border points, or outliers. A point is a core point if there are at least minPts points, including the point itself, within its surrounding area of radius eps. A point is a border point if it is reachable from a core point but has fewer than minPts points within its surrounding area. And a point is an outlier if it is not a core point and is not reachable from any core point. DBSCAN does not require the number of clusters to be specified beforehand, and it is robust to outliers and able to detect them. In some cases, determining an appropriate neighborhood distance eps is not easy, and it requires domain knowledge.
[13:14] Principal component analysis, or PCA, is a dimensionality reduction algorithm which basically derives new features from the existing ones while keeping as much information as possible. PCA is an unsupervised learning algorithm, but it is also widely used as a pre-processing step for supervised learning algorithms. PCA derives the new features by finding the relations among the features in a data set. The aim of PCA is to explain the variance within the original data set as much as possible while using fewer features. The new derived features are called principal components, and their order is determined by the fraction of the variance of the original data set that they explain. The advantage of PCA is that a significant amount of the variance of the original data set is retained using a much smaller number of features than in the original data set; the principal components are ordered according to the amount of variance they explain. And that is every common machine learning algorithm, explained.


Related Tags
Machine Learning · Algorithms · Supervised Learning · Unsupervised Learning · Classification · Regression · K-Means · SVM · Random Forest · PCA · Clustering