Machine Learning Fundamentals: Cross Validation

StatQuest with Josh Starmer
24 Apr 2018 · 06:04

Summary

TL;DR: In this StatQuest video, Josh Starmer explains cross-validation, a technique for comparing different machine learning methods to predict heart disease using variables like chest pain and blood circulation. The video illustrates the importance of not using all data for training to avoid overfitting and introduces the concept of dividing data into blocks for training and testing, exemplified by four-fold cross-validation. It also touches on using cross-validation to find the best tuning parameters, like in Ridge regression, and concludes with an invitation to subscribe for more content.

Takeaways

  • 📚 Cross-validation is a technique used to compare different machine learning methods and assess their performance in practice.
  • 🔍 The purpose of using variables like chest pain and blood circulation is to predict heart disease, which is the main focus of the data in the script.
  • 🤖 Machine learning methods such as logistic regression, K nearest neighbors, and support vector machines are potential options for analysis.
  • 🚫 Using all data for training and no data for testing is not advisable as it doesn't allow for method evaluation on unseen data.
  • 📈 Splitting data into training and testing sets, such as 75% for training and 25% for testing, is a common approach but not always optimal.
  • 🔄 Cross-validation systematically uses all subsets of data for both training and testing to ensure a fair evaluation of the machine learning methods.
  • 📉 In the provided example, four-fold cross-validation is used, but the number of folds can vary based on the dataset and the analysis needs.
  • 👉 'Leave One Out Cross Validation' is an extreme form of cross-validation where each sample is used for testing once and the rest for training.
  • 🔢 10-fold cross-validation is a common practice that divides the data into ten parts, using nine for training and one for testing in each iteration.
  • 🛠 Cross-validation can also help in tuning hyperparameters of machine learning models, such as the tuning parameter in Ridge regression.
  • 🎓 The script is educational, aiming to teach viewers about cross-validation through an example and encouraging them to subscribe for more content.

Q & A

  • What is the main topic discussed in the StatQuest video?

    -The main topic discussed in the video is cross-validation, a technique used in machine learning to compare and evaluate different machine learning methods.

  • What are the variables mentioned for predicting heart disease in the script?

    -The variables mentioned for predicting heart disease include chest pain and good blood circulation.

  • What is the purpose of training a machine learning algorithm?

    -The purpose of training a machine learning algorithm is to estimate its parameters using some of the available data, which helps in learning the underlying patterns in the data.

  • Why is it not ideal to use the same data for both training and testing a machine learning model?

    -Using the same data for both training and testing is not ideal because it does not provide an unbiased evaluation of the model's performance on new, unseen data.

  • What is the basic idea behind cross-validation?

    -The basic idea behind cross-validation is to use different subsets of the data for training and testing the machine learning model in a way that every data point gets to be in a test set exactly once.
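The rotation of blocks described in this answer can be sketched in a few lines of plain Python. `kfold_indices` is an illustrative helper (not something from the video): it splits `n` samples into `k` contiguous blocks so that every sample appears in a test set exactly once.

```python
# Minimal sketch of k-fold cross-validation splitting: divide n samples
# into k blocks so every sample lands in exactly one test set.
# Pure Python; fold sizes differ by at most one sample when k doesn't divide n.

def kfold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Four-fold split of 12 samples: each fold tests 3 samples and trains on 9.
folds = list(kfold_indices(12, 4))
```

Each `(train, test)` pair corresponds to one round of training and testing in the video's four-block picture; collecting the per-fold results and summarizing them at the end is what cross-validation adds on top of a single split.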

  • What is the term used for dividing the data into four parts for cross-validation?

    -The term used for dividing the data into four parts for cross-validation is four-fold cross-validation.

  • What is the term for the cross-validation technique where each individual sample is used as a test set?

    -The term for the cross-validation technique where each individual sample is used as a test set is 'Leave One Out Cross Validation'.

  • How many blocks are typically used in 10-fold cross-validation?

    -In 10-fold cross-validation, the data is divided into 10 blocks or subsets.

  • What is a tuning parameter in the context of machine learning?

    -A tuning parameter in machine learning is a parameter that is not estimated from the data but is set by the user or found through techniques like cross-validation to optimize the model's performance.

  • How can cross-validation help in finding the best value for a tuning parameter?

    -Cross-validation can help in finding the best value for a tuning parameter by systematically testing different values and evaluating the model's performance for each, thus identifying the value that yields the best results.
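This search can be sketched concretely. The snippet below is a toy illustration, not the video's method: it fits one-dimensional ridge regression (slope only), where the penalty `lam` shrinks the slope via `beta = sum(x*y) / (sum(x*x) + lam)`, and uses 4-fold cross-validation to score each candidate penalty. The names `fit_ridge` and `cv_error` are hypothetical.

```python
# Sketch: use cross-validation to pick a tuning parameter (here, the ridge
# penalty lam). Each candidate is scored by its average squared test error
# across the folds, and the candidate with the lowest error wins.

def fit_ridge(xs, ys, lam):
    # Closed-form 1-D ridge slope: larger lam shrinks the slope toward zero.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=4):
    """Average squared test error of ridge with penalty lam under k-fold CV."""
    n = len(xs)
    fold = n // k
    total, count = 0.0, 0
    for f in range(k):
        test = range(f * fold, (f + 1) * fold)
        train = [i for i in range(n) if i not in test]
        beta = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        for i in test:
            total += (ys[i] - beta * xs[i]) ** 2
            count += 1
    return total / count

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]  # roughly y = 2x
candidates = [0.0, 0.1, 1.0, 10.0]
best = min(candidates, key=lambda lam: cv_error(xs, ys, lam))
```

The same loop-over-candidates pattern underlies grid search in real libraries; only the model and the scoring function change.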

  • What is the final decision made in the script based on the cross-validation results?

    -The final decision made in the script is to use the support vector machine method for classification, as it performed the best in classifying the test data sets during cross-validation.

Outlines

00:00

🧑‍🏫 Introduction to Cross-Validation

Josh Starmer from StatQuest introduces the concept of cross-validation in machine learning. The video begins with a discussion on using various variables like chest pain and blood circulation to predict heart disease. It explains the need to choose an appropriate machine learning method, such as logistic regression, K-nearest neighbors, or support vector machines. Cross-validation is presented as a method to compare these machine learning methods to determine their effectiveness. The script outlines the importance of using data to both train and test machine learning algorithms, emphasizing that using the same data for both could lead to overfitting. It introduces the idea of dividing the data into blocks and using different combinations for training and testing to get a robust evaluation of the algorithms. The example of four-fold cross-validation is given, where the data is divided into four blocks, and each block is used once for testing while the others are used for training. The paragraph concludes with a decision to use the support vector machine based on its performance in classifying the test data.

05:04

🔍 Advanced Cross-Validation Techniques

In the second paragraph, the video script delves into advanced uses of cross-validation. It mentions the scenario where a machine learning method has a tuning parameter that needs to be optimized. An example given is Ridge regression, which has a tuning parameter that isn't estimated by the algorithm but is set by the user. The script explains how 10-fold cross-validation can be used to find the optimal value for such tuning parameters. The video ends with an invitation for viewers to subscribe for more content, to like the video, and to consider purchasing original songs by the presenter, Josh Starmer. The script wraps up with a light-hearted 'Double BAM!!!' and 'Tiny BAM!' to emphasize the completion of the tutorial and the excitement around the topic.

Keywords

💡Cross-validation

Cross-validation is a statistical method used to evaluate the performance of machine learning models on a dataset. It involves dividing the data into multiple subsets, training the model on some subsets, and testing it on the remaining subsets. This process is repeated multiple times with different subsets used for training and testing each time. In the video, cross-validation is used to compare different machine learning methods like logistic regression, K-nearest neighbors, and support vector machines to determine which one performs best. It ensures that the model's performance is not just due to chance or specific to a particular subset of data.
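The comparison described here can be sketched with two tiny stand-in classifiers, assuming nothing from the video: a 1-nearest-neighbor rule and a mean-threshold rule play the roles of KNN, logistic regression, and SVM, and 4-fold cross-validation scores each one on held-out data.

```python
# Sketch of cross-validation's main job in the video: score competing
# methods on held-out folds and keep the one that classifies best.
# Both classifiers and all names below are illustrative stand-ins.

def one_nn(train_x, train_y, x):
    # Predict the label of the closest training point.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mean_threshold(train_x, train_y, x):
    # Predict 1 when x exceeds the mean of the training inputs.
    return 1 if x > sum(train_x) / len(train_x) else 0

def cv_accuracy(method, xs, ys, k=4):
    """Fraction of samples classified correctly across all k test folds."""
    n = len(xs)
    fold = n // k
    correct = 0
    for f in range(k):
        test = set(range(f * fold, (f + 1) * fold))
        tr_x = [xs[i] for i in range(n) if i not in test]
        tr_y = [ys[i] for i in range(n) if i not in test]
        correct += sum(method(tr_x, tr_y, xs[i]) == ys[i] for i in test)
    return correct / n

# Toy well-separated data: class 0 near 0-1, class 1 near 4-5.
xs = [0.2, 0.5, 0.9, 1.1, 4.0, 4.2, 4.8, 5.1]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
scores = {m.__name__: cv_accuracy(m, xs, ys) for m in (one_nn, mean_threshold)}
# The method with the highest cross-validated accuracy would be the one chosen.
```

Because every sample is tested exactly once, the accuracies are comparable across methods rather than being an artifact of one lucky split.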

💡Machine Learning

Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computers to learn from and make predictions on data. In the context of the video, machine learning is used to predict whether a patient has heart disease based on variables like chest pain and blood circulation. The video discusses how to choose the best machine learning method for this task using cross-validation.

💡Logistic Regression

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). In the video, logistic regression is one of the machine learning methods considered for predicting heart disease. It is used to estimate the probability of the presence of heart disease based on the input variables.

💡K-Nearest Neighbors (KNN)

K-Nearest Neighbors is a simple, supervised machine learning algorithm that is used for classification and regression. It works by finding the 'k' closest data points (neighbors) to the given data point and predicting the output based on their values. In the video script, KNN is mentioned as one of the machine learning methods that could be used to predict heart disease, highlighting its use in classification tasks.

💡Support Vector Machines (SVM)

Support Vector Machines are a set of supervised learning methods used for classification and regression tasks. They are particularly effective in cases where the number of dimensions is greater than the number of samples. In the video, SVM is highlighted as one of the machine learning methods being evaluated. It is noted that in the example given, SVM performed the best in classifying the test data sets, making it the chosen method for predicting heart disease.

💡Parameters

In the context of machine learning, parameters are the values that are learned from the data during the training process. These values define the model and are used to make predictions on new data. The video explains that estimating parameters, or training the algorithm, is the first step in using machine learning methods. For example, in logistic regression, the parameters define the shape of the curve that is used to predict the outcome.

💡Training

Training in machine learning refers to the process of fitting the model to the data. It involves adjusting the parameters of the model so that it can make accurate predictions. In the video, training is mentioned as the first use of the data in the machine learning process, where the algorithm learns from the data to estimate its parameters.

💡Testing

Testing in machine learning is the process of evaluating the performance of a trained model on a separate dataset that was not used during the training process. This helps to determine how well the model generalizes to new, unseen data. The video script emphasizes the importance of testing to ensure that the machine learning method performs well on data it wasn't trained on.
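The simple fixed split the video starts from, before introducing cross-validation, can be sketched as follows; `split_train_test` is an illustrative helper, not from the video.

```python
# Sketch of the pre-cross-validation approach: train on the first 75% of
# the data and hold out the last 25% for testing.

def split_train_test(data, train_fraction=0.75):
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

data = list(range(100))            # stand-in for 100 patient records
train, test = split_train_test(data)
# The worry the video raises: why should *these* particular 25 records be
# the test set? Cross-validation answers by giving every block a turn.
```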

💡Four-fold Cross Validation

Four-fold cross-validation is a specific type of cross-validation where the dataset is divided into four parts. The model is trained on three parts and tested on the fourth part, and this process is repeated four times, with each part serving as the test set once. The video uses this method as an example to explain how cross-validation works, highlighting that each block of data is used for testing once.

💡Leave One Out Cross Validation

Leave One Out Cross Validation (LOOCV) is a technique where each sample is used once as a test set while the remaining samples form the training set. This is an extreme case of cross-validation where the number of blocks is equal to the number of samples. The video mentions LOOCV as an example of how cross-validation can be applied, emphasizing that each individual sample is tested individually.
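LOOCV is just k-fold cross-validation with k equal to the number of samples. A minimal sketch, using a deliberately trivial model (predict each held-out value by the mean of the rest) so the fold mechanics stay visible; `loocv_mean_error` is a hypothetical name.

```python
# Leave-one-out cross-validation: n samples give n folds, each fold
# testing exactly one sample against a model fit on the other n - 1.

def loocv_mean_error(ys):
    """Squared error of predicting each held-out value by the mean of the
    remaining values; returns one test result per sample."""
    n = len(ys)
    errors = []
    for i in range(n):
        train = [ys[j] for j in range(n) if j != i]
        pred = sum(train) / len(train)
        errors.append((ys[i] - pred) ** 2)
    return errors

errs = loocv_mean_error([1.0, 2.0, 3.0, 4.0])  # 4 samples -> 4 folds
```

With n samples this costs n model fits, which is why the video notes that 10-fold cross-validation is the more common compromise in practice.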

💡10-fold Cross Validation

10-fold cross-validation is a technique where the dataset is divided into ten parts, or folds. The model is trained on nine parts and tested on the tenth part, and this process is repeated ten times, with each part serving as the test set once. The video script notes that 10-fold cross-validation is commonly used in practice as it provides a good balance between training and testing while still maintaining a robust evaluation of the model's performance.

💡Tuning Parameter

A tuning parameter in machine learning is a parameter that is not learned from the data but is set by the practitioner. These parameters control the behavior of the model and can significantly affect its performance. In the video, ridge regression is given as an example of a method that has a tuning parameter. The video suggests using cross-validation to find the best value for this tuning parameter, which can optimize the model's performance.

Highlights

Introduction to cross-validation in machine learning by StatQuest with Josh Starmer.

Cross-validation helps decide the best machine learning method for a given dataset.

Data variables such as chest pain and blood circulation are used to predict heart disease.

Different machine learning methods like logistic regression, K nearest neighbors, and support vector machines are discussed.

The importance of training and testing algorithms to estimate parameters and evaluate performance.

Avoiding the use of all data for training to prevent lack of data for testing.

Using a fixed split of data for training and testing may not always be the best approach.

Cross-validation uses all data blocks one at a time for testing to ensure robust evaluation.

Four-fold cross-validation as an example of dividing data into blocks for validation.

The concept of 'Leave One Out Cross Validation' where each sample is used for testing individually.

10-fold cross-validation as a common practice in machine learning.

Cross-validation can also be used to find the best value for tuning parameters in algorithms.

Support vector machines being chosen as the best method for classifying test data sets in the example.

The practical application of cross-validation in choosing and tuning machine learning models.

StatQuest encourages viewers to subscribe for more educational content.

Invitation to support StatQuest by liking the video and considering purchasing original songs.

Transcripts

play00:00

StatQuest

play00:01

Check it out

play00:03

talking about

play00:05

Machine-learning. Yeah StatQuest

play00:08

Check it out

play00:09

Talking about cross-validation

play00:12

StatQuest

play00:15

Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about cross-validation, and it's gonna be clearly explained.

play00:25

Okay, let's start with some data

play00:28

We want to use the variables chest pain good blood circulation

play00:33

Etc

play00:34

To predict if someone has heart disease

play00:37

Then when a new patient shows up

play00:40

we can measure these variables and

play00:43

Predict if they have heart disease or not

play00:47

However, first we have to decide which machine learning method would be best

play00:53

we could use logistic regression or

play00:56

K nearest neighbors

play00:59

Or support vector machines and

play01:03

Many more machine learning methods. How do we decide which one to use?

play01:09

Cross-validation allows us to compare different machine learning methods and get a sense of how well they will work in practice

play01:19

Imagine that this blue column represented all of the data that we have collected about people with and without heart disease

play01:27

We need to do two things with this data

play01:30

One, we need to estimate the parameters for the machine learning methods.

play01:36

In other words to use logistic regression we have to use some of the data to estimate the shape of this curve

play01:44

in machine learning lingo

play01:47

Estimating parameters is called training the algorithm

play01:51

The second thing we need to do with this data is evaluate how well the machine learning methods work.

play01:58

In other words, we need to find out if this curve will do a good job categorizing new data.

play02:06

In machine learning lingo

play02:09

Evaluating a method is called testing the algorithm

play02:13

Thus, using machine learning lingo, we need the data to:

play02:18

1) train the machine learning methods, and

play02:22

2) test the machine learning methods.

play02:27

A terrible approach would be to use all the data to estimate the parameters, i.e., to train the algorithm,

play02:35

Because then we wouldn't have any data left to test the method

play02:40

Reusing the same data for both training and

play02:43

Testing is a bad idea because we need to know how the method will work on data it wasn't trained on.

play02:52

A slightly better idea would be to use the first seventy-five percent of the data for training and the last 25% of the data for testing.

play02:59

play03:02

We could then compare methods by seeing how well each one categorized the test data

play03:09

But how do we know that using the first

play03:11

seventy-five percent of the data for training and the last 25% of the data for testing is the best way to divide up the data?

play03:21

What if we use the first 25% of the data for testing

play03:26

Or what about one of these middle blocks?

play03:29

Rather than worry too much about which block would be best for testing, cross-validation uses them all, one at a time, and summarizes the results at the end.

play03:34

play03:38

play03:41

For example cross-validation would start by using the first three blocks to train the method and

play03:49

then use the last block to test the method and

play03:53

Then it keeps track of how well the method did with the test data

play03:58

then it uses this combination of blocks to train the method and

play04:03

this block is used for testing and

play04:07

Then it keeps track of how well the method did with the test data, etc

play04:12

Etc, etc

play04:14

play04:16

In the end, every block of data is used for testing, and we can compare methods by seeing how well they performed.

play04:25

In this case, since the support vector machine did the best job classifying the test data sets, we'll use it.

play04:33

BAM!!!

play04:36

Note: in this example, we divided the data into 4 blocks. This is called four-fold cross validation

play04:45

However, the number of blocks is arbitrary

play04:49

In an extreme case we could call each individual patient (or sample) a block

play04:56

This is called "Leave One Out Cross Validation"

play04:59

Each sample is tested individually

play05:03

That said in practice it is very common to divide the data into ten blocks. This is called 10-fold cross-validation

play05:14

Double BAM!!!

play05:16

One last note before we're done

play05:20

Say we wanted to use a method that involves a tuning parameter, a parameter that isn't estimated but is just sort of guessed.

play05:28

For example Ridge regression has a tuning parameter

play05:33

Then we could use 10-fold cross validation

play05:36

to help find the best value for that tuning parameter

play05:40

Tiny Bam!

play05:42

Hooray, we've made it to the end of another exciting StatQuest! If you like this StatQuest and want to see more, please subscribe.

play05:50

And if you want to support StatQuest well

play05:54

Please click the like button down below and consider buying one of my original songs

play05:59

Alright, until next time, quest on!

