How is data prepared for machine learning?

AltexSoft
31 Aug 2021 · 13:57

Summary

TL;DR: The video script delves into Amazon's scrapped AI recruitment tool, highlighting how a faulty dataset led to gender bias. It underscores the pivotal role of data quality and preparation in machine learning, illustrating the importance of data quantity, relevance, labeling, and cleansing. The script also touches on data reduction, wrangling, and feature engineering, emphasizing that despite the challenges, meticulous data handling is crucial for successful ML projects.

Takeaways

  • 🤖 In 2014, Amazon developed an AI recruitment tool that was designed to score job applicants but was found to be biased against women, illustrating the risks of using machine learning on skewed datasets.
  • 🚫 The Amazon AI recruitment tool was shut down in 2018 due to its sexist tendencies, which were a result of being trained on a predominantly male dataset.
  • 🧠 The success of machine learning projects heavily relies on the quality and representativeness of the training data, as highlighted by the Amazon case.
  • 📊 The amount of data needed for training a machine learning model can vary greatly, from hundreds to trillions of examples, depending on the complexity of the task.
  • 🔍 The principle 'garbage in, garbage out' is crucial in machine learning, emphasizing that poor quality data leads to poor model performance.
  • 📈 Data preparation is a critical and time-consuming step in machine learning, often accounting for up to 80% of a data science project's time.
  • 🏗️ Data scientists must transform raw data into a usable format through processes like labeling, reduction, cleansing, wrangling, and feature engineering.
  • 🔄 Labeling involves assigning correct answers to data samples, which is essential for supervised learning but can be prone to errors if not double-checked.
  • 🧼 Data cleansing is necessary to remove or correct corrupted, incomplete, or inaccurate data to prevent a model from learning incorrect patterns.
  • 🔗 Feature engineering involves creating new features from existing data, which can improve a model's predictive power by making the data more informative.
  • 🌐 The importance of data's relevance to the task at hand cannot be overstated, as using inappropriate data can lead to inaccurate and biased models.

Q & A

  • What was the purpose of Amazon's experimental ML-driven recruitment tool?

    -Amazon's experimental ML-driven recruitment tool was designed to screen resumes and give job applicants scores ranging from one to five stars, similar to the Amazon rating system, to help identify the best candidates.

  • Why did Amazon's machine learning model for recruitment turn out to be biased?

    -Amazon's machine learning model for recruitment became biased because it was trained on a dataset that predominantly consisted of resumes from men, which led the model to penalize resumes containing the word 'women's'.

  • What is the significance of data quality in machine learning projects?

    -Data quality is crucial in machine learning projects as it directly influences the model's performance. The principle 'garbage in, garbage out' applies, meaning that feeding a model with inaccurate or poor quality data will result in poor outcomes, regardless of the model's sophistication or the data scientists' expertise.

  • What is the role of data preparation in machine learning?

    -Data preparation is a critical step in machine learning, accounting for up to 80% of the time in a data science project. It involves transforming raw data into a form that best describes the underlying problem to a model and includes processes like labeling, data reduction, cleansing, wrangling, and feature engineering.

  • How does the size of the training dataset impact machine learning models?

    -The size of the training dataset can significantly impact machine learning models. While there is no one-size-fits-all formula, generally, the more data collected, the better, as it is difficult to predict which data samples will bring the most value. However, the quality and relevance of the data are also crucial.

  • What is dimensionality reduction and why is it important in machine learning?

    -Dimensionality reduction is the process of reducing the number of random variables under consideration, which can involve removing irrelevant features or combining features that contain similar information. It is important because it can improve the performance of machine learning algorithms by reducing complexity and computational resources required.

  • Why is data labeling necessary in supervised machine learning?

    -Data labeling is necessary in supervised machine learning because it provides the model with the correct answers to the given problem. By assigning corresponding labels within a dataset, the model learns to recognize patterns and make predictions on new, unseen data.

  • How can data cleansing help improve the performance of machine learning models?

    -Data cleansing helps improve the performance of machine learning models by removing or correcting incomplete, corrupted, or inaccurate data. By ensuring that the data fed into the model is clean and accurate, the model can make more reliable predictions and avoid being misled by poor quality data.

  • What is feature engineering and how does it contribute to machine learning?

    -Feature engineering is the process of using domain knowledge to select or construct features that make machine learning algorithms work. It contributes to machine learning by creating new features that can better represent the underlying problem, thus potentially improving the model's performance and accuracy.

  • How does data normalization help in machine learning?

    -Data normalization helps in machine learning by scaling the data to a common range, such as 0.0 to 1.0. This ensures that each feature contributes equally to the model's performance, preventing issues where features with larger numerical values might be considered more important than they actually are.

  • What are some challenges faced during data preparation for machine learning?

    -Challenges faced during data preparation for machine learning include determining the right amount of data, ensuring data quality and relevance, dealing with imbalanced datasets, handling missing or corrupted data, and the time-consuming nature of the process. Addressing these challenges is key to the success of machine learning projects.
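
One of these challenges, the imbalanced dataset problem, can be illustrated with a short Python sketch. This is a minimal, hypothetical example (not from the video): it uses pandas and random oversampling to balance a skewed label column before training, loosely echoing the Amazon case.

    import pandas as pd

    # Hypothetical resume data with a heavily skewed label column,
    # loosely echoing the Amazon example from the video.
    df = pd.DataFrame({
        "years_experience": [3, 7, 2, 10, 4, 6, 8, 1, 5, 9],
        "label": ["m", "m", "m", "m", "m", "m", "m", "m", "f", "f"],
    })

    majority = df[df["label"] == "m"]
    minority = df[df["label"] == "f"]

    # Random oversampling: draw minority-class rows with replacement
    # until both classes are the same size, then shuffle.
    minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
    balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

    print(balanced["label"].value_counts())  # both classes now equally represented

Undersampling the majority class, or dedicated libraries such as imbalanced-learn (e.g. SMOTE), are common alternatives to this simple approach.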

Outlines

00:00

🤖 The Pitfalls of Amazon's AI Recruitment Tool

Amazon's experimental machine learning recruitment tool, which aimed to score job applicants similarly to its rating system, was discontinued due to gender bias. The model was trained on a decade's worth of resumes, predominantly from men, leading to skewed results. This highlights the critical importance of data quality in machine learning, as a faulty dataset can lead to significant issues. The video emphasizes the need for diverse and representative data to prevent algorithmic bias.

05:01

📈 The Crucial Role of Data in Machine Learning

The video discusses the preparatory steps for machine learning, starting with defining the problem and collecting a training dataset. It stresses that there is no universal formula for the optimal dataset size, as it depends on various factors, including the problem's complexity and the learning algorithm used. Examples are given, such as Gmail's Smart Reply (trained on 238 million sample messages) and Google Translate (trillions of examples), alongside a neural network that predicted concrete strength from only 630 samples. The video also points out that data quantity and quality are both essential, with the latter being particularly important to avoid 'garbage in, garbage out' scenarios.

10:01

🔍 Data Preparation Techniques for Machine Learning

This section delves into the intricacies of data preparation for machine learning, including labeling, reduction, cleansing, and wrangling. Labeling involves assigning correct answers to examples, akin to teaching a child. Data reduction and cleansing involve removing irrelevant or corrupt data to enhance model performance. The video also touches on dimensionality reduction to simplify complex data and sampling to manage large datasets. It concludes by emphasizing the importance of data preparation in machine learning, which can consume up to 80% of a data science project's time.

Keywords

💡Machine Learning

Machine learning is a subset of artificial intelligence that provides systems the ability to learn and improve from experience without being explicitly programmed. In the context of the video, machine learning is used to develop a recruitment tool that scores job applicants. The video highlights how the Amazon AI project failed due to biases in the machine learning model, which underscores the importance of training data in shaping the outcomes of machine learning algorithms.

💡Bias

Bias in machine learning refers to the prejudice or unfair preference shown by an algorithm towards certain outcomes. The video script discusses how Amazon's machine learning model was biased against women, as it was trained on a dataset predominantly consisting of male applicants, thus illustrating the concept of bias in AI systems.

💡Dataset

A dataset is a collection of data that has been gathered and stored for analysis. The video emphasizes the importance of the quality and representativeness of a dataset in machine learning. It points out that Amazon's AI tool was flawed because it was trained on a 'faulty dataset' that was imbalanced and did not accurately represent the diversity of potential candidates.

💡Data Preparation

Data preparation involves the processes of cleaning, transforming, and reducing data to make it suitable for analysis. The video script explains that data preparation is a crucial step in machine learning, taking up to 80% of a data science project's time. It includes labeling, reduction, cleansing, and wrangling to ensure the data is accurate and relevant for the model.

💡Labeling

Labeling in machine learning is the process of assigning a label or category to each data instance in a dataset. The video uses the analogy of teaching a child to recognize apples by showing them labeled pictures. In machine learning, labeling is essential for supervised learning, where the model learns from examples with known outcomes.
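
As a rough illustration of the cross-labeling idea, the Python sketch below (hypothetical file names and labels, not from the video) keeps only the samples two annotators agree on and flags the rest for review.

    # Labels assigned independently by two annotators (hypothetical data).
    annotator_a = {"img_001.jpg": "apple", "img_002.jpg": "apple", "img_003.jpg": "peach"}
    annotator_b = {"img_001.jpg": "apple", "img_002.jpg": "peach", "img_003.jpg": "peach"}

    # Keep samples where both annotators agree; send the rest back for re-labeling.
    agreed, disputed = {}, []
    for image, label in annotator_a.items():
        if annotator_b.get(image) == label:
            agreed[image] = label
        else:
            disputed.append(image)

    print("usable labels:", agreed)        # {'img_001.jpg': 'apple', 'img_003.jpg': 'peach'}
    print("needs re-labeling:", disputed)  # ['img_002.jpg']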

💡Feature

In machine learning, a feature is an individual measurable property or characteristic of a phenomenon being observed. The video mentions features as the attributes that help a model make predictions, such as the shape, color, and texture of apples in an image recognition task. Features are the basis for the patterns that machine learning algorithms learn from.

💡Data Reduction

Data reduction is the process of minimizing the amount of data while retaining its informational content. The video script explains that not all data collected is valuable for a machine learning project. Reducing data involves removing irrelevant or redundant features to improve the model's performance and efficiency.
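
A minimal Python sketch of this idea, assuming a hypothetical hotel-booking table like the one described in the video (a constant 'country' column and a 'year_of_birth' column that duplicates 'age'):

    import pandas as pd

    # Hypothetical hotel data: 'country' has (near-)zero variance,
    # and 'year_of_birth' carries the same information as 'age'.
    df = pd.DataFrame({
        "age": [34, 45, 29, 52],
        "year_of_birth": [1987, 1976, 1992, 1969],
        "country": ["US", "US", "US", "US"],
        "rooms_booked": [2, 5, 1, 3],
    })

    # Drop columns with a single unique value (zero variance) ...
    constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
    # ... and columns that duplicate information already present elsewhere.
    redundant_cols = ["year_of_birth"]

    reduced = df.drop(columns=constant_cols + redundant_cols)
    print(reduced.columns.tolist())  # ['age', 'rooms_booked']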

💡Data Cleansing

Data cleansing, also known as data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. The video script points out that data sets are often incomplete or contain errors, and cleansing is necessary to ensure that the data fed into the model is accurate and reliable.
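
To make this concrete, here is a small, hypothetical Python sketch (not from the video) that treats impossible values as missing and then imputes the gaps with simple statistics:

    import numpy as np
    import pandas as pd

    # Hypothetical data set with empty cells and one corrupted record.
    df = pd.DataFrame({
        "rooms_booked": [2, np.nan, 1, 3, np.nan],
        "nightly_rate": [120.0, 95.0, -1.0, 150.0, 110.0],  # -1.0 is clearly invalid
    })

    # Mark impossible values as missing, then impute: the median for counts,
    # the mean for a continuous attribute.
    df.loc[df["nightly_rate"] < 0, "nightly_rate"] = np.nan
    df["rooms_booked"] = df["rooms_booked"].fillna(df["rooms_booked"].median())
    df["nightly_rate"] = df["nightly_rate"].fillna(df["nightly_rate"].mean())

    print(df)  # no missing or negative values remain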

💡Data Wrangling

Data wrangling is the process of transforming and mapping data from its raw form into another format that can be more easily consumed for analysis. The video script uses the example of formatting data into a consistent format and normalizing data attributes to ensure that the model can accurately interpret and learn from the data.
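
A brief Python sketch of both steps, assuming a hypothetical 'bookings.xls' file with a 'state' column and a local installation of pandas plus an Excel reader engine (e.g. xlrd or openpyxl):

    import pandas as pd

    # Formatting: convert a spreadsheet into a plain-text CSV file.
    df = pd.read_excel("bookings.xls")   # hypothetical source file
    df.to_csv("bookings.csv", index=False)

    # Consistency: different source systems may encode the same value differently,
    # e.g. 'Florida' vs. 'FL'. Pick one standard and map everything onto it.
    state_map = {"Florida": "FL", "florida": "FL", "Fla.": "FL"}
    df["state"] = df["state"].replace(state_map)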

💡Feature Engineering

Feature engineering is the process of using domain knowledge to select or construct features that make machine learning algorithms work. The video script discusses how feature engineering can involve creating new features from existing data to better represent the problem at hand, such as decomposing datetime information into date and time features for a hotel room demand prediction model.
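
Here is a minimal Python sketch of that decomposition, using hypothetical booking timestamps (the column names are illustrative, not from the video):

    import pandas as pd

    # Hypothetical booking timestamps in their 'native' date-time form.
    df = pd.DataFrame({
        "booking_datetime": pd.to_datetime([
            "2021-12-24 22:15:00",
            "2021-12-25 08:40:00",
            "2021-07-03 23:05:00",
        ])
    })

    # Decompose the single timestamp into separate numerical features,
    # so the model can pick up date-related and time-related patterns independently.
    df["booking_month"] = df["booking_datetime"].dt.month
    df["booking_day"] = df["booking_datetime"].dt.day
    df["booking_hour"] = df["booking_datetime"].dt.hour

    print(df.drop(columns="booking_datetime"))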

💡Normalization

Normalization is a data preprocessing technique that rescales the data to a common scale, often between 0 and 1. The video script explains that normalization is necessary to ensure that all features contribute equally to the model's predictions, using the example of adjusting the scales of different financial figures to prevent some from dominating due to their larger magnitude.
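
The min-max approach described in the video can be written in a few lines of Python. The sketch below reuses the video's turkey-revenue figures ($1,500 minimum, $13,000 maximum); it is a simplified illustration rather than a production implementation.

    def min_max_normalize(values):
        """Rescale a list of numbers to the 0.0-1.0 range (min-max normalization)."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    # Daily revenue figures from the video's Thanksgiving example.
    revenue = [1500, 2700, 7000, 13000]
    print([round(v, 2) for v in min_max_normalize(revenue)])
    # [0.0, 0.1, 0.48, 1.0] -- 2,700 maps to about 0.1 and 7,000 to about 0.5,
    # matching the rounded figures quoted in the video.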

Highlights

Amazon's experimental ML recruitment tool was designed to score job applicants' resumes.

The tool was found to be biased, favoring male candidates over female ones.

The project was shut down in 2018 due to its sexist tendencies.

The bias was attributed to a faulty dataset used for training the model.

Data quality is critical for the success of machine learning projects.

There's no fixed formula for the optimal size of a training dataset.

Google's Smart Reply feature was trained on 238 million sample messages.

Google Translate required trillions of examples for its development.

I-Cheng Yeh from Tamkang University used a dataset of only 630 samples to accurately predict concrete strength.

The size of training data depends on the complexity of the project.

Data quality is as important as its quantity for effective machine learning.

Data should be relevant and adequate for the task at hand.

Data preparation includes labeling, which is crucial for supervised learning.

Labeling data is akin to teaching a child to recognize objects.

Features are the measurable characteristics that describe data to a model.

Data reduction and cleansing are essential steps in data preparation.

Dimensionality reduction helps improve model performance by focusing on relevant features.

Sampling can be used to manage large datasets and address imbalanced class distributions.

Data cleansing involves dealing with incomplete, corrupted, or inaccurate data.

Data wrangling transforms raw data into a format that is understandable for a model.

Normalization ensures that all features have equal importance in the model.

Feature engineering involves creating new features that can improve model efficiency.

Data preparation is a critical and time-consuming step in machine learning projects.

The quality of training data directly impacts the accuracy and fairness of AI models.

Transcripts

00:00

In 2014, Amazon started working on its experimental ML-driven recruitment tool. Similar to the Amazon rating system, the hiring tool was supposed to give job applicants scores ranging from one to five stars when screening resumes for the best candidates. Yeah, the idea was great, but it seemed that the machine learning model only liked men. It penalized all resumes containing the word "women's," as in "women's softball team captain." In 2018, Reuters broke the news that Amazon eventually had shut down the project. Now, the million-dollar question: how come Amazon's machine learning model turned out to be sexist? A) AI goes rogue, B) inexperienced data scientists, C) faulty data set, D) Alexa gets jealous. The correct answer is C, faulty data set. Not exclusively, of course, but data is one of the main factors determining whether ML projects will succeed or fail. In the case of Amazon, models were trained on 10 years' worth of resumes submitted to the company, for the most part by men. So here's another million-dollar question: how is data prepared for machine learning?

01:18

All the magic begins with planning and formulating the problem that needs to be solved with the help of machine learning, pretty much the same as with any other business decision. Then you start constructing a training data set and stumble on the first rock: how much data is enough to train a good model? Just a couple of samples, thousands of them, or even more? The thing is, there's no one-size-fits-all formula to help you calculate the right size of data set for a machine learning model. Here, many factors play their role, from the problem you want to address to the learning algorithm you apply within the model. The simple rule of thumb is to collect as much data as possible, because it's difficult to predict which and how many data samples will bring the most value. In simple words, there should be a lot of training data. Well, "a lot" sounds a bit too vague, right? Here are a couple of real-life examples for a better understanding. You know Gmail from Google, right? Its Smart Reply suggestions save time for users, generating short email responses right away. To make that happen, the Google team collected and pre-processed a training set that consisted of 238 million sample messages, with and without responses. As for Google Translate, it took trillions of examples for the whole project. But it doesn't mean you also need to strive for these huge numbers. I-Cheng Yeh, a Tamkang University professor, used a data set consisting of only 630 data samples. With them, he successfully trained a neural network model to accurately predict the compressive strength of high-performance concrete. As you can see, the size of training data depends on the complexity of the project in the first place. At the same time, it is not only the size of the data set that matters but also its quality.

03:06

What can be considered quality data? The good old principle "garbage in, garbage out" states that a machine learns exactly what it's taught. Feed your model inaccurate or poor-quality data, and no matter how great the model is, how experienced your data scientists are, or how much money you spend on the project, you won't get any decent results. Remember Amazon? That's what we're talking about. Okay, it seems that the solution to the problem is kind of obvious: avoid the "garbage in" part and you're golden. But it's not that easy. Say you need to forecast turkey sales during the Thanksgiving holidays in the U.S., but the historical data you're about to train your model on encompasses only Canada. You may think: Thanksgiving here, Thanksgiving there, what's the difference? To start with, Canadians don't make that big of a fuss about turkey; the bird suffers an embarrassing loss in the battle to pumpkin pies. Also, the holiday isn't observed nationwide, not to mention that Canada celebrates Thanksgiving in October, not November. Chances are such data is just inadequate for the U.S. market. This example shows how important it is to ensure not only the high quality of data but also its adequacy to the set task. Then the selected data has to be transformed into the most digestible form for a model, so you need data preparation.

04:29

For instance, in supervised machine learning you inevitably go through a process called labeling. This means you show a model the correct answers to the given problem by leaving corresponding labels within a data set. Labeling can be compared to how you teach a kid what apples look like: first you show pictures and say that these are, well, apples; then you repeat the procedure. When the kid has seen enough pictures of different apples, the kid will be able to distinguish apples from other kinds of fruit. Okay, what if it's not a kid that needs to detect apples in pictures but a machine? The model needs some measurable characteristics that will describe the data to it. Such characteristics are called features. In the case of apples, the features that differentiate apples from other fruit in images are their shape, color, and texture, to name a few. Just like the kid, when the model has seen enough examples of the features it needs to predict, it can apply learned patterns and decide on new data inputs on its own. When it comes to images, humans must label them manually for the machine to learn from. Of course, there are some tricks, like what Google does with their reCAPTCHA. Yeah, just so you know, you've been helping Google build its database for years, every time you proved you weren't a robot. But labels can already be available in the data. For instance, if you're building a model to predict whether a person is going to repay a loan, you'd have the loan repayment and bankruptcy history anyway. It's all so cool and easy in an ideal world. In practice, there may be issues like mislabeled data samples. Getting back to our apple recognition example: well, you see that a third of the training images show peaches marked as apples. If you leave it like that, the model will think that peaches are apples too, and that's not the result you're looking for. So it makes sense to have several people double-check or cross-label the data set.

06:28

Of course, labeling isn't the only procedure needed when preparing data for machine learning. One of the most crucial data preparation processes is data reduction and cleansing. Wait, what? Reduce data? Clean it? Shouldn't we collect all the data possible? Well, you do need to collect all possible data, but it doesn't mean that every piece of it carries value for your machine learning project. So you do the reduction to put only relevant data into your model. Picture this: you work for a hotel and want to build an ML model to forecast customer demand for twin and single rooms this year. You have a huge data set with different variables, like customer demographics and information on how many times each customer booked a particular hotel room last year. What you see here is just a tiny piece of a spreadsheet; in reality, there may be thousands of columns and rows. Let's imagine that the columns are dimensions in a 100-dimensional space, with rows of data as points within that space. It will be difficult to do, since we are used to three spatial dimensions, but each column is really a separate dimension here, and it's also a feature fed as input to a model. The thing is, when the number of dimensions is too big and some of them aren't very useful, the performance of machine learning algorithms can decrease. Logically, you need to reduce the number, right? That's what dimensionality reduction is about. For example, you can completely remove features that have zero or close-to-zero variance, like in the case of the country feature in our table: since all customers come from the US, the presence of this feature won't make much impact on the prediction accuracy. There's also redundant data, like the year-of-birth feature, as it presents the same info as the age variable. Why use both if it's basically a duplicate?

08:20

Another common pre-processing practice is sampling. Often you need to prototype solutions before actual production. If collected data sets are just too big, they can slow down the training process, as they require larger computational and memory resources and take more time for algorithms to run on. With sampling, you single out just a subset of examples for training instead of using the whole data set right away, speeding up the exploration and prototyping of solutions. Sampling methods can also be applied to solve the imbalanced data issue, involving data sets where the class representation is not equal. That's the problem Amazon had when building their tool: the training data was imbalanced, with the prevailing part of resumes submitted by men, making female resumes a minority class. The model would have provided less biased results if it had been trained on a data set sampled to a more equal class distribution prior to training.

09:17

What about cleaning them? Data sets are often incomplete, containing empty cells, meaningless records, or question marks instead of necessary values, not to mention that some data can be corrupted or just inaccurate. That needs to be fixed. It's better to feed a model with imputed data than leave blank spaces for it to speculate. As an example, you fill in missing values with selected constants or some predicted values based on other observations in the data set. As for corrupted or inaccurate data, you simply delete it from the set.

09:52

Okay, data is reduced and cleansed. Here comes another fun part: data wrangling. This means transforming raw data into a form that best describes the underlying problem to a model. The step may include such techniques as formatting and normalization. Well, these words sound too techy, but they aren't that scary. Data combined from multiple sources may not be in the format that fits your machine learning system best. For example, collected data comes in the XLS file format, but you need it to be in a plain-text format like .csv, so you perform formatting. In addition to that, you should make all data instances consistent throughout the data sets. Say, a state in one system could be "Florida," in another it could be "FL"; pick one and make it a standard.

10:44

You may also have different data attributes with numbers of different scales, presenting quantities like pounds, dollars, or sales volumes. For example, you need to predict how much turkey people will buy during this year's Thanksgiving holiday. Consider that your historical data contains two features: the number of turkeys sold and the amount of money received from the sales. But here's the thing: the turkey quantity ranges from 100 to 900 per day, while the amount of money ranges from 1,500 to 13,000. If you leave it like this, some models may consider that money values have higher importance to the prediction because they are simply bigger numbers. To ensure each feature has equal importance to model performance, normalization is applied. It helps unify the scale of figures, from say 0.0 to 1.0 for the smallest and largest value of a given feature. One of the classical ways to do that is the min-max normalization approach. For example, if we were to normalize the amount of money, the minimum value, 1,500, is transformed into a zero; the maximum value, 13,000, is transformed into one; and values in between become decimals. Say, 2,700 will be 0.1 and 7,000 will become 0.5. You get the idea.

12:04

Up until now, we've been talking about working with only those features already present in the data. Sometimes you deal with tasks that require the creation of new features. This is called feature engineering. For instance, we can split complex variables into parts that can be more useful for the model. Say you want to predict customer demand for hotel rooms, and in your data set you have date-time information in its native form that looks like this. You know that demand changes depending on days and months: you have more bookings during holidays and peak seasons. On top of that, your demand fluctuates depending on the specific time: say, you have more bookings at night and much fewer in the morning. If that's the case, both time and date information have their own predictive powers. To make the model more efficient, you can decompose the date from the time by creating two new numerical features, one for the date and the other for the time.

13:02

A machine learning model can only get as smart and accurate as the training data you're feeding it. It can't get biased on its own, it can't get sexist on its own, it can't get anything on its own. And while the unfitting data set wasn't the only reason for the Amazon AI project failure, it still owned the lion's share of the result. The truth is, there are no flawless data sets, but striving to make them flawless is the key to success. That's why data preparation is such a crucial step in the machine learning process, and that's why it takes up to 80 percent of every data science project's time. Speaking of projects, more information can be found in our videos about data science teams and data engineering. Thank you for watching.


Related tags: Machine Learning, Data Bias, AI Recruitment, Amazon AI, Data Science, Model Training, Data Preparation, Feature Engineering, Data Cleansing, Predictive Modeling