How is data prepared for machine learning?
Summary
TL;DR: The video script delves into Amazon's scrapped AI recruitment tool, highlighting how a faulty dataset led to gender bias. It underscores the pivotal role of data quality and preparation in machine learning, illustrating the importance of data quantity, relevance, labeling, and cleansing. The script also touches on data reduction, wrangling, and feature engineering, emphasizing that despite the challenges, meticulous data handling is crucial for successful ML projects.
Takeaways
- In 2014, Amazon developed an AI recruitment tool that was designed to score job applicants but was found to be biased against women, illustrating the risks of using machine learning on skewed datasets.
- The Amazon AI recruitment tool was shut down in 2018 due to its sexist tendencies, which were a result of being trained on a predominantly male dataset.
- The success of machine learning projects heavily relies on the quality and representativeness of the training data, as highlighted by the Amazon case.
- The amount of data needed for training a machine learning model can vary greatly, from hundreds to trillions of examples, depending on the complexity of the task.
- The principle 'garbage in, garbage out' is crucial in machine learning, emphasizing that poor quality data leads to poor model performance.
- Data preparation is a critical and time-consuming step in machine learning, often accounting for up to 80% of a data science project's time.
- Data scientists must transform raw data into a usable format through processes like labeling, reduction, cleansing, wrangling, and feature engineering.
- Labeling involves assigning correct answers to data samples, which is essential for supervised learning but can be prone to errors if not double-checked.
- Data cleansing is necessary to remove or correct corrupted, incomplete, or inaccurate data to prevent a model from learning incorrect patterns.
- Feature engineering involves creating new features from existing data, which can improve a model's predictive power by making the data more informative.
- The importance of data's relevance to the task at hand cannot be overstated, as using inappropriate data can lead to inaccurate and biased models.
Q & A
What was the purpose of Amazon's experimental ML-driven recruitment tool?
- Amazon's experimental ML-driven recruitment tool was designed to screen resumes and give job applicants scores ranging from one to five stars, similar to the Amazon rating system, to help identify the best candidates.
Why did Amazon's machine learning model for recruitment turn out to be biased?
- Amazon's machine learning model for recruitment became biased because it was trained on a dataset that predominantly consisted of resumes from men, which led the model to penalize resumes containing the word 'women's'.
What is the significance of data quality in machine learning projects?
- Data quality is crucial in machine learning projects as it directly influences the model's performance. The principle 'garbage in, garbage out' applies, meaning that feeding a model with inaccurate or poor quality data will result in poor outcomes, regardless of the model's sophistication or the data scientists' expertise.
What is the role of data preparation in machine learning?
- Data preparation is a critical step in machine learning, accounting for up to 80% of the time in a data science project. It involves transforming raw data into a form that best describes the underlying problem to a model and includes processes like labeling, data reduction, cleansing, wrangling, and feature engineering.
How does the size of the training dataset impact machine learning models?
- The size of the training dataset can significantly impact machine learning models. While there is no one-size-fits-all formula, generally, the more data collected, the better, as it is difficult to predict which data samples will bring the most value. However, the quality and relevance of the data are also crucial.
What is dimensionality reduction and why is it important in machine learning?
- Dimensionality reduction is the process of reducing the number of random variables under consideration, which can involve removing irrelevant features or combining features that contain similar information. It is important because it can improve the performance of machine learning algorithms by reducing complexity and the computational resources required.
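A minimal sketch of what such a reduction can look like in practice, assuming pandas and an invented hotel-guest table echoing the video's example (the column names and values below are hypothetical):

```python
import pandas as pd

# Hypothetical hotel-guest data: 'country' has zero variance and
# 'year_of_birth' duplicates the information already carried by 'age'.
df = pd.DataFrame({
    "age":                [34, 41, 29, 55],
    "year_of_birth":      [1990, 1983, 1995, 1969],
    "country":            ["US", "US", "US", "US"],
    "bookings_last_year": [3, 1, 4, 2],
})

# Drop features with zero (or near-zero) variance -- they carry no signal.
zero_variance = [col for col in df.columns if df[col].nunique() <= 1]

# Drop features that are redundant with another column (identified by hand here).
redundant = ["year_of_birth"]

reduced = df.drop(columns=zero_variance + redundant)
print(reduced.columns.tolist())  # ['age', 'bookings_last_year']
```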
Why is data labeling necessary in supervised machine learning?
- Data labeling is necessary in supervised machine learning because it provides the model with the correct answers to the given problem. By assigning corresponding labels within a dataset, the model learns to recognize patterns and make predictions on new, unseen data.
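A toy illustration of labeled data, borrowing the video's apples-versus-peaches example; the feature values and label names below are invented for illustration:

```python
# Each sample pairs measurable features with the correct answer (the label).
labeled_samples = [
    {"shape": "round", "color": "red",    "texture": "smooth", "label": "apple"},
    {"shape": "round", "color": "green",  "texture": "smooth", "label": "apple"},
    {"shape": "round", "color": "orange", "texture": "fuzzy",  "label": "peach"},
]

# During supervised training the model sees features together with labels;
# at prediction time it receives only the features.
for sample in labeled_samples:
    features = {k: v for k, v in sample.items() if k != "label"}
    print(features, "->", sample["label"])
```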
How can data cleansing help improve the performance of machine learning models?
- Data cleansing helps improve the performance of machine learning models by removing or correcting incomplete, corrupted, or inaccurate data. By ensuring that the data fed into the model is clean and accurate, the model can make more reliable predictions and avoid being misled by poor quality data.
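A small cleansing sketch, assuming pandas and invented records: missing cells are imputed from the rest of the column and an obviously corrupted row is dropped.

```python
import numpy as np
import pandas as pd

# Invented records with gaps and one corrupted value (-999 used as a junk marker).
df = pd.DataFrame({
    "rooms_booked": [2, np.nan, 3, 1],
    "nights":       [3, 2, np.nan, 4],
    "price":        [120.0, 95.0, -999.0, 110.0],
})

# Impute missing values with a value derived from the rest of the column
# (the mean here; a selected constant or a model-based prediction also works).
df["rooms_booked"] = df["rooms_booked"].fillna(df["rooms_booked"].mean())
df["nights"] = df["nights"].fillna(df["nights"].mean())

# Delete rows that are clearly corrupted or inaccurate.
df = df[df["price"] > 0]
print(df)
```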
What is feature engineering and how does it contribute to machine learning?
- Feature engineering is the process of using domain knowledge to select or construct features that make machine learning algorithms work. It contributes to machine learning by creating new features that can better represent the underlying problem, thus potentially improving the model's performance and accuracy.
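A minimal sketch of the date/time decomposition the video describes, assuming pandas and an invented booking table:

```python
import pandas as pd

# Invented booking records with a raw datetime column.
bookings = pd.DataFrame({
    "booked_at": pd.to_datetime([
        "2023-11-23 21:15:00",
        "2023-12-24 09:40:00",
        "2024-01-02 08:05:00",
    ]),
    "rooms": [2, 1, 3],
})

# Split the single datetime into separate numerical features so a model can
# pick up seasonal (month), weekly (day of week), and daily (hour) patterns.
bookings["month"] = bookings["booked_at"].dt.month
bookings["day_of_week"] = bookings["booked_at"].dt.dayofweek
bookings["hour"] = bookings["booked_at"].dt.hour

print(bookings.drop(columns="booked_at"))
```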
How does data normalization help in machine learning?
- Data normalization helps in machine learning by scaling the data to a common range, such as 0.0 to 1.0. This ensures that each feature contributes equally to the model's performance, preventing issues where features with larger numerical values might be considered more important than they actually are.
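A worked min-max normalization example in plain Python, reusing the turkey-revenue figures quoted in the video:

```python
def min_max_normalize(values):
    """Scale a list of numbers to the 0.0-1.0 range (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Daily revenue figures from the video's turkey example (dollars).
revenue = [1500, 2700, 7000, 13000]
print(min_max_normalize(revenue))
# [0.0, 0.104..., 0.478..., 1.0] -- roughly the 0.1 and 0.5 quoted in the video
```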
What are some challenges faced during data preparation for machine learning?
- Challenges faced during data preparation for machine learning include determining the right amount of data, ensuring data quality and relevance, dealing with imbalanced datasets, handling missing or corrupted data, and the time-consuming nature of the process. Addressing these challenges is key to the success of machine learning projects.
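One of these challenges, imbalanced classes, is often addressed by resampling. Below is a minimal sketch of random undersampling with invented counts that mirror the Amazon story; it is one of several possible remedies, not the method Amazon actually used:

```python
import random

# Invented, deliberately imbalanced dataset: 900 majority vs 100 minority samples.
majority = [{"label": "resume_from_man"}] * 900
minority = [{"label": "resume_from_woman"}] * 100

# Random undersampling: shrink the majority class so both classes are equally
# represented before training.
random.seed(42)
balanced = random.sample(majority, len(minority)) + minority
random.shuffle(balanced)

print(len(balanced), "training samples,",
      sum(s["label"] == "resume_from_woman" for s in balanced), "from the minority class")
```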
Outlines
The Pitfalls of Amazon's AI Recruitment Tool
Amazon's experimental machine learning recruitment tool, which aimed to score job applicants similarly to its rating system, was discontinued due to gender bias. The model was trained on a decade's worth of resumes, predominantly from men, leading to skewed results. This highlights the critical importance of data quality in machine learning, as a faulty dataset can lead to significant issues. The video emphasizes the need for diverse and representative data to prevent algorithmic bias.
The Crucial Role of Data in Machine Learning
The video discusses the preparatory steps for machine learning, starting with defining the problem and collecting a training dataset. It stresses that there's no universal formula for the optimal dataset size, as it depends on various factors including the problem's complexity and the learning algorithm used. Examples are given, such as Google's Gmail smart replies (trained on 238 million sample messages) and Google Translate (trillions of examples). The video also points out that data quantity and quality are both essential, with the latter being particularly important to avoid 'garbage in, garbage out' scenarios.
Data Preparation Techniques for Machine Learning
This section delves into the intricacies of data preparation for machine learning, including labeling, reduction, cleansing, and wrangling. Labeling involves assigning correct answers to examples, akin to teaching a child. Data reduction and cleansing involve removing irrelevant or corrupt data to enhance model performance. The video also touches on dimensionality reduction to simplify complex data and sampling to manage large datasets. It concludes by emphasizing the importance of data preparation in machine learning, which can consume up to 80% of a data science project's time.
Keywords
- Machine Learning
- Bias
- Dataset
- Data Preparation
- Labeling
- Feature
- Data Reduction
- Data Cleansing
- Data Wrangling
- Feature Engineering
- Normalization
Highlights
Amazon's experimental ML recruitment tool was designed to score job applicants' resumes.
The tool was found to be biased, favoring male candidates over female ones.
The project was shut down in 2018 due to its sexist tendencies.
The bias was attributed to a faulty dataset used for training the model.
Data quality is critical for the success of machine learning projects.
There's no fixed formula for the optimal size of a training dataset.
Google's Smart Reply feature was trained on 238 million sample messages.
Google Translate required trillions of examples for its development.
I-Cheng Yeh from Tamkang University used a smaller dataset of just 630 samples to predict concrete strength.
The size of training data depends on the complexity of the project.
Data quality is as important as its quantity for effective machine learning.
Data should be relevant and adequate for the task at hand.
Data preparation includes labeling, which is crucial for supervised learning.
Labeling data is akin to teaching a child to recognize objects.
Features are the measurable characteristics that describe data to a model.
Data reduction and cleansing are essential steps in data preparation.
Dimensionality reduction helps improve model performance by focusing on relevant features.
Sampling can be used to manage large datasets and address imbalanced class distributions.
Data cleansing involves dealing with incomplete, corrupted, or inaccurate data.
Data wrangling transforms raw data into a format that is understandable for a model.
Normalization ensures that all features have equal importance in the model.
Feature engineering involves creating new features that can improve model efficiency.
Data preparation is a critical and time-consuming step in machine learning projects.
The quality of training data directly impacts the accuracy and fairness of AI models.
Transcripts
in 2014 amazon started working on its
experimental ml driven recruitment tool
similar to the amazon rating system the
hiring tool was supposed to give job
applicants scores ranging from one to
five stars when screening resumes for
the best candidates
yeah the idea was great but
it seemed that the machine learning
model only liked men
it penalized all resumes containing the
word women's as in women's softball team
captain
in 2018 reuters broke the news that
amazon eventually had shut down the
project
now the million-dollar question
how come amazon's machine learning model
turned out to be sexist
a) ai goes rogue
b) inexperienced data scientists
c) faulty data set
d) alexa gets jealous
the correct answer is c
faulty dataset
not exclusively of course yet data is one
of the main factors determining whether
ml projects will succeed or fail in the
case of amazon models were trained on 10
years worth of resumes submitted to the
company for the most part by men
so here's another million dollar
question
how is data prepared for machine
learning
all the magic begins with planning and
formulating the problem that needs to be
solved with the help of machine learning
pretty much the same as with any other
business decision
then you start constructing a training
data set and
stumble on the first rock
how much data is enough to train a good
model
just a couple samples
thousands of them or even more
the thing is there's no
one-size-fits-all formula to help you
calculate the right size of data set for
a machine learning model here many
factors play their role from the problem
you want to address to the learning
algorithm you apply within the model
the simple rule of thumb is to collect
as much data as possible because it's
difficult to predict which and how many
data samples will bring the most value
in simple words there should be a lot of
training data well a lot sounds a bit
too vague right
here are a couple of real-life examples
for a better understanding
you know gmail from google right
its smart reply suggestions save time
for users generating short email
responses right away to make that happen
the google team collected and
pre-processed the training set that
consisted of 238 million sample messages
with and without responses as far as
google translate it took trillions of
examples for the whole project
but it doesn't mean you also need to
strive for these huge numbers
i-cheng yeh a tamkang university
professor used the data set consisting
of only 630 data samples with them he
successfully trained the model of a
neural network to accurately predict the
compressive strength of high performance
concrete as you can see the size of
training data depends on the complexity
of the project in the first place at the
same time it is not only the size of the
data set that matters but also its
quality
what can be considered as quality data
the good old principle garbage in
garbage out states a machine learns
exactly what it's taught feed your model
inaccurate or poor quality data and no
matter how great the model is how
experienced your data scientists are or
how much money you spend on the project
you won't get any decent results
remember amazon that's what we're
talking about
okay it seems that the solution to the
problem is kind of obvious avoid the
garbage in part and you're golden but
it's not that easy
say you need to forecast turkey sales
during the thanksgiving holidays in the
u.s but the historical data you're about
to train your model on encompasses only
canada you may think thanksgiving here
thanksgiving there what's the difference
to start with canadians don't make that
big of a fuss about turkey the bird
suffers an embarrassing loss in the
battle to pumpkin pies also the holiday
isn't observed nationwide not to mention
that canada celebrates thanksgiving in
october not november chances are such
data is just inadequate for the u.s
market this example shows how important
it is to ensure not only the high
quality of data but also its adequacy to
the set task
then the selected data has to be
transformed into the most digestible
form for a model so you need data
preparation
for instance in supervised machine
learning you inevitably go through a
process called labeling
this means you show a model the correct
answers to the given problem by leaving
corresponding labels within a data set
labeling can be compared to how you
teach a kid what apples look like
first you show pictures and you see that
these are well apples then you repeat
the procedure when the kid has seen
enough pictures of different apples the
kid will be able to distinguish apples
from other kinds of fruit
okay what if it's not a kid that needs
to detect apples and pictures but a
machine the model needs some measurable
characteristics that will describe data
to it such characteristics are called
features
in the case of apples the features that
differentiate apples from other fruit on
images are their shape color and texture
to name a few just like the kid when the
model has seen enough examples of the
features it needs to predict it can
apply learned patterns and decide on new
data inputs on its own
when it comes to images humans must
label them manually for the machine to
learn from
of course there are some tricks like
what google does with their recaptcha
yeah just so you know you've been
helping google build its database for
years every time you proved you weren't
a robot
but labels can be already available in
data for instance if you're building a
model to predict whether a person is
going to repay a loan you'd have the
loan repayment and bankruptcy history
anyway
it's so cool and easy in an ideal world
in practice there may be issues like
mislabeled data samples
getting back to our apple recognition
example
well you see that a third of the
training images shows peaches marked as
apples if you leave it like that the
model will think that peaches are apples
too
and that's not the result you're looking
for so it makes sense to have several
people double check or cross-label the
data set
of course labeling isn't the only
procedure needed when preparing data for
machine learning one of the most crucial
data preparation processes is data
reduction and cleansing
wait what reduce data clean it shouldn't
we collect all the data possible
well you do need to collect all possible
data but it doesn't mean that every
piece of it carries value for your
machine learning project so you do the
reduction to put only relevant data in
your model picture this you work for a
hotel and want to build an ml model to
forecast customer demand for twin and
single rooms this year you have a huge
data set with different variables like
customer demographics and information on
how many times each customer booked a
particular hotel room last year what you
see here is just a tiny piece of a
spreadsheet
in reality there may be thousands of
columns and rows
let's imagine that the columns are
dimensions on the 100 dimensional space
with rows of data as points within that
space
it will be difficult to do since we are
used to three space dimensions but each
column is really a separate dimension
here and it's also a feature fed as
input to a model the thing is when the
number of dimensions is too big and some
of those aren't very useful the
performance of the machine learning
algorithms can decrease
logically you need to reduce the number
right
that's what dimensionality reduction is
about
for example you can completely remove
features that have zero or close to zero
variance
like in the case of the country feature
in our table since all customers come
from the us the presence of this feature
won't make much impact on the prediction
accuracy there's also redundant data
like the year of birth feature as it
presents the same info as the age
variable why use both if it's basically
a duplicate
another common pre-processing practice
is sampling
often you need to prototype solutions
before actual production
if collected data sets are just too big
they can slow down the training process
as they require larger computational and
memory resources and take more time for
algorithms to run on with sampling you
single out just a subset of examples for
training instead of using the whole data
set right away speeding the exploration
and prototyping of solutions
sampling methods can also be applied to
solve the imbalanced data issue
involving data sets where the class
representation is not equal
that's the problem amazon had when
building their tool
the training data was imbalanced with
the prevailing part of resumes submitted
by men making female resumes a minority
class
the model would have provided less
biased results if it had been trained on
a sampled training data set with a more
equal class distribution made prior to training
what about cleaning them
data sets are often incomplete
containing empty cells meaningless
records or question marks instead of
necessary values
not to mention that some data can be
corrupted or just inaccurate
that needs to be fixed it's better to
feed a model with imputed data than
leave blank spaces for it to speculate
as an example you fill in missing values
with selected constants or some
predicted values based on other
observations in the data set
as far as corrupted or inaccurate data
you simply delete it from a set
okay data is reduced and cleansed here
comes another fun part data wrangling
this means transforming raw data into a
form that best describes the underlying
problem to a model the step may include
such techniques as formatting and
normalization
well these words sound too
techy but they aren't that scary
combining data from multiple sources may
not be in a format that fits your
machine learning system best for example
collected data comes in xls file format
but you need it to be in plain text
formats like dot csv so you perform
formatting
in addition to that you should make all
data instances consistent throughout the
data sets
say a state in one system could be
florida in another it could be fl pick
one and make it a standard
you may have different data attributes
with numbers of different scales
presenting quantities like pounds
dollars or sales volumes for example you
need to predict how much turkey people
will buy during this year's thanksgiving
holiday consider that your historical
data contains two features the number of
turkeys sold and the amount of money
received from the sales
but here's the thing the turkey quantity
ranges from 100 to 900 per day while the
amount of money ranges from 1500 to
13,000 if you leave it like this some
models may consider that money values
have higher importance to the prediction
because they are simply bigger numbers
to ensure each feature has equal
importance to model performance
normalization is applied
it helps unify the scale of figures from
say 0.0 to 1.0 for the smallest and
largest value of a given feature one of
the classical ways to do that is the min
max normalization approach
for example if we were to normalize the
amount of money the minimum value 1500
is transformed into a zero the maximum
value 13,000 is transformed into one
values in between become decimals say
2700 will be 0.1 and 7,000 will become
0.5 you get the idea
up until now we've been talking about
working with only those features already
present in data sometimes you deal with
tasks that require the creation of new
features
this is called feature engineering
for instance we can split complex
variables into parts that can be more
useful for the model say you want to
predict customer demand for hotel rooms
in your data set you have date time
information in its native form that
looks like this
you know that demand changes depending
on days and months you have more
bookings during holidays and peak
seasons on top of that your demand
fluctuates depending on specific time
say you have more bookings at night and
much fewer in the morning if that's the
case both time and date information have
their own predictive powers to make the
model more efficient you can decompose
the date from the time by creating two
new numerical features one for the date
and the other for the time
a machine learning model can only get as
smart and accurate as the training data
you're feeding it it can't get biased on
its own it can't get sexist on its own
it can't get anything on its own
and while the unfitting data set wasn't
the only reason for the amazon ai
project failure it still owned a lion's
share of the result
the truth is there are no flawless data
sets but striving to make them flawless
is the key to success that's why data
preparation is such a crucial step in
the machine learning process and that's
why it takes up to 80 percent of every
data science project's time
speaking of projects more information
can be found in our videos about data
science teams and data engineering
thank you for watching