Gender Classification From Vocal Data (Using 2D CNNs) - Data Every Day #090
Summary
TLDR: In this video, the creator explores gender recognition from vocal features using a tabular dataset of acoustic measurements. First, a traditional two-hidden-layer neural network is trained to predict gender from these acoustic properties, reaching 98% accuracy and an AUC of 0.99. For fun, the creator then pads and reshapes each feature vector into a small 2D 'image' and applies a convolutional neural network (CNN). The CNN produces comparable results, but the simpler dense model remains slightly more accurate. The video highlights different machine learning techniques while experimenting with an unconventional method, ultimately encouraging viewers to explore creative approaches.
Takeaways
- 🎙️ The dataset used in this project is for gender recognition based on vocal data, where statistics such as vocal ranges are provided.
- 📊 The dataset includes various features like means, mins, maxes, and ranges, created from recorded voice samples to identify male or female voices based on acoustic properties.
- 🧠 The first model is a traditional neural network with two hidden layers used for prediction.
- 📉 The labels (male and female) were encoded using `LabelEncoder`, mapping 'male' to 1 and 'female' to 0.
- 🧪 The data is split into training and test sets, and features are standardized using `StandardScaler` to give each column a mean of 0 and unit variance.
- 💻 A simple dense neural network was created using TensorFlow: a 20-feature input layer, two hidden layers with 64 neurons each, and a single output neuron with a sigmoid activation function for binary classification.
- 🏆 The model achieved a high accuracy of 98% and an AUC of 0.99, indicating excellent performance in classifying the voices.
- 🧪 For experimentation, each 20-element feature vector was zero-padded to 25 values and reshaped into a 5x5 matrix, then passed through a Convolutional Neural Network (CNN), mainly out of curiosity.
- 📊 Even though the CNN approach was unconventional, it yielded decent results, with slightly lower accuracy (95%) but a slightly higher AUC, showing potential.
- 🚀 The creator suggests further exploration of CNNs and other tweaks to optimize results, but recognizes the simple two-hidden-layer neural network as the most effective solution for this dataset.
Q & A
What is the purpose of the dataset in the video?
-The dataset is designed for gender recognition by voice, where acoustic properties of recorded voice samples are analyzed to identify if a voice is male or female.
How does the video creator preprocess the labels in the dataset?
-The video creator uses the `LabelEncoder` from `sklearn` to transform the 'label' column, converting 'male' and 'female' into numerical values (0 for female, 1 for male).
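A minimal sketch of that encoding step (the DataFrame name `data` and the column name are assumptions based on the video):

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Replace the string labels with integers: 'female' -> 0, 'male' -> 1
data['label'] = label_encoder.fit_transform(data['label'])

# Map each encoded value back to its original class name
label_mapping = dict(enumerate(label_encoder.classes_))
print(label_mapping)  # {0: 'female', 1: 'male'}
```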
What is the main model used for gender prediction in the video?
-The main model used is a two-hidden-layer neural network implemented with TensorFlow and Keras. It takes vocal features as input and outputs a probability estimate for predicting gender.
Why does the video creator scale the data before training the model?
-The creator scales the data using `StandardScaler` to normalize the features, ensuring they have a mean of 0 and unit variance. This makes it easier for the neural network to learn by putting all features on a similar scale.
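A short sketch of the scaling step as described (the feature matrix name `X` is assumed):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the feature matrix and transform it so each column has
# mean 0 and unit variance
X = scaler.fit_transform(X)
```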
What results did the creator achieve with the initial neural network model?
-The initial neural network model achieved 98% accuracy and an AUC (Area Under the Curve) score of 0.99 on the test set.
What alternative approach did the video creator try, and why?
-The creator tried using a Convolutional Neural Network (CNN) by reshaping the vocal data into 2D 'image' matrices. This approach was mostly for experimentation and curiosity, as CNNs are typically used for image data.
How did the video creator handle the fact that the vocal feature vectors had only 20 elements when trying to use a CNN?
-Since 20 is not a perfect square, the creator padded the vectors with zeros to make them 25 elements long, which can then be reshaped into a 5x5 matrix for input into a CNN.
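A sketch of that padding step, assuming the scaled feature matrix is `X`; note the explicit float dtype, since the video shows that the default integer dtype strips the decimal values:

```python
import tensorflow as tf

# Pad each 20-element feature vector with trailing zeros up to length 25.
# dtype must be a float type, otherwise pad_sequences casts to integers.
X_padded = tf.keras.preprocessing.sequence.pad_sequences(
    X, maxlen=25, padding='post', dtype='float64')

print(X_padded.shape)  # (n_examples, 25)
```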
What was the result of using the CNN approach on this dataset?
-The CNN approach yielded slightly lower accuracy (95%) but a slightly higher AUC score than the simpler neural network, indicating that it performed well but did not surpass the simpler model in accuracy.
Why does the creator include an early stopping callback during training?
-Early stopping is used to monitor the validation loss and stop training when the loss stops improving for a few epochs, preventing overfitting and saving the best weights during training.
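A sketch of the callback and training call as described (names such as `model`, `X_train`, and `y_train` are assumptions):

```python
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',        # watch the validation loss
    patience=3,                # stop if it fails to improve for 3 epochs
    restore_best_weights=True  # roll back to the best epoch's weights
)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[early_stopping]
)
```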
What visualization technique did the creator use to display the newly structured data as 'images'?
-The creator used `matplotlib` to display the reshaped 5x5 matrices as images. These images represented the vocal data after padding, with zero values displayed as a solid color.
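A minimal sketch of that visualization, assuming `X_images` holds the padded arrays of shape (n, 5, 5, 1):

```python
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))
for i in range(9):
    plt.subplot(3, 3, i + 1)
    # Drop the trailing channel dimension before plotting
    plt.imshow(np.squeeze(X_images[i]))
    plt.axis('off')
plt.show()
```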
Outlines
🎤 Introduction to Gender Recognition by Voice Dataset
The video begins with an introduction to a gender recognition task using a dataset that includes vocal statistics. The dataset is composed of acoustic properties of voice samples, which are analyzed to classify them as either male or female. The speaker explains that the goal is to use vocal features, such as means, minimums, and maximums, to predict gender, and outlines the plan to experiment with two neural networks: a traditional multi-layer neural network and a convolutional neural network (CNN).
🛠️ Preparing the Dataset and Encoding Labels
The second section dives into data preprocessing. The speaker uses Pandas to load the dataset and examines the columns, noting that the data is already numerical. They highlight that the label column needs encoding and use the LabelEncoder from Scikit-learn to convert gender labels (male and female) into binary values (0 for female, 1 for male). Additionally, they confirm that there are no missing values in the dataset and demonstrate how to map encoded labels back to their original categories.
📊 Scaling and Splitting the Dataset
This part discusses scaling the data to make it easier for the model to learn. A StandardScaler is applied to ensure all features have mean 0 and unit variance. The data is then split into training and test sets using the train_test_split function from Scikit-learn, with 70% of the data used for training and the rest for testing. The speaker also confirms that the feature set consists of 20 features with 3,168 examples in total.
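A short sketch of the split described above (70% training split and a random state of 42, as mentioned in the video):

```python
from sklearn.model_selection import train_test_split

# 70% of the examples go to training, the rest to testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)
```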
🧠 Building a Basic Neural Network
In this section, the speaker outlines the process of building a traditional neural network using TensorFlow. The model consists of an input layer (20-dimensional feature vector), two hidden dense layers with 64 neurons each, and an output layer with a single neuron using a sigmoid activation function to predict gender. The speaker compiles the model using the Adam optimizer, binary cross-entropy loss, and accuracy and AUC as metrics. The model is trained for up to 100 epochs, with early stopping on the validation loss to prevent overfitting.
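A sketch of that architecture in Keras' functional style; the exact layer arguments are assumptions reconstructed from the narration:

```python
import tensorflow as tf

# Dense network: 20 input features -> 64 -> 64 -> 1 (sigmoid)
inputs = tf.keras.Input(shape=(20,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)
model.summary()
```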
🎯 Evaluating the First Model
After training the neural network, the model is evaluated on the test set, achieving an impressive 98% accuracy and 0.99 AUC. The speaker expresses satisfaction with the performance, noting that further tweaks might slightly improve the model. Despite the excellent results, the speaker decides to experiment with a CNN, acknowledging that the existing model is already highly effective.
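A sketch of the evaluation step (metric ordering follows the compile call above):

```python
# Evaluate on the held-out test set; returns [loss, accuracy, auc]
results = model.evaluate(X_test, y_test)
print(dict(zip(model.metrics_names, results)))
```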
🖼️ Experimenting with 2D CNN by Reshaping Data
Here, the speaker explores the idea of using a 2D Convolutional Neural Network (CNN) by reshaping the 20-dimensional feature vectors into 5x5 matrices (padding the vectors with zeros to create square images). This approach, inspired by a Kaggle competition, allows the model to treat the data as images. After resolving some technical issues with padding and data types, the speaker successfully reshapes the data into 5x5 matrices, ready for use in a CNN.
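A sketch of that restructuring, assuming `X_padded` holds the zero-padded 25-element vectors:

```python
import numpy as np

# 25-element padded vectors -> 5x5 "images"
X_images = X_padded.reshape(-1, 5, 5)
# Add a single channel dimension so the shape becomes (n, 5, 5, 1),
# matching the usual image-input convention for Conv2D layers
X_images = np.expand_dims(X_images, axis=3)
print(X_images.shape)  # e.g. (3168, 5, 5, 1)
```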
🏗️ Building the CNN Model
A new CNN model is built using two Conv2D layers, each followed by a max-pooling layer. The Conv2D layers use 16 and 32 filters respectively, with small kernel sizes of 2 and 1, since larger kernels caused negative-dimension errors on the tiny 5x5 inputs. After flattening the output from the convolutional layers, the speaker applies a dense layer at the end to output the prediction. Despite some initial issues with kernel sizes and pooling, the model is constructed and compiled successfully.
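A sketch of a CNN in that spirit; filter counts and kernel sizes follow the narration, while the default 2x2 pooling and the sigmoid output are assumptions:

```python
import tensorflow as tf

# Small CNN for the 5x5x1 "images"; kernels are kept tiny because the
# inputs are so small that larger kernels/pooling collapse the dimensions
inputs = tf.keras.Input(shape=(5, 5, 1))
x = tf.keras.layers.Conv2D(16, kernel_size=2, activation='relu')(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(32, kernel_size=1, activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

cnn_model = tf.keras.Model(inputs=inputs, outputs=outputs)
cnn_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)
```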
🧪 Evaluating the CNN and Comparing Results
After training the CNN, the speaker evaluates its performance. While the CNN achieves slightly lower accuracy (95%) than the traditional neural network, it shows a slightly higher AUC score. The speaker reflects on how the CNN approach is unconventional but effective, and considers further experiments, such as using rectangular images instead of padding to a square. They acknowledge that the simpler neural network still performs better overall but appreciate the exploratory value of the CNN approach.
👋 Wrapping Up and Final Thoughts
In the final section, the speaker summarizes the key takeaways from the video. They express excitement about the results from both models, particularly the excellent performance of the basic neural network. Despite the exploratory nature of the CNN experiment, it also yielded promising results. The speaker encourages viewers to subscribe for more content and thanks them for watching, concluding with a positive farewell.
Keywords
💡Gender recognition by voice
💡Neural network
💡Convolutional neural network (CNN)
💡Label encoding
💡Data scaling
💡Train-test split
💡Binary cross-entropy
💡Early stopping
💡AUC (Area Under the Curve)
💡Max pooling
Highlights
Introduction of a gender recognition dataset based on vocal range data.
The dataset includes vocal statistics such as means, mins, maxes, and ranges derived from recorded voice samples.
Plan to predict gender using neural networks, including a two-hidden-layer dense network and a convolutional neural network (CNN).
Initial setup includes loading essential libraries like NumPy, Pandas, Matplotlib, and TensorFlow for model building.
Data preprocessing: Encoding labels (male/female) into binary form (0 or 1) using Scikit-learn’s LabelEncoder.
Scaling features using StandardScaler to ensure all columns have a mean of 0 and unit variance for easier model learning.
Initial model: A simple two-hidden-layer neural network using TensorFlow, achieving 98% accuracy in gender prediction.
The architecture of the initial neural network is explained, with a focus on dense layers and sigmoid activation to predict probabilities.
Introduction of early stopping with TensorFlow’s callback to avoid overfitting by monitoring validation loss.
The first neural network achieves outstanding results: 98% accuracy and an AUC of 0.99, highlighting the efficiency of the model.
For experimentation, the plan is to transform the feature vectors into 2D image-like data to apply CNNs.
Reshaping the 20-feature vector into a 5x5 matrix, using padding to create a square matrix for CNN input.
Building a CNN with two convolutional layers and pooling layers to see how the model performs on the transformed data.
Results of the CNN approach show a 95% accuracy and slightly higher AUC compared to the traditional neural network.
The CNN approach, despite being unconventional for this type of data, shows promising results with near-excellent performance.
Transcripts
[Music]
hi guys
welcome back to data everyday um today
we're looking at
a gender recognition by voice data set
i mean this is really the task the data
set is
um they're records from different people
and these are statistics about
vocal ranges and
it's basically vocal data you can see we
have
a lot of uh like means mins maxes
ranges it says here this database was
created to identify a voice as male or
female based upon acoustic properties
of the voice and speech so it actually
comes from
uh recorded voice samples which were
then analyzed
and these features were
created from them
so let's get into the notebook uh what
i'm going to try to do is like
um as as it would like us to do
try to predict the gender of a given
person based on
these vocal features and we're going to
use
i wrote two different neural networks um
one a cnn and this is mainly just for
fun
uh what we're going to do is uh we're
first going to just use a traditional
two hidden layer
neural network and then we're going to
try something else we're going to
restructure the data
into like two dimensional images
and we're going to try to use a
convolutional neural network on that
alright so let's get started i have
numpy pandas and matplotlib
the essentials then i have for
pre-processing
label encoder standard scaler and the
train test split function from sklearn
and then i have tensorflow which i'm
also going to use a number of uh
functions from keras module
tensorflow all right let's load in the
data
using pandas.read_csv
and we get the file path up here
voice.csv
just copy that paste it in and take a
look
and you notice first thing is that we
can't see all the columns so i'm going
to go into the console
and write pandas dot set option
max columns
none and that will give us all the
columns
and you can see they're already in
numerical form because
these features were created from voice
samples
and we just noticed that the label
column needs to be encoded
before we go any further we should check
if we have any missing values although i
doubt we will
yeah no missing values 3168
entries and you can see there's no
non-nulls
sorry no nulls in any of the columns
all right let's encode the labels
so i'm just going to use sklearns a
label encoder for this
which uh we just create a new object and
then
the column we want to encode is called
label so data sub
label equals label encoder
dot fit transform
data sub label and that will just
change so change it so that uh male and
female are assigned zero or one
and we'll run that and we can actually
take a look at um
you can take a look at which values were
mapped to which
by enumerating label encoder
dot classes underscore and when we
enumerate it we can then turn it into a
dictionary to get a mapping
of what went to which and you can see
zero went to female one went to male
or the other way around female went to zero
and male went to one
so you can see if we were to look at
data now
we have ones and zeros whereas before we
had male and female
okay let's split and scale the data
so we're going to split it into x and y
y is what we're trying to predict
just so just our label column data sub
label
and i'll make a deep copy of it and x is
going to be everything except
label so we're going to drop it from
axis one
make a copy of that and now we have
split our data into two
sections y is just a vector x is a
matrix
and let's create a scalar
standard scaler this is
a scaler from sklearn that will give
each column
in x mean 0 and unit variance
so all of the columns will take on a
similar range of values
it makes it easier for our model to
learn
so x equals scaler dot fit transform
x simple as that
now if we look at x we no longer have a
data frame but you can see
the values have been scaled so that they
all take around
they all have mean zero and most of the
values lie
in the negative one to one range
okay so now we'll split it
uh horizontally
or vertically i don't know how you said
would you want to call it but uh what i
mean is let's get a training test set
so uh x train and x test y train y test
equals uh train test split x y so this
function from sklearn we'll just
split our xy into a training test set
we'll give a train size of 70
why don't i include a random state as
well how about uh
42
all right now we have four different
sets for data
and we can begin modeling and training
so let's take a look at our feature data
so
we have 20 features and 3168 examples
i'm going to start building a tensorflow
neural network just the most standard
architecture which is start with
a dense layer sorry an input
and we'll pass in the shape of
a single feature sorry single feature
vector will be
20 a vector of length 20. so i can
access that with
x dot shape sub 1 and put the comma to
indicate a vector
then x equals tf.keras.layers.dense
there's a dense layer we'll give it 64
activations
and a relu activation function
pass it in inputs and i'm going to copy
that and make a second one but pass an
x so it's going to go through two hidden
layers very standard
and then given outputs which will be
another dense layer but it will only
output one
value which will be
a probability estimate so sigmoid scales
it to between zero and one
so that we get a probability for how
likely a given person is male
all right we'll create our model which
will be tf.model
and we're passing inputs and outputs all
right
so let's take a look we'll use
model.summary to see
what our how our shape is changing we
start off with a feature
vector of length 20.
it gets uh it goes
to the dense layer the first dense layer
which has 64
nodes and then those 64 nodes get
connected to the next 64 nodes
and those final 64 nodes all
are there's a linear combination that
returns a single value
from all 64. and that single value is
if it's over 0.5 we'll say it's male and
if it's under 0.5 we'll say female
so let's uh compile our model
so we'll give an optimizer of adam
uh for loss we'll give binary cross
entropy
and metrics how about we include
accuracy and auc auc is just uh
much better at uh
it considers performance within each
class rather than just
pure how much how well did we do
across all examples so we'll give that a
name auc
all right then i'm going to fit the
model and store it store the history of
the fit
in history it's a model.fit
we're fitting on the train set so
x train and y train
i'll give it a validation split of 20
percent
a batch size of 32 and we'll train for
100 epochs
because i choose such a high number of
epochs because i'm also going to include
a callback function
just tf.keras dot callbacks
dot early stopping this allows us to
monitor a value in this case validation
loss
and when we notice that the loss stops
improving or stops decreasing
we will wait for a certain number of
epochs
say three and if it's still
still not decreasing after three epochs
we're going to stop the training
and restore the weights from the best
epoch
so restore best weights equals true
all right we'll run that and should stop
after some number
i think nine let's see how we did
model dot evaluate x test
y test so we evaluate on the test set we
have an accuracy of 98 percent
and an auc of 0.99 so absolutely
fantastic
um really couldn't hope for a better
performance than this
perhaps we could actually improve if we
just tweak a few things
make it perfect but i'm not going to
spend too much time this video doing
that
i would like to actually try a different
approach
now i do not expect i'll just say this
before i do it
i do not expect this approach to yield
greater performance this is absolutely
fantastic i have no reason to change
this
if i were really caring about getting
the best performing model i'd probably
keep something simple like this
but i want to try to use 2d cnn's just
for
just for fun just to see if we can do
this
and what i mean is so a two-dimensional
convolutional
layer takes in an image essentially
or a matrix of pixel
data it doesn't have to be pixel data
actually it just has to be
a two-dimensional matrix and
uh if you don't know the math behind
convolutional layers you should go
check them out very cool basically it
just slides this little
um it it
it takes data from the image
from little sections of the image i'm
not going to go into it
in detail but um basically what we're
going to do is we're going to
reformat our ex our sequences
well here let me show you what i mean
here's x right
uh if i view it as a data
frame
let me just take a look better look at
it it the
x is basically each example
is a sequence of 20 values
right and it's in a one-dimensional
vector
now we could reshape this vector so that
we can stack it into
like a square and then use that square
as the two-dimensional
image that we can feed into our
convolutional network
i got this idea from someone on kaggle
was talking about how they
they got um some interesting results
using this in a competition i can't
remember exactly but
this is just an idea i had let's see
let's see how it goes
so how am i going to do this so we have
to work with x
x is our our feature information but
currently it's in this
this format of just these long uh
vectors
so each example is a vector of length
20.
so what i want to think of is if i
wanted to make
that into a square uh i'm going to need
it to be like
for example yes it will be the length of
the original vector has to be a perfect
square
and the next highest perfect square from
20 is 25
obviously if we need the we need it to
be of equal dimension for it to be a
square
so if i look at the shape of x
currently it's of length 20 and what i can
do is pad
the sequences using
tf.keras.preprocessing.sequence.pad_sequences
and this uh function will take uh
let's pass an x and we'll set a max
length
to 25 and i'm going to say padding
equals post
what this will do is take all of our um
our 20 or our vectors of length 20
and add five zeros to end of each one
so if i look at the shape of this you
can see it's the same but there's
five extra values at the end and if i
wanted to
look at this as a data frame so i'll
take shape off the end
you can see uh
oh very interesting hmm
so i didn't know about this actually
we're losing some information here
i had no idea it it turns them into
integers and that we can't have that
absolutely can't um so we're gonna have
to figure out how
to avoid this is there a way let me look
this up
pad sequences
is there a way to keep it from doing
that
dtype oh yeah let's specify dtype
dtype equals numpy.float
okay that's good alright we're good to
go
awesome so maybe i'll get some better
performance than than what i had before
because i didn't realize that it was uh
stripping off the decimal values all
right
so this is going to be it's exactly the
same as before but we have these five
extra zeros at the end
and the reason for adding those five is
because now
let's just take this data frame view off
so let's make that the new x so now x
has these extra zeros at the end
we can reshape it now
to keep the same number of elements in
the first dimension
but change the other dimensions to be
five by five
and if you look at that um let's just
get the shape of that
it is now instead of uh 3168
by 25 the 25 has been restructured into
5x5
arrays so um
all right let's let's see
there's one last thing usually um so
that's
let's make that new x take shape off
and then the last thing to do is um
usually
image data has an extra dimension to
represent the number of color channels
so right now the shape of x is uh
3168 by five by five i want to make a
3168 by five
by five by one and to do that i can
use numpy dot expand dimensions
expand dims
on x and the axis we want to expand
across is 3
which will just be the fourth axis there
so run that now if we look at the shape
we have this extra dimension
and we have a nice image format
let me just put these same block
and we have this uh yeah okay so let's
actually take a look at these as images
so because they're in an image form we
can actually view them
as images so
let's create a new map plot lib figure
give it a fix size
i don't know it could be anything to a
12 by 12 sounds good
and then for i in range nine so i'm just
going to display nine of the images
we're going to create a new subplot in a
three by three grid
indexed by i plus one and then
we'll use plt dot image show or imshow
of x sub i
and that will give us the first image of
of
size five by five by one actually i'm
pretty sure
we need to squeeze this i think
imshow doesn't like the extra one
so let's do numpy.squeeze
just to get rid of that extra one we're
still going to use it in the original x
but uh
for the imshow function we want to
get rid of it and then i'll just turn
off the axis
on the side the axis marks
all right and then plt.show
and these are our new
feature images and you'll notice the
line across the bottom
is always zero so we always have it as a
as a solid color because these are
our pad zeros um
so this is fantastic right
uh
hmm what does it mean
so this is actually just a uh
the the brighter the color the higher
the value i believe
um actually so zero is like the solid
color
and then darker colors are negative and
brighter colors are positive
uh and so each one of these squares
actually
is represented it is representing
one of these values so there's like 20
different values like this across and
each one of those is represented by a
new square now
so it looks like colors to us but to the
to the algorithm it's still just
numerical data so
we'll be able to feed this into our
model
all right so now we have a new x uh
let's create a new x train x test
y train y test
train test split x y
train size of 70 and same random
state as before
42. all right and now let's build
a new model so before our model looks
like this right let's copy this down
into here and oops
instead of using just two hidden layers
like that
i'm going to use the standard cnn
architecture which is
going to use a we'll do a convolutional
so conv 2d layer
that takes a
the number of filters which will make 16
and then the size of a kernel which will
make three
and an activation function which will be
relu
this will take in inputs and then we'll
have
a max pooling layer
max pooling 2d
and we'll pass an x to that and so i'm
going to copy this three times
whoops
i make sure these are x and this is
going to be 32
and this will be 64.
then when we're done we'll flatten it
with a flatten layer
and we'll feed it through a final dense
layer at the end
all right so let's see how this goes
uh we have problem incompatible
oh i didn't pass anything in here
no we're still getting a problem conv
2d 2 is incompatible with layer
oh i specified the wrong shape here so
no longer are we dealing with just a
single vector we
need three values five by five by one
so shape would be five by five by one
but i'll just represent it in terms of x
dot shape
x dot shape sub one x dot shape sub
two and x dot shape
sub three
oh what is this negative dimension
uh
oh wait
let me try just copy and pasting what i
have before
okay um give me one second
okay i figured i figured out what i did
wrong um i just had
um i my kernel sizes were just way too
big
um for the pooling that was going on
we are our images are so small that we
can't we were trying to pool into
negative dimensions
so uh don't worry about that we're just
going to keep it with two convolutional
layers
a kernel size of two here kernel size of
one here and see how that works
uh so right so we're taking in this
image and we're going to run through the
convolutional layer max pooling layer
convolutional layer max pooling layer then
we'll flatten it out into a single
vector
apply it through a last dense layer and
then give our output
so if we look at uh here you can see the
shape
converges down to it starts like this
sort of goes up a little comes back down
into
one all right so uh let's
let's now train okay so i'm going to
grab
this code and put it over here
i just make sure everything's sort of
similar i don't think we have to change
anything
it should be the same all right i'll run
that
and let's evaluate the model when we're
done
and we actually get some pretty good
results
um so before this was just a standard
uh network we got a 98 percent accuracy and 0.997
auc
here we actually have a higher auc and
the lower accuracy by
only three percent um which is you know
that's pretty good
like um this is higher than i got before
i guess because i didn't realize i was
cutting out the uh float
values here um i realize
we may also be able to do this without
padding the zeros on the bottom
i'm not sure if that will contribute
uh or will give us better results
we could just keep a rectangular image
and instead of making it a perfect
square
uh maybe a four by five image
but yeah so i mean this gives some pretty
good results especially
uh considering how sort of like
unconventional this method
seems it seems like this still wins
just a better performance but
i would love to look into this more and
figure out if there is a way to make
this
more effective than the standard two
hidden layer
boring sequential model but you know
this is very cool
i hope you think it is also this is
going to wrap up today's video
thank you so much for watching i hope
you enjoyed the video
if you did make sure to subscribe and
hit the bell for more content
and leave any comments you have in the
section below i'll see you guys tomorrow
have a fantastic day