Gender Classification From Vocal Data (Using 2D CNNs) - Data Every Day #090
Summary
TLDR: In this video, the creator explores gender recognition from vocal features using a tabular dataset of acoustic measurements. First, a traditional two-hidden-layer neural network is trained to predict gender from these acoustic properties, reaching 98% accuracy and an AUC of 0.99. For fun, the creator then pads and reshapes each feature vector into a small 2D 'image' and applies a convolutional neural network (CNN). The CNN produces comparable results, but the simpler dense model remains slightly more accurate. The video highlights different machine learning techniques while experimenting with an unconventional method, ultimately encouraging viewers to explore creative approaches.
Takeaways
- 🎙️ The dataset used in this project is for gender recognition based on vocal data, where statistics such as vocal ranges are provided.
- 📊 The dataset includes various features like means, mins, maxes, and ranges, created from recorded voice samples to identify male or female voices based on acoustic properties.
- 🧠 The first model is a traditional neural network with two hidden layers used for prediction.
- 📉 The labels (male and female) were encoded using `LabelEncoder`, mapping 'male' to 1 and 'female' to 0.
- 🧪 The data is split into training and test sets, and features are standardized using `StandardScaler` to give each column a mean of 0 and unit variance.
- 💻 A simple dense neural network was created using TensorFlow: a 20-feature input layer, two hidden layers with 64 neurons each, and a single output neuron with a sigmoid activation function for binary classification.
- 🏆 The model achieved a high accuracy of 98% and an AUC of 0.99, indicating excellent performance in classifying the voices.
- 🧪 For experimentation, each 20-element feature vector was zero-padded to 25 values and reshaped into a 5x5 matrix, then passed through a Convolutional Neural Network (CNN), mainly out of curiosity.
- 📊 Even though the CNN approach was unconventional, it yielded decent results, with slightly lower accuracy (95%) but a slightly higher AUC, showing potential.
- 🚀 The creator suggests further exploration of CNNs and other tweaks to optimize results, but recognizes the simple two-hidden-layer neural network as the most effective solution for this dataset.
Q & A
What is the purpose of the dataset in the video?
-The dataset is designed for gender recognition by voice, where acoustic properties of recorded voice samples are analyzed to identify if a voice is male or female.
How does the video creator preprocess the labels in the dataset?
-The video creator uses the `LabelEncoder` from `sklearn` to transform the 'label' column, converting 'male' and 'female' into numerical values (0 for female, 1 for male).
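A minimal sketch of that encoding step (the DataFrame name `data` and the column name are assumptions based on the video):

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Replace the string labels with integers: 'female' -> 0, 'male' -> 1
data['label'] = label_encoder.fit_transform(data['label'])

# Map each encoded value back to its original class name
label_mapping = dict(enumerate(label_encoder.classes_))
print(label_mapping)  # {0: 'female', 1: 'male'}
```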
What is the main model used for gender prediction in the video?
-The main model used is a two-hidden-layer neural network implemented with TensorFlow and Keras. It takes vocal features as input and outputs a probability estimate for predicting gender.
Why does the video creator scale the data before training the model?
-The creator scales the data using `StandardScaler` to normalize the features, ensuring they have a mean of 0 and unit variance. This makes it easier for the neural network to learn by putting all features on a similar scale.
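A short sketch of the scaling step as described (the feature matrix name `X` is assumed):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the feature matrix and transform it so each column has
# mean 0 and unit variance
X = scaler.fit_transform(X)
```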
What results did the creator achieve with the initial neural network model?
-The initial neural network model achieved 98% accuracy and an AUC (Area Under the Curve) score of 0.99 on the test set.
What alternative approach did the video creator try, and why?
-The creator tried using a Convolutional Neural Network (CNN) by reshaping the vocal data into 2D 'image' matrices. This approach was mostly for experimentation and curiosity, as CNNs are typically used for image data.
How did the video creator handle the fact that the vocal feature vectors had only 20 elements when trying to use a CNN?
-Since 20 is not a perfect square, the creator padded the vectors with zeros to make them 25 elements long, which can then be reshaped into a 5x5 matrix for input into a CNN.
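A sketch of that padding step, assuming the scaled feature matrix is `X`; note the explicit float dtype, since the video shows that the default integer dtype strips the decimal values:

```python
import tensorflow as tf

# Pad each 20-element feature vector with trailing zeros up to length 25.
# dtype must be a float type, otherwise pad_sequences casts to integers.
X_padded = tf.keras.preprocessing.sequence.pad_sequences(
    X, maxlen=25, padding='post', dtype='float64')

print(X_padded.shape)  # (n_examples, 25)
```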
What was the result of using the CNN approach on this dataset?
-The CNN approach yielded slightly lower accuracy (95%) but a slightly higher AUC score than the simpler neural network, indicating that it performed well but did not surpass the simpler model in accuracy.
Why does the creator include an early stopping callback during training?
-Early stopping is used to monitor the validation loss and stop training when the loss stops improving for a few epochs, preventing overfitting and saving the best weights during training.
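A sketch of the callback and training call as described (names such as `model`, `X_train`, and `y_train` are assumptions):

```python
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',        # watch the validation loss
    patience=3,                # stop if it fails to improve for 3 epochs
    restore_best_weights=True  # roll back to the best epoch's weights
)

history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[early_stopping]
)
```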
What visualization technique did the creator use to display the newly structured data as 'images'?
-The creator used `matplotlib` to display the reshaped 5x5 matrices as images. These images represented the vocal data after padding, with zero values displayed as a solid color.
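A minimal sketch of that visualization, assuming `X_images` holds the padded arrays of shape (n, 5, 5, 1):

```python
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))
for i in range(9):
    plt.subplot(3, 3, i + 1)
    # Drop the trailing channel dimension before plotting
    plt.imshow(np.squeeze(X_images[i]))
    plt.axis('off')
plt.show()
```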
Outlines
🎤 Introduction to Gender Recognition by Voice Dataset
The video begins with an introduction to a gender recognition task using a dataset that includes vocal statistics. The dataset is composed of acoustic properties of voice samples, which are analyzed to classify them as either male or female. The speaker explains that the goal is to use vocal features, such as means, minimums, and maximums, to predict gender, and outlines the plan to experiment with two neural networks: a traditional multi-layer neural network and a convolutional neural network (CNN).
🛠️ Preparing the Dataset and Encoding Labels
The second section dives into data preprocessing. The speaker uses Pandas to load the dataset and examines the columns, noting that the data is already numerical. They highlight that the label column needs encoding and use the LabelEncoder from Scikit-learn to convert gender labels (male and female) into binary values (0 for female, 1 for male). Additionally, they confirm that there are no missing values in the dataset and demonstrate how to map encoded labels back to their original categories.
📊 Scaling and Splitting the Dataset
This part discusses scaling the data to make it easier for the model to learn. A StandardScaler is applied to ensure all features have mean 0 and unit variance. The data is then split into training and test sets using the train_test_split function from Scikit-learn, with 70% of the data used for training and the rest for testing. The speaker also confirms that the feature set consists of 20 features with 3,168 examples in total.
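A short sketch of the split described above (70% training split and a random state of 42, as mentioned in the video):

```python
from sklearn.model_selection import train_test_split

# 70% of the examples go to training, the rest to testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)
```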
🧠 Building a Basic Neural Network
In this section, the speaker outlines the process of building a traditional neural network using TensorFlow. The model consists of an input layer (20-dimensional feature vector), two hidden dense layers with 64 neurons each, and an output layer with a single neuron using a sigmoid activation function to predict gender. The speaker compiles the model using the Adam optimizer, binary cross-entropy loss, and accuracy and AUC as metrics. The model is trained for up to 100 epochs, with early stopping on the validation loss to prevent overfitting.
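A sketch of that architecture in Keras' functional style; the exact layer arguments are assumptions reconstructed from the narration:

```python
import tensorflow as tf

# Dense network: 20 input features -> 64 -> 64 -> 1 (sigmoid)
inputs = tf.keras.Input(shape=(20,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)
model.summary()
```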
🎯 Evaluating the First Model
After training the neural network, the model is evaluated on the test set, achieving an impressive 98% accuracy and 0.99 AUC. The speaker expresses satisfaction with the performance, noting that further tweaks might slightly improve the model. Despite the excellent results, the speaker decides to experiment with a CNN, acknowledging that the existing model is already highly effective.
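A sketch of the evaluation step (metric ordering follows the compile call above):

```python
# Evaluate on the held-out test set; returns [loss, accuracy, auc]
results = model.evaluate(X_test, y_test)
print(dict(zip(model.metrics_names, results)))
```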
🖼️ Experimenting with 2D CNN by Reshaping Data
Here, the speaker explores the idea of using a 2D Convolutional Neural Network (CNN) by reshaping the 20-dimensional feature vectors into 5x5 matrices (padding the vectors with zeros to create square images). This approach, inspired by a Kaggle competition, allows the model to treat the data as images. After resolving some technical issues with padding and data types, the speaker successfully reshapes the data into 5x5 matrices, ready for use in a CNN.
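A sketch of that restructuring, assuming `X_padded` holds the zero-padded 25-element vectors:

```python
import numpy as np

# 25-element padded vectors -> 5x5 "images"
X_images = X_padded.reshape(-1, 5, 5)
# Add a single channel dimension so the shape becomes (n, 5, 5, 1),
# matching the usual image-input convention for Conv2D layers
X_images = np.expand_dims(X_images, axis=3)
print(X_images.shape)  # e.g. (3168, 5, 5, 1)
```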
🏗️ Building the CNN Model
A new CNN model is built using two Conv2D layers, each followed by a max-pooling layer. The Conv2D layers use 16 and 32 filters respectively, with small kernel sizes of 2 and 1, since larger kernels caused negative-dimension errors on the tiny 5x5 inputs. After flattening the output from the convolutional layers, the speaker applies a dense layer at the end to output the prediction. Despite some initial issues with kernel sizes and pooling, the model is constructed and compiled successfully.
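A sketch of a CNN in that spirit; filter counts and kernel sizes follow the narration, while the default 2x2 pooling and the sigmoid output are assumptions:

```python
import tensorflow as tf

# Small CNN for the 5x5x1 "images"; kernels are kept tiny because the
# inputs are so small that larger kernels/pooling collapse the dimensions
inputs = tf.keras.Input(shape=(5, 5, 1))
x = tf.keras.layers.Conv2D(16, kernel_size=2, activation='relu')(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(32, kernel_size=1, activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

cnn_model = tf.keras.Model(inputs=inputs, outputs=outputs)
cnn_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)
```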
🧪 Evaluating the CNN and Comparing Results
After training the CNN, the speaker evaluates its performance. While the CNN achieves slightly lower accuracy (95%) than the traditional neural network, it shows a slightly higher AUC score. The speaker reflects on how the CNN approach is unconventional but effective, and considers further experiments, such as using rectangular images instead of padding to a square. They acknowledge that the simpler neural network still performs better overall but appreciate the exploratory value of the CNN approach.
👋 Wrapping Up and Final Thoughts
In the final section, the speaker summarizes the key takeaways from the video. They express excitement about the results from both models, particularly the excellent performance of the basic neural network. Despite the exploratory nature of the CNN experiment, it also yielded promising results. The speaker encourages viewers to subscribe for more content and thanks them for watching, concluding with a positive farewell.
Keywords
💡Gender recognition by voice
💡Neural network
💡Convolutional neural network (CNN)
💡Label encoding
💡Data scaling
💡Train-test split
💡Binary cross-entropy
💡Early stopping
💡AUC (Area Under the Curve)
💡Max pooling
Highlights
Introduction of a gender recognition dataset based on vocal range data.
The dataset includes vocal statistics such as means, mins, maxes, and ranges derived from recorded voice samples.
Plan to predict gender using neural networks, including a two-hidden-layer dense network and a convolutional neural network (CNN).
Initial setup includes loading essential libraries like NumPy, Pandas, Matplotlib, and TensorFlow for model building.
Data preprocessing: Encoding labels (male/female) into binary form (0 or 1) using Scikit-learn’s LabelEncoder.
Scaling features using StandardScaler to ensure all columns have a mean of 0 and unit variance for easier model learning.
Initial model: A simple two-hidden-layer neural network using TensorFlow, achieving 98% accuracy in gender prediction.
The architecture of the initial neural network is explained, with a focus on dense layers and sigmoid activation to predict probabilities.
Introduction of early stopping with TensorFlow’s callback to avoid overfitting by monitoring validation loss.
The first neural network achieves outstanding results: 98% accuracy and an AUC of 0.99, highlighting the efficiency of the model.
For experimentation, the plan is to transform the feature vectors into 2D image-like data to apply CNNs.
Reshaping the 20-feature vector into a 5x5 matrix, using padding to create a square matrix for CNN input.
Building a CNN with two convolutional layers and pooling layers to see how the model performs on the transformed data.
Results of the CNN approach show a 95% accuracy and slightly higher AUC compared to the traditional neural network.
The CNN approach, despite being unconventional for this type of data, shows promising results with near-excellent performance.
Transcripts
[Music]
hi guys
welcome back to data everyday um today
we're looking at
a gender recognition by voice data set
i mean this is really the task the data
set is
um they're records from different people
and these are statistics about
vocal ranges and
it's basically vocal data you can see we
have
a lot of uh like means mins maxes
ranges it says here this database was
created to identify a voice as male or
female based upon acoustic properties
of the voice and speech so it actually
comes from
uh recorded voice samples which were
then analyzed
and these features were
created from them
so let's get into the notebook uh what
i'm going to try to do is like
um as as it would like us to do
try to predict the gender of a given
person based on
these vocal features and we're going to
use
i wrote two different neural networks um
one a cnn and this is mainly just for
fun
uh what we're going to do is uh we're
first going to just use a traditional
two hidden layer
neural network and then we're going to
try something else we're going to
restructure the data
into like two dimensional images
and we're going to try to use a
convolutional neural network on that
alright so let's get started i have
numpy pandas and matplotlib
the essentials then i have for
pre-processing
label encoder standard scaler and the
train test split function from sklearn
and then i have tensorflow which i'm
also going to use a number of uh
functions from keras module
tensorflow all right let's load in the
data
using pandas.read_csv
and we get the file path up here
voice.csv
just copy that paste it in and take a
look
and you notice first thing is that we
can't see all the columns so i'm going
to go into the console
and write pandas dot set option
max columns
none and that will give us all the
columns
and you can see they're already in
numerical form because
these features were created from voice
samples
and we just noticed that the label
column needs to be encoded
before we go any further we should check
if we have any missing values although i
doubt we will
yeah no missing values 3168
entries and you can see there's no
non-nulls
sorry no nulls in any of the columns
all right let's encode the labels
so i'm just going to use sklearns a
label encoder for this
which uh we just create a new object and
then
the column we want to encode is called
label so data sub
label equals label encoder
dot fit transform
data sub label and that will just
change so change it so that uh male and
female are assigned zero or one
and we'll run that and we can actually
take a look at um
you can take a look at which values were
mapped to which
by enumerating label encoder
dot classes underscore and when we
enumerate it we can then turn it into a
dictionary to get a mapping
of what went to which and you can see
zero went to female one went to male
or the other way around female went to zero
and male went to one
so you can see if we were to look at
data now
we have ones and zeros whereas before we
had male and female
okay let's split and scale the data
so we're going to split it into x and y
y is what we're trying to predict
just so just our label column data sub
label
and i'll make a deep copy of it and x is
going to be everything except
label so we're going to drop it from
axis one
make a copy of that and now we have
split our data into two
sections y is just a vector x is a
matrix
and let's create a scalar
standard scaler this is
a scaler from sklearn that will give
each column
in x mean 0 and unit variance
so all of the columns will take on a
similar range of values
it makes it easier for our model to
learn
so x equals scaler dot fit transform
x simple as that
now if we look at x we no longer have a
data frame but you can see
the values have been scaled so that they
all take around
they all have mean zero and most of the
values lie
in the negative one to one range
okay so now we'll split it
uh horizontally
or vertically i don't know how you said
would you want to call it but uh what i
mean is let's get a training test set
so uh x train and x test y train y test
equals uh train test split x y so this
function from sklearn we'll just
split our xy into a training test set
we'll give a train size of 70
why don't i include a random state as
well how about uh
42
all right now we have four different
sets for data
and we can begin modeling and training
so let's take a look at our feature data
so
we have 20 features and 3168 examples
i'm going to start building a tensorflow
neural network just the most standard
architecture which is start with
a dense layer sorry an input
and we'll pass in the shape of
a single feature sorry single feature
vector will be
20 a vector of length 20. so i can
access that with
x dot shape sub 1 and put the comma to
indicate a vector
then x equals tf.keras.layers.dense
there's a dense layer we'll give it 64
activations
and a relu activation function
pass it in inputs and i'm going to copy
that and make a second one but pass an
x so it's going to go through two hidden
layers very standard
and then given outputs which will be
another dense layer but it will only
output one
value which will be
a probability estimate so sigmoid scales
it to between zero and one
so that we get a probability for how
likely a given person is male
all right we'll create our model which
will be tf.model
and we're passing inputs and outputs all
right
so let's take a look we'll use
model.summary to see
what our how our shape is changing we
start off with a feature
vector of length 20.
it gets uh it goes
to the dense layer the first dense layer
which has 64
nodes and then those 64 nodes get
connected to the next 64 nodes
and those final 64 nodes all
are there's a linear combination that
returns a single value
from all 64. and that single value is
if it's over 0.5 we'll say it's male and
if it's under 0.5 we'll say female
so let's uh compile our model
so we'll give an optimizer of adam
uh for loss we'll give binary cross
entropy
and metrics how about we include
accuracy and auc auc is just uh
much better at uh
it considers performance within each
class rather than just
pure how much how well did we do
across all examples so we'll give that a
name auc
all right then i'm going to fit the
model and store it store the history of
the fit
in history it's a model.fit
we're fitting on the train set so
x train and y train
i'll give it a validation split of 20
percent
a batch size of 32 and we'll train for
100 epochs
because i choose such a high number of
epochs because i'm also going to include
a callback function
just tf.keras dot callbacks
dot early stopping this allows us to
monitor a value in this case validation
loss
and when we notice that the loss stops
improving or stops decreasing
we will wait for a certain number of
epochs
say three and if it's still
still not decreasing after three epochs
we're going to stop the training
and restore the weights from the best
epoch
so restore best weights equals true
all right we'll run that and should stop
after some number
i think nine let's see how we did
model dot evaluate x test
y test so we evaluate on the test set we
have an accuracy of 98 percent
and an auc of 0.99 so absolutely
fantastic
um really couldn't hope for a better
performance than this
perhaps we could actually improve if we
just tweak a few things
make it perfect but i'm not going to
spend too much time this video doing
that
i would like to actually try a different
approach
now i do not expect i'll just say this
before i do it
i do not expect this approach to yield
greater performance this is absolutely
fantastic i have no reason to change
this
if i were really caring about getting
the best performing model i'd probably
keep something simple like this
but i want to try to use 2d cnn's just
for
just for fun just to see if we can do
this
and what i mean is so a two-dimensional
convolutional
layer takes in an image essentially
or a matrix of pixel
data it doesn't have to be pixel data
actually it just has to be
a two-dimensional matrix and
uh if you don't know the math behind
convolutional layers you should go
check them out very cool basically it
just slides this little
um it it
it takes data from the image
from little sections of the image i'm
not going to go into it
in detail but um basically what we're
going to do is we're going to
reformat our ex our sequences
well here let me show you what i mean
here's x right
uh if i view it as a data
frame
let me just take a look better look at
it it the
x is basically each example
is a sequence of 20 values
right and it's in a one-dimensional
vector
now we could reshape this vector so that
we can stack it into
like a square and then use that square
as the two-dimensional
image that we can feed into our
convolutional network
i got this idea from someone on kaggle
was talking about how they
they got um some interesting results
using this in a competition i can't
remember exactly but
this is just an idea i had let's see
let's see how it goes
so how am i going to do this so we have
to work with x
x is our our feature information but
currently it's in this
this format of just these long uh
vectors
so each example is a vector of length
20.
so what i want to think of is if i
wanted to make
that into a square uh i'm going to need
it to be like
for example yes it will be the length of
the original vector has to be a perfect
square
and the next highest perfect square from
20 is 25
obviously if we need the we need it to
be of equal dimension for it to be a
square
so if i look at the shape of x
currently it's of length 20 and what i can
do is pad
the sequences using
tf.keras.preprocessing.sequence.pad_sequences
and this uh function will take uh
let's pass an x and we'll set a max
length
to 25 and i'm going to say padding
equals post
what this will do is take all of our um
our 20 or our vectors of length 20
and add five zeros to end of each one
so if i look at the shape of this you
can see it's the same but there's
five extra values at the end and if i
wanted to
look at this as a data frame so i'll
take shape off the end
you can see uh
oh very interesting hmm
so i didn't know about this actually
we're losing some information here
i had no idea it it turns them into
integers and that we can't have that
absolutely can't um so we're gonna have
to figure out how
to avoid this is there a way let me look
this up
pad sequences
is there a way to keep it from doing
that
dtype oh yeah let's specify dtype
dtype equals numpy.float
okay that's good alright we're good to
go
awesome so maybe i'll get some better
performance than than what i had before
because i didn't realize that it was uh
stripping off the decimal values all
right
so this is going to be it's exactly the
same as before but we have these five
extra zeros at the end
and the reason for adding those five is
because now
let's just take this data frame view off
so let's make that the new x so now x
has these extra zeros at the end
we can reshape it now
to keep the same number of elements in
the first dimension
but change the other dimensions to be
five by five
and if you look at that um let's just
get the shape of that
it is now instead of uh 3168
by 25 the 25 has been restructured into
5x5
arrays so um
all right let's let's see
there's one last thing usually um so
that's
let's make that new x take shape off
and then the last thing to do is um
usually
image data has an extra dimension to
represent the number of color channels
so right now the shape of x is uh
3168 by five by five i want to make a
3168 by five
by five by one and to do that i can
use numpy dot expand dimensions
expand dims
on x and the axis we want to expand
across is 3
which will just be the fourth axis there
so run that now if we look at the shape
we have this extra dimension
and we have a nice image format
let me just put these same block
and we have this uh yeah okay so let's
actually take a look at these as images
so because they're in an image form we
can actually view them
as images so
let's create a new map plot lib figure
give it a fix size
i don't know it could be anything to a
12 by 12 sounds good
and then for i in range nine so i'm just
going to display nine of the images
we're going to create a new subplot in a
three by three grid
indexed by i plus one and then
we'll use plt dot image show or imshow
of x sub i
and that will give us the first image of
of
size five by five by one actually i'm
pretty sure
we need to squeeze this i think
imshow doesn't like the extra one
so let's do numpy.squeeze
just to get rid of that extra one we're
still going to use it in the original x
but uh
for the imshow function we want to
get rid of it and then i'll just turn
off the axis
on the side the axis marks
all right and then plt.show
and these are our new
feature images and you'll notice the
line across the bottom
is always zero so we always have it as a
as a solid color because these are
our pad zeros um
so this is fantastic right
uh
hmm what does it mean
so this is actually just a uh
the the brighter the color the higher
the value i believe
um actually so zero is like the solid
color
and then darker colors are negative and
brighter colors are positive
uh and so each one of these squares
actually
is represented it is representing
one of these values so there's like 20
different values like this across and
each one of those is represented by a
new square now
so it looks like colors to us but to the
to the algorithm it's still just
numerical data so
we'll be able to feed this into our
model
all right so now we have a new x uh
let's create a new x train x test
y train y test
train test split x y
train size of 70 and same random
state as before
42. all right and now let's build
a new model so before our model looks
like this right let's copy this down
into here and oops
instead of using just two hidden layers
like that
i'm going to use the standard cnn
architecture which is
going to use a we'll do a convolutional
so conv 2d layer
that takes a
the number of filters which will make 16
and then the size of a kernel which will
make three
and an activation function which will be
relu
this will take in inputs and then we'll
have
a max pooling layer
max pooling 2d
and we'll pass an x to that and so i'm
going to copy this three times
whoops
i make sure these are x and this is
going to be 32
and this will be 64.
then when we're done we'll flatten it
with a flatten layer
and we'll feed it through a final dense
layer at the end
all right so let's see how this goes
uh we have problem incompatible
oh i didn't pass anything in here
no we're still getting a problem conv
2d 2 is incompatible with layer
oh i specified the wrong shape here so
no longer are we dealing with just a
single vector we
need three values five by five by one
so shape would be five by five by one
but i'll just represent it in terms of x
dot shape
x dot shape sub one x dot shape sub
two and x dot shape
sub three
oh what is this negative dimension
uh
oh wait
let me try just copy and pasting what i
have before
okay um give me one second
okay i figured i figured out what i did
wrong um i just had
um i my kernel sizes were just way too
big
um for the pooling that was going on
we are our images are so small that we
can't we were trying to pool into
negative dimensions
so uh don't worry about that we're just
going to keep it with two convolutional
layers
a kernel size of two here kernel size of
one here and see how that works
uh so right so we're taking in this
image and we're going to run through the
convolutional layer max pooling layer
convolutional layer max pooling layer then
we'll flatten it out into a single
vector
apply it through a last dense layer and
then give our output
so if we look at uh here you can see the
shape
converges down to it starts like this
sort of goes up a little comes back down
into
one all right so uh let's
let's now train okay so i'm going to
grab
this code and put it over here
i just make sure everything's sort of
similar i don't think we have to change
anything
it should be the same all right i'll run
that
and let's evaluate the model when we're
done
and we actually get some pretty good
results
um so before this was just a standard
uh network we got a 98 percent accuracy and 0.997
auc
here we actually have a higher auc and
the lower accuracy by
only three percent um which is you know
that's pretty good
like um this is higher than i got before
i guess because i didn't realize i was
cutting out the uh float
values here um i realize
we may also be able to do this without
padding the zeros on the bottom
i'm not sure if that will contribute
uh or will give us better results
we could just keep a rectangular image
and instead of making it a perfect
square
uh maybe a four by five image
but yeah so i mean this gives some pretty
good results especially
uh considering how sort of like
unconventional this method
seems it seems like this still wins
just a better performance but
i would love to look into this more and
figure out if there is a way to make
this
more effective than the standard two
hidden layer
boring sequential model but you know
this is very cool
i hope you think it is also this is
going to wrap up today's video
thank you so much for watching i hope
you enjoyed the video
if you did make sure to subscribe and
hit the bell for more content
and leave any comments you have in the
section below i'll see you guys tomorrow
have a fantastic day