In this article, we are going to build a Gender Recognition by Voice project using Python and machine learning. ML models are now used across many sectors, including real-time recognition tasks. One such task is recognizing a person's gender from their voice, which is exactly what we will build today. The complete code is included at the end of the article. Let's get started!
Gender Recognition by Voice using Python: Project Overview
| Project Name: | Gender Recognition by Voice using Python |
| --- | --- |
| Abstract: | An ML-based recognition system that identifies the gender of a speaker from his/her voice. |
| Language/s Used: | Python and Machine Learning libraries |
| IDE: | Google Colab (recommended) |
| Python version (Recommended): | Python 3.x |
| Database: | Not required |
| Type: | Program |
| Recommended for: | Final Year Students |
Steps involved in Gender Recognition by Voice
The workflow for building the Gender Recognition by Voice project in Python is as follows:
- Collection of data
- Exploring the data
- Audio feature extraction
- Splitting the data
- Building the model
- Evaluating the model
1. Collection of data
The very first step is to choose a dataset for our model. Kaggle hosts many different datasets; you just need to sign in and search for whatever your project requires. The dataset we are going to use here is the Common Voice dataset hosted on Kaggle.
The entire dataset is pretty large, at more than 12 GB. We can use the Kaggle notebook feature to work with the data without downloading it all to local storage. Since the total size of the dataset is huge, we will load only the data from the cv-valid-train folder, whose corresponding audio details are stored in the cv-valid-train.csv file.
2. Exploring the data
Now that we have decided which data files to work with, we can dig deeper into the dataset. To start, we load all the required modules for this project. Before that, we have to install the python_speech_features module.
# Install python_speech_features module
!pip install python_speech_features
# Import all modules
import os
import librosa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from pydub import AudioSegment
from python_speech_features import mfcc
from time import time
# Load the csv file into data frame
df = pd.read_csv('../input/common-voice/cv-valid-train.csv')
Then we load the data into our project. Once all modules and the dataset are loaded, we can check the first 5 rows of the dataset using df.head().
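As a quick sanity check (a minimal sketch using standard pandas calls, not part of the original notebook), we can also count the missing values per column and look at the class distribution of the gender column:
# Preview the first 5 rows
print(df.head())
# Count missing (NaN) values per column
print(df.isnull().sum())
# Inspect the class distribution of the target column
print(df['gender'].value_counts())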
We can see that the dataset has plenty of missing values, including in the gender column itself. Since the values in the gender column are going to be our target, it is important to filter out the NaN values in that column. To do so, we create two new data frames, df_male and df_female, which store the male and female voice details respectively.
There is another problem: the dataset is extremely imbalanced. We can address this by under-sampling, which is done by taking only a small portion of the available data such that the class distribution becomes equal.
# Create two new data frames
df_male = df[df['gender']=='male']
df_female = df[df['gender']=='female']
# Find out the number of rows
print(df_male.shape)
# output: (55029, 8)
print(df_female.shape)
# output: (18249, 8)
# Take only 300 male and 300 female samples
df_male = df_male[:300]
df_female = df_female[:300]
Notice that the audio file names in our dataset have the mp3 extension. We need to convert these files to wav because the Librosa module cannot read digital signals stored in mp3 format. To do that, we define a function called convert_to_wav() that uses the AudioSegment object from the Pydub module.
# Define the audio path
TRAIN_PATH = '../input/common-voice/cv-valid-train/'
# The function to convert mp3 to wav
def convert_to_wav(df, m_f, path=TRAIN_PATH):
    for file in tqdm(df['filename']):
        # Read the mp3 file from the dataset folder
        sound = AudioSegment.from_mp3(path+file)
        # Create a new wav file named after the existing mp3 file,
        # prefixed with the gender label
        if m_f == 'male':
            sound.export('male-'+file.split('/')[-1].split('.')[0]+'.wav', format='wav')
        elif m_f == 'female':
            sound.export('female-'+file.split('/')[-1].split('.')[0]+'.wav', format='wav')
    return
# How to use the convert_to_wav() function
convert_to_wav(df_male, m_f='male')
convert_to_wav(df_female, m_f='female')
After running the code above, the new files should be stored in the working directory of the Kaggle notebook. Keep in mind that these files may be saved in a different directory depending on your specified path, especially if you run this project on your local machine.
Up to this step, we have 600 new wav files, each prefixed with either "male" or "female". Therefore, we can set aside the data frame df we used earlier and focus on these new files, since we can now extract the target label directly from the file name.
Furthermore, since the files now have the wav extension, we can use the Librosa module to store the digital signal values in Python variables. We define a load_audio() function to load the raw audio data so we can use it directly.
# Define a function to load the raw audio files
def load_audio(audio_files):
    # Allocate empty lists for male and female voices
    male_voices = []
    female_voices = []
    for file in tqdm(audio_files):
        # librosa.load() returns a (signal, sample_rate) tuple for each file
        if file.split('-')[0] == 'male':
            male_voices.append(librosa.load(file))
        elif file.split('-')[0] == 'female':
            female_voices.append(librosa.load(file))
    # Convert the lists into Numpy arrays
    # (dtype=object because the signals have different lengths)
    male_voices = np.array(male_voices, dtype=object)
    female_voices = np.array(female_voices, dtype=object)
    return male_voices, female_voices
# How to use load_audio() function
male_voices, female_voices = load_audio(os.listdir())
It is important to keep in mind that the values stored in both male_voices and female_voices consist of the raw digital waveform followed by the sample rate. We can think of the sample rate like the FPS (frames per second) rate of a video: a sample rate of 22050 (Librosa's default) means there are 22050 samples within 1 second of the audio clip.
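As a small illustration (a sketch, assuming at least one male clip was loaded), each entry can be unpacked into the signal and its sample rate:
# Each entry is a (signal, sample_rate) pair
signal, sr = male_voices[0]
print(sr)                  # 22050 by default with librosa.load()
print(len(signal) / sr)    # duration of the clip in seconds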
3. Audio feature extraction
A machine learning algorithm works by labeling samples based on given features. We can think of the labels as y (the dependent variable) and the features as X (the independent variables). In the case of voice recognition, feature extraction plays a big role, since raw audio data is not very informative on its own and machine learning algorithms may be unable to detect patterns in it.
There are plenty of audio feature extraction methods out there; we will use MFCC (Mel-Frequency Cepstral Coefficients) due to its easy implementation and high recognition accuracy. The entire feature extraction process is wrapped in the extract_features() function.
# The function to extract audio features
def extract_features(audio_data):
    # Column 0 holds the signals, column 1 holds the sample rates
    audio_waves = audio_data[:,0]
    # All clips share the same sample rate, so any row's value works here
    samplerate = audio_data[:,1][1]
    features = []
    for audio_wave in tqdm(audio_waves):
        # Extract 26 cepstral coefficients per audio frame
        features.append(mfcc(audio_wave, samplerate=samplerate, numcep=26))
    # dtype=object because clips have different numbers of frames
    features = np.array(features, dtype=object)
    return features
# Use the extract_features() function
male_features = extract_features(male_voices)
female_features = extract_features(female_voices)
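Each clip now maps to its own 2-dimensional MFCC matrix. As a quick, illustrative sanity check (not in the original article), the number of rows varies with clip length while the number of columns is fixed at 26:
# Shape of the first male clip's features: (num_frames, 26),
# where num_frames depends on the clip duration
print(male_features[0].shape)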
4. Splitting the data
Each element of male_features and female_features is still a separate MFCC matrix, so we first stack all of them into two long 2-dimensional arrays, male_concatenated and female_concatenated, using the concatenate_features() helper shown below. Next, we concatenate those two arrays and store the result in X, and create a y variable to store all the feature labels. It is important to know that the labels here are encoded values, where 0 represents male and 1 denotes female. After that, we split this X-y pair into training and testing data.
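This helper appears in the complete code at the end of the article; here it is written with a single np.vstack call, which stacks every (num_frames, 26) matrix without duplicating the first one. The shape outputs are the values recorded in the original run:
# The function used to concatenate all audio features, forming a long 2-dimensional array
def concatenate_features(audio_features):
    # Stack every (num_frames, 26) matrix vertically into one array
    return np.vstack(audio_features)
# How the function is used
male_concatenated = concatenate_features(male_features)
female_concatenated = concatenate_features(female_features)
print(male_concatenated.shape)
# Output: (117576, 26)
print(female_concatenated.shape)
# Output: (124755, 26)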
# Concatenate male voices and female voices
X = np.vstack((male_concatenated, female_concatenated))
# Create labels
y = np.append([0] * len(male_concatenated), [1] * len(female_concatenated))
# Check whether X and y have the exact same length
print(X.shape)
# Output: (242268, 26)
print(y.shape)
# Output: (242268,)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
It is also worth knowing that by default the train_test_split() function shuffles the data samples before splitting. Shuffling is fine here, since we only care about the characteristics (timbre) of the voice rather than the content of the utterances.
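As an optional, illustrative check (not part of the original article), we can confirm that the shuffled split keeps both classes in similar proportions:
# Proportion of male (index 0) and female (index 1) frames in each split
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))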
5. Building the model
We are going to use an SVM model as the voice classifier. Any other classifier would also work here, such as logistic regression or even a neural network. Let's see how to implement the SVM classifier.
# Initialize SVM model
clf = SVC(kernel='rbf')
# Train the model
start = time()
clf.fit(X_train[:50000], y_train[:50000])
print(time()-start)
# Output: 184.8018662929535 (seconds)
# Compute the accuracy score towards train data
start = time()
print(clf.score(X_train[:50000], y_train[:50000]))
# Output: 0.78204
print(time()-start)
# Output: 90.8693311214447 (seconds)
# Compute the accuracy score towards test data
start = time()
print(clf.score(X_test[:10000], y_test[:10000]))
# Output: 0.7679
print(time()-start)
# Output: 18.082067728042603 (seconds)
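Since logistic regression was mentioned as an alternative (and it appears in the performance comparison in the complete code), here is a hedged sketch of how it could be swapped in; max_iter is raised because the solver's default may not converge on this data:
# A sketch of the logistic regression alternative
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(max_iter=1000)  # assumption: 1000 iterations suffice
log_clf.fit(X_train[:50000], y_train[:50000])
print(log_clf.score(X_test[:10000], y_test[:10000]))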
6. Evaluating the model
Now let's evaluate this SVM classifier. The model achieves 78.2% accuracy on the training data and 76.8% on the testing data. We then build the confusion matrix on the test data using the code below.
# Predict the first 10000 test data
svm_predictions = clf.predict(X_test[:10000])
# Create the confusion matrix values
cm = confusion_matrix(y_test[:10000], svm_predictions)
# Create the confusion matrix display
plt.figure(figsize=(8,8))
plt.title('Confusion matrix on test data')
sns.heatmap(cm, annot=True, fmt='d',
            cmap=plt.cm.Blues, cbar=False, annot_kws={'size':14})
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
We can see that the Support Vector Machine with the Radial Basis Function kernel is able to identify whether a voice belongs to a male (0) or a female (1) speaker with roughly 77% test accuracy.
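If you want per-class precision and recall in addition to the confusion matrix, scikit-learn's classification_report is a convenient add-on (a sketch, not part of the original article):
from sklearn.metrics import classification_report
# Per-class precision, recall, and F1 on the same 10000 test samples
print(classification_report(y_test[:10000], svm_predictions,
                            target_names=['male', 'female']))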
Complete Code for Gender Recognition by Voice using Python
# Install python_speech_features module
!pip install python_speech_features

# Import all modules
import os
import librosa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from pydub import AudioSegment
from python_speech_features import mfcc
from time import time

# Load the csv file into a data frame
df = pd.read_csv('../input/common-voice/cv-valid-train.csv')

# Create two new data frames
df_male = df[df['gender']=='male']
df_female = df[df['gender']=='female']

# Find out the number of rows
print(df_male.shape)      # output: (55029, 8)
print(df_female.shape)    # output: (18249, 8)

# Take only 300 male and 300 female samples
df_male = df_male[:300]
df_female = df_female[:300]

# Define the audio path
TRAIN_PATH = '../input/common-voice/cv-valid-train/'

# The function to convert mp3 to wav
def convert_to_wav(df, m_f, path=TRAIN_PATH):
    for file in tqdm(df['filename']):
        sound = AudioSegment.from_mp3(path+file)
        # Create new wav files based on the existing mp3 files
        if m_f == 'male':
            sound.export('male-'+file.split('/')[-1].split('.')[0]+'.wav', format='wav')
        elif m_f == 'female':
            sound.export('female-'+file.split('/')[-1].split('.')[0]+'.wav', format='wav')
    return

# How to use the convert_to_wav() function
convert_to_wav(df_male, m_f='male')
convert_to_wav(df_female, m_f='female')

# Define a function to load the raw audio files
def load_audio(audio_files):
    # Allocate empty lists for male and female voices
    male_voices = []
    female_voices = []
    for file in tqdm(audio_files):
        if file.split('-')[0] == 'male':
            male_voices.append(librosa.load(file))
        elif file.split('-')[0] == 'female':
            female_voices.append(librosa.load(file))
    # Convert the lists into Numpy arrays
    male_voices = np.array(male_voices, dtype=object)
    female_voices = np.array(female_voices, dtype=object)
    return male_voices, female_voices

# How to use the load_audio() function
male_voices, female_voices = load_audio(os.listdir())

# The function to extract audio features
def extract_features(audio_data):
    audio_waves = audio_data[:,0]
    samplerate = audio_data[:,1][1]
    features = []
    for audio_wave in tqdm(audio_waves):
        features.append(mfcc(audio_wave, samplerate=samplerate, numcep=26))
    return np.array(features, dtype=object)

# Use the extract_features() function
male_features = extract_features(male_voices)
female_features = extract_features(female_voices)

# The function used to concatenate all audio features, forming a long 2-dimensional array
def concatenate_features(audio_features):
    return np.vstack(audio_features)

# How the function is used
male_concatenated = concatenate_features(male_features)
female_concatenated = concatenate_features(female_features)

print(male_concatenated.shape)    # Output: (117576, 26)
print(female_concatenated.shape)  # Output: (124755, 26)

# Concatenate male voices and female voices
X = np.vstack((male_concatenated, female_concatenated))

# Create labels
y = np.append([0] * len(male_concatenated), [1] * len(female_concatenated))

# Check whether X and y have the exact same length
print(X.shape)    # Output: (242268, 26)
print(y.shape)    # Output: (242268,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

# Initialize SVM model
clf = SVC(kernel='rbf')

# Train the model
start = time()
clf.fit(X_train[:50000], y_train[:50000])
print(time()-start)    # Output: 184.8018662929535 (seconds)

# Compute the accuracy score on the train data
start = time()
print(clf.score(X_train[:50000], y_train[:50000]))    # Output: 0.78204
print(time()-start)    # Output: 90.8693311214447 (seconds)

# Compute the accuracy score on the test data
start = time()
print(clf.score(X_test[:10000], y_test[:10000]))    # Output: 0.7679
print(time()-start)    # Output: 18.082067728042603 (seconds)

# Predict the first 10000 test data
svm_predictions = clf.predict(X_test[:10000])

# Create the confusion matrix values
cm = confusion_matrix(y_test[:10000], svm_predictions)

# Create the confusion matrix display
plt.figure(figsize=(8,8))
plt.title('Confusion matrix on test data')
sns.heatmap(cm, annot=True, fmt='d',
            cmap=plt.cm.Blues, cbar=False, annot_kws={'size':14})
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Performance comparison between different algorithms
index = ['SVM-RBF', 'SVM-Poly', 'SVM-Sigmoid', 'Logistic Regression']
# All the results below were recorded manually
values = [184.8, 137.0, 283.6, 0.7]

plt.figure(figsize=(12,3))
plt.title('Training duration (lower is better)')
plt.xlabel('Seconds')
plt.ylabel('Model')
plt.barh(index, values, zorder=2)
plt.grid(zorder=0)
for i, value in enumerate(values):
    plt.text(value+20, i, str(value)+' secs', fontsize=12, color='black',
             horizontalalignment='center', verticalalignment='center')
plt.show()

# Set the width of the bars
barWidth = 0.25
index = ['SVM-RBF', 'SVM-Poly', 'SVM-Sigmoid', 'Logistic Regression']

# Set the heights of the bars (results recorded manually)
train_acc = [78.2, 74.8, 74.8, 65.8]
test_acc = [76.8, 74.3, 74.3, 65.8]

# Set the position of the bars on the X axis
baseline = np.arange(len(train_acc))
r1 = [x + 0.125 for x in baseline]
r2 = [x + 0.25 for x in r1]

# Make the plot
plt.figure(figsize=(16,9))
plt.title('Model performance (higher is better)')
plt.bar(r1, train_acc, width=barWidth, label='Train', zorder=2)
plt.bar(r2, test_acc, width=barWidth, label='Test', zorder=2)
plt.grid(zorder=0)

# Add xticks in the middle of the grouped bars
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.xticks([r + barWidth for r in range(len(train_acc))], index)

# Add text labels on the bars
for i, value in enumerate(train_acc):
    plt.text(i+0.125, value-5, str(value), fontsize=12, color='white',
             horizontalalignment='center', verticalalignment='center')
for i, value in enumerate(test_acc):
    plt.text(i+0.375, value-5, str(value), fontsize=12, color='white',
             horizontalalignment='center', verticalalignment='center')
plt.legend()
plt.show()
Conclusion
In this article, we built Gender Recognition by Voice using Python. We used the SVM (Support Vector Machine) algorithm, a supervised learning algorithm for classification. You can try other classification algorithms for potentially better results. This project does not run in real time, but we encourage you to explore how it could be converted into real-time voice recognition.
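As a starting point for that exploration, here is a hedged sketch (not part of the original article) of how the trained classifier could label a single new recording; 'sample.wav' is a hypothetical file name, and the per-frame predictions are combined by majority vote:
# Hypothetical example: classify one new wav file with the trained SVM
signal, sr = librosa.load('sample.wav')            # 'sample.wav' is a placeholder
frames = mfcc(signal, samplerate=sr, numcep=26)    # (num_frames, 26) MFCC matrix
frame_preds = clf.predict(frames)                  # 0 = male, 1 = female per frame
print('female' if frame_preds.mean() > 0.5 else 'male')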
Hope you liked this article. Thanks for reading!