Gender Recognition By Voice Using Python

In this article, we build a gender recognition project that classifies a speaker's voice using Python and machine learning. ML models are used in many sectors, including real-time recognition tasks, and recognizing gender from a person's voice is one such task. The complete code is listed at the end of the article. Let’s get started!

Gender Recognition by Voice using Python: Project Overview

Project Name:	Gender Recognition by Voice using Python
Abstract:	An ML-based system that recognizes a person's gender from their voice.
Language/s Used:	Python and machine learning libraries
IDE:	Google Colab (recommended)
Python version (Recommended):	Python 3.x
Database:	Not required
Type:	Program
Recommended for:	Final Year Students

Steps involved in Gender Recognition by Voice

The workflow for building the Gender Recognition by Voice project is as follows:

Collection of data
Exploring the data
Audio feature extraction
Splitting the data
Building the model
Evaluating the model

1. Collection of data

The first step is to choose a dataset for the model. Kaggle hosts many datasets; you only need to sign in and search for the one your project needs. The dataset used here is the one that the code below reads from the ../input/common-voice/ directory of a Kaggle notebook.

The entire dataset is large, at more than 12 GB. Kaggle notebooks let us work with the data without downloading it to local storage. To keep things manageable, we load only the audio in the cv-valid-train folder, whose details are stored in the cv-valid-train.csv file.

2. Exploring the data

Having decided which files to work with, we can now look more closely at the data. We start by loading the modules required for this project; the python_speech_features module has to be installed first.

# Install python_speech_features module
!pip install python_speech_features

# Import all modules
import os
import librosa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from pydub import AudioSegment
from python_speech_features import mfcc
from time import time

# Load the csv file into data frame
df = pd.read_csv('../input/common-voice/cv-valid-train.csv')

The last line above loads the CSV file into a data frame. With the modules and the dataset loaded, we can check the first five rows using df.head().

The dataset has plenty of missing values, including in the gender column itself. Since gender is our target, rows with NaN in that column must be filtered out. To do so, we create two new data frames, df_male and df_female, which store the male and female voice details respectively.

Another problem is that the dataset is heavily imbalanced: there are about three times as many male rows as female rows. We address this by under-sampling, that is, keeping only a small and equal number of rows from each class. Here we keep the first 300 rows of each.
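The check for missing labels can be sketched on a tiny stand-in table. Only the column names filename and gender come from the code in this article; the rows below are invented for illustration.

```python
import io
import pandas as pd

# Toy stand-in for cv-valid-train.csv (invented rows, two known columns)
toy_csv = io.StringIO(
    "filename,gender\n"
    "cv-valid-train/sample-000000.mp3,\n"
    "cv-valid-train/sample-000001.mp3,male\n"
    "cv-valid-train/sample-000002.mp3,female\n"
    "cv-valid-train/sample-000003.mp3,\n"
    "cv-valid-train/sample-000004.mp3,male\n"
)
toy_df = pd.read_csv(toy_csv)

# Count every value in the target column, including missing ones
gender_counts = toy_df["gender"].value_counts(dropna=False)
print(gender_counts)

# Number of rows without a gender label
n_missing = int(toy_df["gender"].isna().sum())

# Comparing against a string drops NaN rows automatically,
# because NaN == 'male' evaluates to False
toy_male = toy_df[toy_df["gender"] == "male"]
toy_female = toy_df[toy_df["gender"] == "female"]
```

Because the equality filter already excludes NaN rows, no separate dropna() call is needed before building the two data frames.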

# Create two new data frames
df_male = df[df['gender']=='male']
df_female = df[df['gender']=='female']

# Find out the number of rows
print(df_male.shape)		
# output: (55029, 8) 

print(df_female.shape)		
# output: (18249, 8)

# Take only 300 male and 300 female data
df_male = df_male[:300]
df_female = df_female[:300]
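Slicing with [:300] keeps whichever rows happen to come first in the CSV file. As a possible alternative, the sketch below draws the rows at random instead. The helper name undersample() and the toy table are hypothetical and not part of the original project.

```python
import pandas as pd

# Toy imbalanced table (invented rows): 8 male clips, 3 female clips
toy_df = pd.DataFrame({
    "filename": [f"sample-{i:06d}.mp3" for i in range(11)],
    "gender": ["male"] * 8 + ["female"] * 3,
})

def undersample(df, label_col, n_per_class, seed=0):
    """Randomly draw n_per_class rows from each class (hypothetical helper)."""
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts).reset_index(drop=True)

balanced = undersample(toy_df, "gender", n_per_class=3)
print(balanced["gender"].value_counts())
```

Fixing the seed keeps the selection reproducible between runs.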

Notice that the audio files in our dataset have the mp3 extension. Depending on the installed audio backend, Librosa may fail to read MP3 files, so we convert them to wav first. To do that, we define a function called convert_to_wav() that uses the AudioSegment object from the Pydub module.

# Define the audio path
TRAIN_PATH = '../input/common-voice/cv-valid-train/'

# The function to convert mp3 to wav
def convert_to_wav(df, m_f, path=TRAIN_PATH):
    for file in tqdm(df['filename']):
        sound = AudioSegment.from_mp3(path+file)

        # Create a new wav file named '<m_f>-<original name>.wav'
        if m_f in ('male', 'female'):
            name = file.split('/')[-1].split('.')[0]
            sound.export(m_f+'-'+name+'.wav', format='wav')

# How to use the convert_to_wav() function
convert_to_wav(df_male, m_f='male')
convert_to_wav(df_female, m_f='female')

After running the code above, the new files are stored in the working directory of the Kaggle notebook. Keep in mind that the files may end up in a different directory depending on the path you specify, especially when you run the project on a local machine.

At this point we have 600 new wav files, each prefixed with either “male” or “female”. We can therefore set aside the data frame df and work with these files only, since the target label can be read directly from the file name.

Since the files are now in wav format, we can use the Librosa module to load the digital signal values into Python variables. The load_audio() function below reads the raw audio data.
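The naming convention can be sketched in isolation. The two helper names below are hypothetical, and the example path is invented; the sketch only mirrors the '<label>-<name>.wav' pattern produced by convert_to_wav().

```python
from pathlib import PurePosixPath

def wav_name(label, mp3_path):
    """Build the output name '<label>-<stem>.wav' (hypothetical helper)."""
    return f"{label}-{PurePosixPath(mp3_path).stem}.wav"

def label_from_name(wav_file):
    """Recover the label from the prefix before the first hyphen."""
    prefix = wav_file.split("-")[0]
    return prefix if prefix in ("male", "female") else None

name = wav_name("female", "cv-valid-train/sample-000005.mp3")
print(name)                          # female-sample-000005.wav
print(label_from_name(name))         # female
print(label_from_name("notes.txt"))  # None
```

Returning None for unrelated files is one way to skip anything else that sits in the working directory.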

# Define a function to load the raw audio files
def load_audio(audio_files):
    # Allocate empty lists for male and female voices
    male_voices = []
    female_voices = []

    for file in tqdm(audio_files):
        # librosa.load() returns a (waveform, sample rate) tuple
        if file.split('-')[0] == 'male':
            male_voices.append(librosa.load(file))
        elif file.split('-')[0] == 'female':
            female_voices.append(librosa.load(file))

    # Convert the lists into NumPy object arrays of shape (n_files, 2);
    # dtype=object is needed because the waveforms differ in length
    male_voices = np.array(male_voices, dtype=object)
    female_voices = np.array(female_voices, dtype=object)

    return male_voices, female_voices

# How to use load_audio() function
male_voices, female_voices = load_audio(os.listdir())

Keep in mind that each entry in male_voices and female_voices holds the raw waveform followed by the sample rate. We can think of the sample rate like the FPS (frames per second) rate in videos. By default, librosa.load() resamples audio to 22050 Hz, so every second of the clip is represented by 22050 samples.
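The relationship between sample count, sample rate and duration can be checked with a synthetic signal. The sine wave below is only a stand-in for a loaded clip; its frequency and length are arbitrary choices.

```python
import numpy as np

sample_rate = 22050          # samples per second (librosa's default)
duration_s = 2.0             # length of the synthetic clip in seconds

# Synthetic 220 Hz sine wave standing in for a loaded audio clip
t = np.arange(int(sample_rate * duration_s)) / sample_rate
wave = np.sin(2 * np.pi * 220.0 * t).astype(np.float32)

n_samples = len(wave)
recovered_duration = n_samples / sample_rate
print(n_samples)             # 44100
print(recovered_duration)    # 2.0
```

Dividing the number of samples by the sample rate is a quick way to get the duration of any clip returned by load_audio().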

3. Audio feature extraction

A machine learning classifier assigns labels to samples based on their features. We can think of the labels as y (the dependent variable) and the features as X (the independent variables). In voice recognition, feature extraction plays a big role because raw audio samples are not very informative on their own, and machine learning algorithms may be unable to detect patterns in them.

There are many audio feature extraction methods. We use MFCC (Mel-frequency cepstral coefficients) because it is easy to implement and commonly gives good recognition accuracy. The entire feature extraction process is wrapped in the extract_features() function.

# The function to extract audio features
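def extract_features(audio_data):
    # Each entry of audio_data is a (waveform, sample rate) pair
    audio_waves = [wave for wave, _ in audio_data]
    # librosa.load() resamples every file to the same rate, so the first one is enough
    samplerate = audio_data[0][1]

    features = []
    for audio_wave in tqdm(audio_waves):
        # One row of 26 MFCCs per short frame of the waveform
        features.append(mfcc(audio_wave, samplerate=samplerate, numcep=26))

    # Files differ in length, so keep the per-file matrices in a list
    return features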

# Use the extract_features() function
male_features = extract_features(male_voices)
female_features = extract_features(female_voices)
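MFCC extraction turns each clip into many short frames, with one row of 26 coefficients per frame, which is why a few hundred clips become a very long feature array. The sketch below estimates the number of frames for a clip, assuming the 25 ms window and 10 ms step that python_speech_features uses by default. The function name and the exact rounding are assumptions made for illustration, so treat the result as an approximation.

```python
import math

def approx_num_frames(n_samples, sample_rate, win_ms=25, step_ms=10):
    """Estimate how many frames a clip is cut into (hypothetical helper)."""
    # Window and step lengths in samples, rounded half up with integer math
    frame_len = (win_ms * sample_rate + 500) // 1000
    frame_step = (step_ms * sample_rate + 500) // 1000
    if n_samples <= frame_len:
        return 1
    return 1 + math.ceil((n_samples - frame_len) / frame_step)

# A 4-second clip at 22050 Hz has 88200 samples
frames = approx_num_frames(4 * 22050, 22050)
print(frames)   # 398
```

At roughly 400 frames for a four-second clip, 300 clips of similar length would give on the order of a hundred thousand rows.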

4. Splitting the data

Each file yields its own matrix of MFCC frames, so the per-file matrices are first stacked into two long two-dimensional arrays, male_concatenated and female_concatenated, using the concatenate_features() helper listed in the complete code at the end of this article. Next, we stack these two arrays into X and create y to hold the label of every row. The labels are encoded as integers, where 0 represents male and 1 represents female. After that, we split the X-y pair into training and testing data.

# Concatenate male voices and female voices
X = np.vstack((male_concatenated, female_concatenated))

# Create labels
y = np.append([0] * len(male_concatenated), [1] * len(female_concatenated))

# Check that X and y have the same number of rows
print(X.shape)		
# Output: (242268, 26)

print(y.shape)		
# Output: (242268,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

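By default, train_test_split() shuffles the samples before splitting. Shuffling is acceptable here because we care only about the timbre (the "color") of the voice, not the order of the utterances. Note that each row of X is a single frame, so frames from the same recording can end up in both the training and the test set, which may make the test accuracy look better than it would be on unseen recordings.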
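Since every row is one frame of a recording, one possible variation is to split by recording rather than by frame, so that no clip contributes frames to both sides. The sketch below shows the idea with NumPy on invented toy data; the clip_ids array is hypothetical bookkeeping that the original code does not keep.

```python
import numpy as np

rng = np.random.default_rng(22)

# Toy data: 6 clips with 4 frames each, 3 features per frame (invented sizes)
n_clips, frames_per_clip, n_features = 6, 4, 3
X_toy = rng.normal(size=(n_clips * frames_per_clip, n_features))
clip_ids = np.repeat(np.arange(n_clips), frames_per_clip)
y_toy = np.repeat([0, 0, 0, 1, 1, 1], frames_per_clip)

# Shuffle the clip ids, then hold out one third of the clips for testing
shuffled = rng.permutation(n_clips)
test_clips = shuffled[: n_clips // 3]
test_mask = np.isin(clip_ids, test_clips)

X_tr, X_te = X_toy[~test_mask], X_toy[test_mask]
y_tr, y_te = y_toy[~test_mask], y_toy[test_mask]

# No clip contributes frames to both sides
no_overlap = set(clip_ids[test_mask]).isdisjoint(clip_ids[~test_mask])
print(X_tr.shape, X_te.shape)   # (16, 3) (8, 3)
```

scikit-learn also provides GroupShuffleSplit for this kind of grouped split.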

5. Building the model

We are going to use the SVM model as the voice classifier. We can also use any other model in this case, like logistic regression or even a neural network. Let’s see how to implement the SVM classifier.

# Initialize SVM model
clf = SVC(kernel='rbf')      

# Train the model
start = time()
clf.fit(X_train[:50000], y_train[:50000])
print(time()-start)						
# Output: 184.8018662929535 (seconds)

# Compute the accuracy score on the training data
start = time()
print(clf.score(X_train[:50000], y_train[:50000]))		
# Output: 0.78204

print(time()-start)						
# Output: 90.8693311214447 (seconds)

# Compute the accuracy score on the test data
start = time()
print(clf.score(X_test[:10000], y_test[:10000]))		
# Output: 0.7679

print(time()-start)						
# Output: 18.082067728042603 (seconds)
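An RBF kernel works with distances between samples, so features with large numeric ranges can dominate the others. One option worth trying is to standardize the features before the SVM. The sketch below does this with a scikit-learn pipeline on synthetic data; the data and the names X_toy, y_toy and model are made up, and the score it prints says nothing about the voice dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class data standing in for MFCC frames (26 features each);
# the second class is shifted so that the classes are separable
X0 = rng.normal(loc=0.0, scale=1.0, size=(200, 26))
X1 = rng.normal(loc=2.0, scale=1.0, size=(200, 26))
X_toy = np.vstack((X0, X1))
y_toy = np.append([0] * 200, [1] * 200)

# Standardize each feature, then fit the same RBF-kernel SVM
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_toy, y_toy)
toy_train_acc = model.score(X_toy, y_toy)
print(toy_train_acc)
```

The pipeline applies the scaling learned during fit() to any data passed to score() or predict(), so the test data is scaled consistently.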

6. Evaluating the model

Now let’s evaluate the SVM classifier. The model reaches 78.2% accuracy on the training subset and 76.8% on the test subset. The code below builds the confusion matrix for the first 10,000 test samples and plots it as a heatmap.

# Predict the first 10000 test data
svm_predictions = clf.predict(X_test[:10000])

# Create the confusion matrix values
cm = confusion_matrix(y_test[:10000], svm_predictions)

# Create the confusion matrix display
plt.figure(figsize=(8,8))
plt.title('Confusion matrix on test data')
sns.heatmap(cm, annot=True, fmt='d', 
            cmap=plt.cm.Blues, cbar=False, annot_kws={'size':14})
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Output for Gender Recognition by Voice using Python

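The Support Vector Machine with a Radial Basis Function kernel identifies whether a frame was spoken by a male (0) or a female (1) more accurately than the other models compared in the complete code below: 76.8% on test data, versus 74.3% for the polynomial and sigmoid kernels and 65.8% for logistic regression.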
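A confusion matrix also gives per-class metrics, not just overall accuracy. The sketch below derives accuracy, recall and precision from a matrix laid out the way confusion_matrix() returns it (rows are true labels, columns are predicted labels). The counts are invented and are not the results of this project.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true label, columns = predicted)
cm_toy = np.array([[40, 10],
                   [ 5, 45]])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm_toy) / cm_toy.sum()

# Recall: share of each true class that was predicted correctly
recall = np.diag(cm_toy) / cm_toy.sum(axis=1)

# Precision: share of each predicted class that was actually correct
precision = np.diag(cm_toy) / cm_toy.sum(axis=0)

print(accuracy)    # 0.85
print(recall)      # recall for class 0 (male) and class 1 (female)
print(precision)   # precision for class 0 (male) and class 1 (female)
```

Comparing recall for the two classes shows whether the model misclassifies one gender more often than the other.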

Complete Code for Gender Recognition by Voice using Python

# Install python_speech_features module
!pip install python_speech_features

# Import all modules
import os
import librosa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from pydub import AudioSegment
from python_speech_features import mfcc
from time import time

# Load the csv file into data frame
df = pd.read_csv('../input/common-voice/cv-valid-train.csv')


# Create two new data frames
df_male = df[df['gender']=='male']
df_female = df[df['gender']=='female']

# Find out the number of rows
print(df_male.shape)		
# output: (55029, 8) 

print(df_female.shape)		
# output: (18249, 8)

# Take only 300 male and 300 female data
df_male = df_male[:300]
df_female = df_female[:300]

# Define the audio path
TRAIN_PATH = '../input/common-voice/cv-valid-train/'

# The function to convert mp3 to wav
def convert_to_wav(df, m_f, path=TRAIN_PATH):
    for file in tqdm(df['filename']):
        sound = AudioSegment.from_mp3(path+file)

        # Create a new wav file named '<m_f>-<original name>.wav'
        if m_f in ('male', 'female'):
            name = file.split('/')[-1].split('.')[0]
            sound.export(m_f+'-'+name+'.wav', format='wav')

# How to use the convert_to_wav() function
convert_to_wav(df_male, m_f='male')
convert_to_wav(df_female, m_f='female')


# Define a function to load the raw audio files
def load_audio(audio_files):
    # Allocate empty lists for male and female voices
    male_voices = []
    female_voices = []

    for file in tqdm(audio_files):
        # librosa.load() returns a (waveform, sample rate) tuple
        if file.split('-')[0] == 'male':
            male_voices.append(librosa.load(file))
        elif file.split('-')[0] == 'female':
            female_voices.append(librosa.load(file))

    # Convert the lists into NumPy object arrays of shape (n_files, 2);
    # dtype=object is needed because the waveforms differ in length
    male_voices = np.array(male_voices, dtype=object)
    female_voices = np.array(female_voices, dtype=object)

    return male_voices, female_voices

# How to use load_audio() function
male_voices, female_voices = load_audio(os.listdir())


# The function to extract audio features
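def extract_features(audio_data):
    # Each entry of audio_data is a (waveform, sample rate) pair
    audio_waves = [wave for wave, _ in audio_data]
    # librosa.load() resamples every file to the same rate, so the first one is enough
    samplerate = audio_data[0][1]

    features = []
    for audio_wave in tqdm(audio_waves):
        # One row of 26 MFCCs per short frame of the waveform
        features.append(mfcc(audio_wave, samplerate=samplerate, numcep=26))

    # Files differ in length, so keep the per-file matrices in a list
    return features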

# Use the extract_features() function
male_features = extract_features(male_voices)
female_features = extract_features(female_voices)


# The function used to concatenate all audio features forming a long 2-dimensional array
def concatenate_features(audio_features):
    # Stack every per-file (n_frames, 26) matrix once, in a single call
    return np.vstack(list(audio_features))

# How the function is used
male_concatenated = concatenate_features(male_features)
female_concatenated = concatenate_features(female_features)

print(male_concatenated.shape) 		
# Output: (117576, 26)

print(female_concatenated.shape)	
# Output: (124755, 26)


# Concatenate male voices and female voices
X = np.vstack((male_concatenated, female_concatenated))

# Create labels
y = np.append([0] * len(male_concatenated), [1] * len(female_concatenated))

# Check that X and y have the same number of rows
print(X.shape)		
# Output: (242268, 26)

print(y.shape)		
# Output: (242268,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)


# Initialize SVM model
clf = SVC(kernel='rbf')      

# Train the model
start = time()
clf.fit(X_train[:50000], y_train[:50000])
print(time()-start)						
# Output: 184.8018662929535 (seconds)

# Compute the accuracy score on the training data
start = time()
print(clf.score(X_train[:50000], y_train[:50000]))		
# Output: 0.78204

print(time()-start)						
# Output: 90.8693311214447 (seconds)

# Compute the accuracy score on the test data
start = time()
print(clf.score(X_test[:10000], y_test[:10000]))		
# Output: 0.7679

print(time()-start)						
# Output: 18.082067728042603 (seconds)


# Predict the first 10000 test data
svm_predictions = clf.predict(X_test[:10000])

# Create the confusion matrix values
cm = confusion_matrix(y_test[:10000], svm_predictions)

# Create the confusion matrix display
plt.figure(figsize=(8,8))
plt.title('Confusion matrix on test data')
sns.heatmap(cm, annot=True, fmt='d', 
            cmap=plt.cm.Blues, cbar=False, annot_kws={'size':14})
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()


# Performance comparison between different algorithms
index = ['SVM-RBF', 'SVM-Poly', 'SVM-Sigmoid', 'Logistic Regression']

# Training times in seconds, recorded manually from separate runs
values = [184.8, 137.0, 283.6, 0.7]

plt.figure(figsize=(12,3))
plt.title('Training duration (lower is better)')
plt.xlabel('Seconds')
plt.ylabel('Model')
plt.barh(index, values, zorder=2)
plt.grid(zorder=0)

for i, value in enumerate(values):
    plt.text(value+20, i, str(value)+' secs', fontsize=12, color='black',
             horizontalalignment='center', verticalalignment='center')

plt.show()


# set width of bar
barWidth = 0.25
    
index = ['SVM-RBF', 'SVM-Poly', 'SVM-Sigmoid', 'Logistic Regression']

# set height of bar
# Accuracy values in percent, recorded manually from separate runs
train_acc = [78.2, 74.8, 74.8, 65.8]
test_acc = [76.8, 74.3, 74.3, 65.8]
 
# Set position of bar on X axis
baseline = np.arange(len(train_acc))
r1 = [x + 0.125 for x in baseline]
r2 = [x + 0.25 for x in r1]
 
# Make the plot
plt.figure(figsize=(16,9))
plt.title('Model performance (higher is better)')
plt.bar(r1, train_acc, width=barWidth, label='Train', zorder=2)
plt.bar(r2, test_acc, width=barWidth, label='Test', zorder=2)
plt.grid(zorder=0)
 
# Add xticks on the middle of the group bars
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.xticks([r + barWidth for r in range(len(train_acc))], index)

# Create text
for i, value in enumerate(train_acc):
    plt.text(i+0.125, value-5, str(value), fontsize=12, color='white',
             horizontalalignment='center', verticalalignment='center')
    
for i, value in enumerate(test_acc):
    plt.text(i+0.375, value-5, str(value), fontsize=12, color='white',
             horizontalalignment='center', verticalalignment='center')
    
plt.legend()
plt.show()

Conclusion

In this article, we built a gender recognition system based on voice using Python. We used the SVM (support vector machine) algorithm, a supervised learning algorithm used for classification. You can try other classification algorithms to see whether they give more accurate results. This project does not work in real time, but we recommend exploring how it could be converted into a real-time voice recognition system.
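One step towards using the model on whole recordings: the classifier predicts a label for every frame, so a recording needs a single decision built from many frame predictions. A simple option is a majority vote, sketched below. The clip_label() helper is hypothetical, and the input lists stand in for the output of clf.predict() on the frames of one clip.

```python
import numpy as np

def clip_label(frame_predictions):
    """Majority vote over per-frame predictions (0 = male, 1 = female)."""
    votes = np.bincount(np.asarray(frame_predictions, dtype=int), minlength=2)
    # argmax returns the first maximum, so a tie resolves to 0
    return int(np.argmax(votes))

print(clip_label([0, 1, 1, 1, 0, 1]))   # 1
print(clip_label([0, 0, 1, 0]))         # 0
```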

Hope you liked this article. Thanks for reading!
