Hate Speech Detection With Python

In this article, we build a hate speech detection project in Python. Almost everyone now uses social media apps to connect and interact with people around the world. At the same time, social media is a place where people share personal opinions about others, and many of those opinions are offensive or hateful.

Project Overview: Hate Speech Detection

Project Name:	Hate Speech Detection in Machine Learning with Python
Abstract:	In this project, we learn how to detect hate speech using the Python programming language
Language/Technologies Used:	Python, NLTK, Pandas, NumPy, scikit-learn
IDE:	Google Colab or Jupyter
Python version (Recommended):	3.8 or 3.9
Type:	Machine Learning and Deep Learning Project
Developer:	Keerthana Buvaneshwaran
Updates:	0

Project Information/overview

What is Hate Speech detection?

Hate speech detection is the task of identifying hateful and offensive language posted on the internet, usually with a trained model. Social media gives many people a place to make hateful and offensive comments about others, so automatic hate speech detection has become an important tool in today’s online world.

Now that we understand the main goal of this project, let’s start building the hate speech detection project in Python.

Steps in building Hate Speech detection using Machine Learning

Before moving on to the implementation, let us look at the steps involved in building a hate speech detection project with Python.

Set up the development environment
Understand the data
Import the required libraries
Preprocess the data
Split the data
Build the model
Evaluate the results

Setting up the development environment

The first step is to set up the development environment. To build this project you need a system with Jupyter Notebook installed. Alternatively, you can use Google Colab (https://colab.research.google.com/).

Understanding the data

The dataset for building our hate speech detection model is available on Kaggle (www.kaggle.com). It consists of tweets collected for research on hate speech detection. Each tweet is classified as hate speech, offensive language, or neither. Due to the nature of the study, it’s important to note that this dataset contains text that can be considered racist, sexist, homophobic, or generally offensive.

You can find the dataset for hate speech detection here https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset

There are 7 columns in the hate speech detection dataset: index, count, hate_speech, offensive_language, neither, class and tweet. The columns are described below.

index – The row index
count – The number of users who coded each tweet
hate_speech – The number of users who judged the tweet to be hate speech
offensive_language – The number of users who judged the tweet to be offensive
neither – The number of users who judged the tweet to be neither hate speech nor offensive
class – The class label chosen by the majority of users: 0 denotes hate speech, 1 denotes offensive language, and 2 denotes neither
tweet – The text of the tweet
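
To make the relationship between the vote columns and the class label concrete, the sketch below builds a tiny made-up table with the same column names and derives a majority label from the three vote counts. The rows are invented for illustration and are not taken from the Kaggle file, and derived_class is a hypothetical column name.

```python
import pandas as pd

# Invented rows that mimic the dataset's columns (not real data).
toy = pd.DataFrame({
    "count": [3, 3, 6],
    "hate_speech": [0, 2, 1],
    "offensive_language": [3, 1, 1],
    "neither": [0, 0, 4],
})

vote_columns = ["hate_speech", "offensive_language", "neither"]

# In this toy table every annotator casts one vote,
# so the three vote columns add up to `count`.
votes_match = bool((toy[vote_columns].sum(axis=1) == toy["count"]).all())

# Pick the column with the most votes and encode it like `class`:
# 0 = hate speech, 1 = offensive language, 2 = neither.
code_for = {"hate_speech": 0, "offensive_language": 1, "neither": 2}
toy["derived_class"] = toy[vote_columns].idxmax(axis=1).map(code_for)

print(votes_match)
print(toy["derived_class"].tolist())
```

Running the same check against the real file is a quick way to confirm how the class column was built before training on it.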

Importing the required libraries

After analyzing the data, our next step is to import the required libraries for our project. The libraries we use in this project are pandas, NumPy, scikit-learn, and NLTK.

# Importing the packages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

We also import NLTK (the Natural Language Toolkit), a Python library for symbolic and statistical natural language processing. Here we use its English stop-word list and its Snowball stemmer.

import nltk
import re
# Download the stop-word list (only needed the first time)
nltk.download('stopwords')
from nltk.corpus import stopwords
stopword = set(stopwords.words('english'))
stemmer = nltk.SnowballStemmer("english")

After importing the required libraries, it is time to load the data in our project.

data = pd.read_csv("data.csv")
# To preview the data
print(data.head())

Preprocessing the data

In data preprocessing, we prepare the raw data and make it suitable for a machine learning model. It is the first and a crucial step in creating a machine learning model, because real-world data is rarely clean and well formatted. Here we first map the numeric class codes to readable labels and keep only the tweet and labels columns.

data["labels"] = data["class"].map({0: "Hate Speech", 1: "Offensive Speech", 2: "No Hate and Offensive Speech"})
data = data[["tweet", "labels"]]
print(data.head())

Earlier we set up two important natural language processing tools, a stop-word list and a stemmer. Stop words are very common words (such as “the”, “is”, and “and”) that carry little meaning on their own, so we remove them from the input. Stemming reduces each word to its root form, so that morphological variants of the same word are treated as one token. Stemming every word in each tweet shrinks the vocabulary and makes prediction easier.
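
As a small illustration of both ideas, the sketch below filters a sentence through a stop-word set and then stems what is left. It assumes NLTK is installed; toy_stopwords is a tiny hand-written set used only so the example runs without downloading the NLTK corpus, whereas the project itself uses stopwords.words('english').

```python
from nltk.stem.snowball import SnowballStemmer

# Hand-written stand-in for NLTK's English stop-word list (an assumption
# made for this sketch; the real list is much longer).
toy_stopwords = {"the", "is", "are", "and", "to", "a"}

stemmer = SnowballStemmer("english")

sentence = "the users are connecting and connected to a connection"

# Step 1: drop stop words.
kept = [word for word in sentence.split() if word not in toy_stopwords]

# Step 2: reduce each remaining word to its stem.
stems = [stemmer.stem(word) for word in kept]

print(kept)   # ['users', 'connecting', 'connected', 'connection']
print(stems)  # ['user', 'connect', 'connect', 'connect']
```

Three different surface forms collapse into the single token connect, which is the effect the stemmer has on the tweet vocabulary.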

import string  # needed for string.punctuation below

def clean(text):
    # Lowercase, then strip bracketed text, URLs, HTML tags, punctuation,
    # newlines, and any token that contains a digit
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Drop stop words, then reduce each remaining word to its stem
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

data["tweet"] = data["tweet"].apply(clean)
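
To see what regex-based cleaning of this kind does to a single tweet, here is a minimal, standard-library-only sketch. The helper name strip_noise, the exact patterns, and the sample tweet are assumptions made for illustration, and the stop-word and stemming steps are left out so the example runs without NLTK.

```python
import re
import string

def strip_noise(text):
    """Hypothetical helper: lowercase and remove URLs, tags, punctuation, digit tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                 # URLs
    text = re.sub(r"<.*?>", "", text)                                 # HTML-like tags
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\w*\d\w*", "", text)                              # tokens containing a digit
    return " ".join(text.split())                                     # collapse extra whitespace

# Invented example tweet.
sample = "RT @user123: Check <b>THIS</b> out!! https://t.co/abc 2nice"
print(strip_noise(sample))  # rt check this out
```

The order matters: URLs and tags are removed before punctuation, because stripping punctuation first would break the patterns that recognise them.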

Splitting the data

The next step is to convert the cleaned tweets into numeric features with CountVectorizer and divide the dataset into training and testing sets.

x = np.array(data["tweet"])
y = np.array(data["labels"])
cv = CountVectorizer()
X = cv.fit_transform(x)
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
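
The sketch below repeats the same two calls on a handful of made-up, already-cleaned texts, so the word-count matrix and the split sizes are small enough to read. The texts, labels, and the 0.25 test size are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up, already-cleaned texts and labels (not from the dataset).
texts = np.array([
    "hate speech bad", "nice speech", "bad bad day", "nice day",
    "hate bad", "speech day", "nice nice", "bad speech",
])
labels = np.array(["bad", "ok", "bad", "ok", "bad", "ok", "ok", "bad"])

cv = CountVectorizer()
X = cv.fit_transform(texts)  # sparse matrix: one row per text, one column per word

# Words ordered by their column index in X.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
first_rows = X.toarray()[:3]
print(vocab)       # ['bad', 'day', 'hate', 'nice', 'speech']
print(first_rows)  # word counts for the first three texts

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (6, 5) (2, 5)
```

Each column counts one vocabulary word, so "bad bad day" becomes the row [2, 1, 0, 0, 0].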

Building the model

After splitting the data, the next task is to choose an algorithm for our model. We use a decision tree classifier to build the hate speech detection project. Decision trees are a supervised machine learning method that can be used for both classification and regression; here we use one for classification.

# Model building
model = DecisionTreeClassifier()
# Training the model
model.fit(X_train, y_train)

Evaluating the results

The final step is to evaluate the model. We predict labels for the test set and measure how well the predictions match the true labels.

# Testing the model
y_pred = model.predict(X_test)
y_pred

# Accuracy score of our model
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

Output:
0.8745567917838366

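We can infer that our hate speech detection model reaches an accuracy of about 87 percent on the test set. The exact value may differ slightly between runs, because DecisionTreeClassifier is created without a fixed random_state.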
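
A single accuracy number can hide weak performance on a rare class, so it may be worth looking at per-class results as well. The sketch below uses made-up true and predicted labels (not results from this model) to show how a confusion matrix and per-class recall expose that; the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

labels = ["Hate Speech", "Offensive Speech", "No Hate and Offensive Speech"]

# Made-up true and predicted labels, for illustration only.
y_true = [labels[0], labels[1], labels[1], labels[1], labels[2], labels[2]]
y_hat = [labels[1], labels[1], labels[1], labels[1], labels[2], labels[1]]

acc = accuracy_score(y_true, y_hat)
# Rows are true classes and columns are predicted classes, in `labels` order.
matrix = confusion_matrix(y_true, y_hat, labels=labels)
# Fraction of each true class that was predicted correctly.
per_class_recall = recall_score(y_true, y_hat, labels=labels, average=None)

print(acc)
print(matrix)
print(per_class_recall)
```

In this toy case four of six predictions are correct, yet the recall for "Hate Speech" is 0 because its only example was labelled "Offensive Speech". The same three calls can be applied to y_test and y_pred from the project.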

#Predicting the outcome
inp = "You are too bad and I dont like your attitude"
inp = cv.transform([inp]).toarray()
print(model.predict(inp))

Output:
['Offensive Speech']

inp = "It is really awesome"
inp = cv.transform([inp]).toarray()
print(model.predict(inp))

Output:
['No Hate and Offensive Speech']
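
The prediction snippets above pass the raw sentence straight to the vectorizer, while the training tweets went through clean() first. One possible way to keep the two consistent is to bundle the cleaning step, the vectorizer, and the classifier into a single scikit-learn pipeline, as in the minimal sketch below. The simple_clean function is a hypothetical stand-in for the article's clean(), and the training texts are made up.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def simple_clean(text):
    """Hypothetical stand-in for clean(): lowercase and keep letters and spaces only."""
    return re.sub(r"[^a-z\s]", "", text.lower())

# Made-up training texts and labels (not from the dataset).
train_texts = ["You are AWFUL!!", "awful awful person", "What a lovely day", "lovely, thanks!"]
train_labels = [
    "Offensive Speech", "Offensive Speech",
    "No Hate and Offensive Speech", "No Hate and Offensive Speech",
]

# The vectorizer calls simple_clean on every text, at fit time and at predict time.
pipeline = make_pipeline(
    CountVectorizer(preprocessor=simple_clean),
    DecisionTreeClassifier(random_state=0),
)
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(["AWFUL!!!", "Lovely :)"])
print(predictions)
```

Because the pipeline is fitted on raw text, it can also be fitted on the training split alone, so the vocabulary is learned without looking at the test tweets.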

Complete code for Hate Speech Detection in Python

# Importing the packages
import re
import string
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Download the stop-word list (only needed the first time)
nltk.download('stopwords')
from nltk.corpus import stopwords
stopword = set(stopwords.words('english'))
stemmer = nltk.SnowballStemmer("english")

data = pd.read_csv("data.csv")
# To preview the data
print(data.head())

# Map the numeric class codes to readable labels
data["labels"] = data["class"].map({0: "Hate Speech", 1: "Offensive Speech", 2: "No Hate and Offensive Speech"})
data = data[["tweet", "labels"]]
print(data.head())

def clean(text):
    # Lowercase, then strip bracketed text, URLs, HTML tags, punctuation,
    # newlines, and any token that contains a digit
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Drop stop words, then reduce each remaining word to its stem
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

data["tweet"] = data["tweet"].apply(clean)

x = np.array(data["tweet"])
y = np.array(data["labels"])
cv = CountVectorizer()
X = cv.fit_transform(x)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Model building
model = DecisionTreeClassifier()
# Training the model
model.fit(X_train, y_train)

# Testing the model
y_pred = model.predict(X_test)

# Accuracy score of our model
print(accuracy_score(y_test, y_pred))

# Predicting the outcome
inp = "You are too bad and I dont like your attitude"
inp = cv.transform([inp]).toarray()
print(model.predict(inp))

Conclusion

In this article, we built a hate speech detection project using machine learning. Hate speech is a serious issue on social media platforms such as Facebook and Twitter. I hope you enjoyed building a project to detect hate speech with Python.
