In this article, we build a hate speech detection project in Python. Almost everyone today uses social media apps to connect and interact with people around the world. Social media is also a place where people share personal opinions about others, and many of those opinions are offensive or hateful.
Project Overview: Hate Speech Detection
Project Name: | Hate Speech Detection in Machine Learning with Python |
Abstract: | In this project, we learn how to detect hate speech using the Python programming language |
Language/Technologies Used: | Python, NLTK, Pandas, NumPy, scikit-learn |
IDE: | Google Colab or Jupyter |
Python version (Recommended): | 3.8 or 3.9 |
Type: | Machine Learning Project |
Developer: | Keerthana Buvaneshwaran |
Updates: | 0 |
What is Hate Speech detection?
A hate speech detection model identifies hateful and offensive language posted on the internet. Because many people use social media to make hateful and offensive comments about others, automated hate speech detection has become an important tool for moderating online platforms.
Now that we understand the main goal of this project, let’s start building the hate speech detection project in Python.
Steps in building Hate Speech detection using Machine Learning
Before moving into the implementation part directly, let us get an insight into the steps in building a Hate Speech detection project with Python.
- Set up the development environment
- Understand the data
- Import the required libraries
- Preprocess the data
- Split the data
- Build the model
- Evaluate the results
Setting up the development environment
The first step is to set up the development environment. To follow along, you need a system with Jupyter Notebook installed. Alternatively, you can use Google Colab https://colab.research.google.com/ to develop this project.
Understanding the data
The dataset for our hate speech detection model is available on www.kaggle.com. It consists of Twitter data collected for hate speech detection research. Each tweet is classified as hate speech, offensive language, or neither. Due to the nature of the study, it’s important to note that this dataset contains text that can be considered racist, sexist, homophobic, or generally offensive.
You can find the dataset for hate speech detection here https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset
There are 7 columns in the hate speech detection dataset: index, count, hate_speech, offensive_language, neither, class and tweet. The columns are described below.
index – The index value of the row
count – The number of users who coded each tweet
hate_speech – The number of users who judged the tweet to be hate speech
offensive_language – The number of users who judged the tweet to be offensive
neither – The number of users who judged the tweet to be neither hate speech nor offensive
class – The class label chosen by the majority of users, where 0 denotes hate speech, 1 denotes offensive language and 2 denotes neither
tweet – The text of the tweet
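To illustrate how the class column relates to the three vote columns, here is a minimal sketch in plain Python. The rows are made up for illustration and are not taken from the dataset, and the tie-breaking rule (lowest class id wins) is an assumption of this sketch, not a documented rule of the dataset.

```python
# Illustrative only: the rows below are made up, not taken from the dataset.
CLASS_NAMES = {0: "hate_speech", 1: "offensive_language", 2: "neither"}


def majority_class(hate_speech, offensive_language, neither):
    """Return the class id (0, 1 or 2) that received the most votes.

    Ties go to the lower class id, which is an assumption of this sketch.
    """
    votes = [hate_speech, offensive_language, neither]
    return votes.index(max(votes))


rows = [
    {"count": 3, "hate_speech": 0, "offensive_language": 3, "neither": 0},
    {"count": 3, "hate_speech": 2, "offensive_language": 1, "neither": 0},
    {"count": 6, "hate_speech": 0, "offensive_language": 1, "neither": 5},
]

for row in rows:
    label = majority_class(row["hate_speech"], row["offensive_language"], row["neither"])
    # In these sample rows the three vote columns add up to the annotator count.
    assert row["hate_speech"] + row["offensive_language"] + row["neither"] == row["count"]
    print(label, CLASS_NAMES[label])
```

A check like this can be a quick sanity test after loading the real file, to confirm that the labels match the votes before training on them.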
Importing the required libraries
After analyzing the data, our next step is to import the required libraries. The libraries we use in this project are pandas, numpy, scikit-learn, and nltk.
# Importing the packages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
We are going to import NLTK (the Natural Language Toolkit), a Python library for symbolic and statistical natural language processing of English text.
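import nltk
import re
import string  # provides string.punctuation, used in the cleaning step
# Download the stopword list (only needed the first time)
nltk.download('stopwords')
from nltk.corpus import stopwords
stopword = set(stopwords.words('english'))
stemmer = nltk.SnowballStemmer("english")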
After importing the required libraries, it is time to load the data in our project.
# Assumes the dataset CSV has been saved as data.csv in the working directory
data = pd.read_csv("data.csv")
# To preview the data
print(data.head())
Preprocessing the data
Data preprocessing prepares the raw data and makes it suitable for a machine learning model. It is a crucial early step, because real-world data is rarely clean and well formatted. Here we first map the numeric class values to readable labels and keep only the tweet and labels columns.
data["labels"] = data["class"].map({0: "Hate Speech", 1: "Offensive Speech", 2: "No Hate and Offensive Speech"})
data = data[["tweet", "labels"]]
print(data.head())
The cleaning function below relies on two important natural language processing concepts: stopwords and stemming. Stopwords are very common words (such as "the" or "is") that carry little meaning for classification, so we remove them from the input. Stemming reduces each word to its root form, so that variants of the same word are treated as one token, which makes prediction easier.
import string  # needed for string.punctuation

def clean(text):
    text = str(text).lower()
    # Remove text inside square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML-like tags
    text = re.sub(r'<.*?>+', '', text)
    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove line breaks
    text = re.sub(r'\n', '', text)
    # Remove words that contain digits
    text = re.sub(r'\w*\d\w*', '', text)
    # Drop stopwords, then stem the remaining words
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

data["tweet"] = data["tweet"].apply(clean)
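To see what the regex-based cleaning steps do on their own, here is a small sketch that uses only the standard library and leaves out stopword removal and stemming. The function name strip_noise and the sample sentence are hypothetical. The patterns assume the intent is to remove bracketed text, URLs, HTML-like tags, punctuation, line breaks and tokens that contain digits.

```python
import re
import string


def strip_noise(text):
    """Apply the regex cleaning steps only (no stopwords, no stemming)."""
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                # text inside square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"<.*?>+", "", text)                 # HTML-like tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\n", "", text)                     # line breaks
    text = re.sub(r"\w*\d\w*", "", text)               # tokens containing digits
    return text


# Made-up sample input for illustration.
sample = "Check THIS out [link] https://example.com <b>now</b>!!! room101 rocks"
print(strip_noise(sample).split())
```

The removed pieces leave extra spaces behind, which is why the example prints the whitespace-separated tokens and not the raw string.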
Splitting the data
The next step is to convert the cleaned tweets into numeric features with CountVectorizer and divide the dataset into training and testing data.
x = np.array(data["tweet"])
y = np.array(data["labels"])
# Convert the tweets into a matrix of token counts
cv = CountVectorizer()
X = cv.fit_transform(x)
# Splitting the data: 33% is held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
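To give a feel for the features that a count-based vectorizer produces, here is a simplified bag-of-words sketch in plain Python. It is not how CountVectorizer is implemented: the sketch splits on whitespace and returns nested lists, while CountVectorizer applies its own tokenization rules and returns a sparse matrix. The function name bag_of_words and the sample documents are hypothetical.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    tokenized = [doc.split() for doc in docs]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["good day good mood", "bad day"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['bad', 'day', 'good', 'mood']
print(matrix)      # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Each tweet becomes one row of counts, and the classifier only ever sees these numbers, not the original text.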
Building the model
After splitting the data, our next task is to choose an algorithm for our model. We use a decision tree classifier for the hate speech detection project. Decision trees are a supervised machine learning method that can be used for both classification and regression; here we use one for classification.
# Model building
model = DecisionTreeClassifier()
# Training the model
model.fit(X_train, y_train)
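To make the idea of a decision tree more concrete, here is a small sketch of Gini impurity, which is the default split criterion of scikit-learn's DecisionTreeClassifier. The helper names gini and split_impurity and the toy labels are hypothetical; the real library evaluates many candidate splits over the feature matrix, which this sketch does not attempt.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)


parent = ["offensive", "offensive", "neither", "neither"]
# Hypothetical split on whether some word occurs in the tweet.
left, right = ["offensive", "offensive"], ["neither", "neither"]
print(gini(parent))                 # 0.5
print(split_impurity(left, right))  # 0.0
```

A split that lowers the weighted impurity, as this one does, separates the classes better than the parent node, which is roughly what the tree looks for at every step.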
Evaluating the results
The final step is to evaluate the model. In this step, we predict labels for the test set and measure how well our model performs on it.
# Testing the model
y_pred = model.predict(X_test)
y_pred
# Accuracy score of our model
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
Output:
0.8745567917838366
We can infer that our hate speech detection model reaches an accuracy of about 87 percent on the test set.
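Accuracy is a single number, so it can hide poor performance on a class that appears rarely. The sketch below uses made-up labels, not results from this project, to show how accuracy and per-class recall can disagree. The helper names are hypothetical; scikit-learn's classification_report gives a similar per-class breakdown for real predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    recalls = {}
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in indices)
        recalls[label] = hits / len(indices)
    return recalls


# Made-up example: a model that always predicts the majority class.
y_true = ["Offensive Speech"] * 8 + ["Hate Speech"] * 2
y_pred = ["Offensive Speech"] * 10
print(accuracy(y_true, y_pred))          # 0.8
print(recall_per_class(y_true, y_pred))  # {'Hate Speech': 0.0, 'Offensive Speech': 1.0}
```

In this toy case the accuracy looks reasonable even though no hate speech example is detected, so it is worth checking per-class scores alongside the overall accuracy.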
#Predicting the outcome
inp = "You are too bad and I dont like your attitude"
inp = cv.transform([inp]).toarray()
print(model.predict(inp))
Output:
['Offensive Speech']
inp = "It is really awesome"
inp = cv.transform([inp]).toarray()
print(model.predict(inp))
Output:
['No Hate and Offensive Speech']
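The training tweets were passed through the cleaning function, while the example inputs above go straight to the vectorizer. One possible way to keep both paths consistent is a small helper like the sketch below. The name predict_label and its parameters are hypothetical; it only assumes that the vectorizer has a transform method and the model has a predict method, as the objects in this article do.

```python
def predict_label(text, clean_fn, vectorizer, model):
    """Clean one raw string, vectorize it, and return the predicted label."""
    cleaned = clean_fn(text)
    # transform expects a list of documents, so wrap the single string.
    features = vectorizer.transform([cleaned])
    return model.predict(features)[0]


# With the objects defined earlier in this article, a call could look like:
# predict_label("It is really awesome", clean, cv, model)
```

Passing the cleaning function, vectorizer and model as arguments keeps the helper independent of global variables, which makes it easier to reuse in another notebook.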
Complete code for Hate Speech Detection in Python
# Importing the packages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import nltk
import re
import string  # provides string.punctuation, used in clean()

nltk.download('stopwords')
from nltk.corpus import stopwords
stopword = set(stopwords.words('english'))
stemmer = nltk.SnowballStemmer("english")

# Assumes the dataset CSV has been saved as data.csv in the working directory
data = pd.read_csv("data.csv")
# To preview the data
print(data.head())

data["labels"] = data["class"].map({0: "Hate Speech", 1: "Offensive Speech", 2: "No Hate and Offensive Speech"})
data = data[["tweet", "labels"]]
print(data.head())

def clean(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text inside square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # line breaks
    text = re.sub(r'\w*\d\w*', '', text)               # words containing digits
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

data["tweet"] = data["tweet"].apply(clean)

x = np.array(data["tweet"])
y = np.array(data["labels"])
cv = CountVectorizer()
X = cv.fit_transform(x)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Model building
model = DecisionTreeClassifier()
# Training the model
model.fit(X_train, y_train)

# Testing the model
y_pred = model.predict(X_test)

# Accuracy score of our model
print(accuracy_score(y_test, y_pred))

# Predicting the outcome
inp = "You are too bad and I dont like your attitude"
inp = cv.transform([inp]).toarray()
print(model.predict(inp))
Conclusion
In this article, we built a hate speech detection project using machine learning. Hate speech is a serious issue on social media platforms like Facebook and Twitter. We hope you enjoyed building this project to detect hate speech with Python.