In Part 1, we covered how to get started with your first machine learning project on Kaggle: we imported the necessary libraries, loaded the data, and did exploratory data analysis to understand it. Now we will move on to data preprocessing of the Titanic dataset. Data preprocessing is a crucial step in any machine learning project, because the quality of the model depends heavily on the quality of the dataset. It is therefore necessary to clean the data before actually building the model. In a nutshell, these are the basic steps you need to perform in any machine learning project.
Steps of a Machine Learning Project
- Importing Libraries & Loading Dataset. (Part 1)
- Exploratory data analysis (visualizations). (Part 1)
- Data Preprocessing
- Building machine learning models
- Prediction
Let’s start with data preprocessing now.
Data Preprocessing
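This part picks up where Part 1 left off, so it assumes the imports and the loaded training data are already in your notebook. If you are starting from a fresh notebook, here is a minimal sketch of that setup (the file path is the standard Kaggle input path for this competition):
import numpy as np
import pandas as pd

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')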
Firstly, let’s see how many null or missing values there are in each column. For this, we’re going to use the isnull() and sum() methods. Write the following line in a new code cell and press Shift + Enter to run it.
train_data.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
As we can see, there are 177 missing values in the Age column and 687 missing values in the Cabin column. We have to deal with these missing values in order to build a good machine learning model. Let us start by handling the missing values of the Age column.
Handling Missing Values of Age Column
In order to handle the null values, we are going to fill them with random values within the range [mean - standard deviation, mean + standard deviation], so that the imputed ages stay close to the existing age distribution.
# Fill missing ages with random integers in [mean - std, mean + std]
mean = train_data["Age"].mean()
std = train_data["Age"].std()
n_missing = train_data["Age"].isnull().sum()  # 177 in the training set
rand_age = np.random.randint(mean - std, mean + std, size=n_missing)
age_slice = train_data["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
train_data["Age"] = age_slice
# Again checking for null values
train_data.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
As you can see, there are no missing values in the Age column now. But what about the 687 null values in the Cabin column? Actually, we don’t need to do anything about those. If you look carefully, some features have nothing to do with survival probability, such as PassengerId, Ticket number, Name of the passenger, and Cabin number, so we can safely drop them before building our ML model.
Dropping 🗑️ Columns
col_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
train_data.drop(col_to_drop, axis=1, inplace=True)
train_data.head(10)
Converting Categorical Variables to Numeric
Now, as you can observe, we have two categorical variables, namely Sex and Embarked. Machine learning models only understand numbers, not text, so we have to convert these categorical columns to numerical ones.
genders = {"male":0, "female":1}
train_data["Sex"] = train_data["Sex"].map(genders)
ports = {"S":0, "C":1, "Q":2}
train_data["Embarked"] = train_data["Embarked"].map(ports)
train_data.head()
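A side note: mapping S/C/Q to 0/1/2 imposes an artificial ordering on the ports. Tree-based models handle this fine, but for linear models such as logistic regression, one-hot encoding is a common alternative. Here is a minimal sketch using pandas, shown for comparison only; you would apply it instead of the map() call above, not in addition to it:
# One-hot encode Embarked into separate 0/1 indicator columns
embarked_dummies = pd.get_dummies(train_data["Embarked"], prefix="Embarked")
# train_data = pd.concat([train_data.drop("Embarked", axis=1), embarked_dummies], axis=1)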
Building Machine Learning Model 🤖
So, that was all about data preprocessing; our Titanic dataset is now ready. Let’s quickly train our machine learning models.
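The cells below assume the scikit-learn imports from Part 1. For reference, these cover everything used in this section:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree, svm
from sklearn.metrics import accuracy_score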
# Feature columns
df_train_x = train_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
# Target variable column (a Series, so the classifiers don't warn about column vectors)
df_train_y = train_data['Survived']
# Train Test Splitting
x_train, x_test, y_train, y_test = train_test_split(df_train_x, df_train_y, test_size=0.20, random_state=42)
Lastly, we are going to fit our model using five different classification algorithms, namely Random Forest Classifier, Logistic Regression, K-Neighbors Classifier, Decision Tree Classifier, and Support Vector Machine, and eventually compare them.
Random Forest Classifier
# Creating an instance of the classifier
clf1 = RandomForestClassifier()
# Fitting the model using training data
clf1.fit(x_train, y_train)
# Predicting on test data
rfc_y_pred = clf1.predict(x_test)
# Calculating Accuracy to compare all models
rfc_accuracy = accuracy_score(y_test,rfc_y_pred) * 100
print("accuracy=",rfc_accuracy)
Logistic Regression
clf2 = LogisticRegression()
clf2.fit(x_train, y_train)
lr_y_pred = clf2.predict(x_test)
lr_accuracy = accuracy_score(y_test,lr_y_pred)*100
print("accuracy=",lr_accuracy)
K-Neighbors Classifier
clf3 = KNeighborsClassifier(n_neighbors=5)
clf3.fit(x_train, y_train)
knc_y_pred = clf3.predict(x_test)
knc_accuracy = accuracy_score(y_test,knc_y_pred)*100
print("accuracy=",knc_accuracy)
Decision Tree Classifier
clf4 = tree.DecisionTreeClassifier()
clf4 = clf4.fit(x_train, y_train)
dtc_y_pred = clf4.predict(x_test)
dtc_accuracy = accuracy_score(y_test,dtc_y_pred)*100
print("accuracy=",dtc_accuracy)
Support Vector Machine
clf5 = svm.SVC()
clf5.fit(x_train, y_train)
svm_y_pred = clf5.predict(x_test)
svm_accuracy = accuracy_score(y_test,svm_y_pred)*100
print("accuracy=",svm_accuracy)
Accuracy Scores of All Classifiers
print("Accuracy of Random Forest Classifier =",rfc_accuracy)
print("Accuracy of Logistic Regressor =",lr_accuracy)
print("Accuracy of K-Neighbor Classifier =",knc_accuracy)
print("Accuracy of Decision Tree Classifier = ",dtc_accuracy)
print("Accuracy of Support Vector Machine Classifier = ",svm_accuracy)
Output:
Accuracy of Random Forest Classifier = 81.56424581005587
Accuracy of Logistic Regressor = 79.88826815642457
Accuracy of K-Neighbors Classifier = 70.39106145251397
Accuracy of Decision Tree Classifier = 80.44692737430168
Accuracy of Support Vector Machine Classifier = 65.36312849162012
We can now rank our models and choose the best one for the problem. While Decision Tree and Random Forest score almost the same, we choose Random Forest because it corrects for a decision tree’s habit of overfitting to its training set.
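A single 80/20 split can be a little noisy, so if you want extra confidence in that choice, cross-validation is a quick check. This sketch goes beyond the original notebook but uses only standard scikit-learn:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for the two front-runners
rf_scores = cross_val_score(RandomForestClassifier(), df_train_x, df_train_y, cv=5)
dt_scores = cross_val_score(tree.DecisionTreeClassifier(), df_train_x, df_train_y, cv=5)
print("Random Forest: %.2f%% (+/- %.2f%%)" % (rf_scores.mean() * 100, rf_scores.std() * 100))
print("Decision Tree: %.2f%% (+/- %.2f%%)" % (dt_scores.mean() * 100, dt_scores.std() * 100))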
Final Prediction with Machine Learning Model
So, now it’s time to use test.csv to make predictions. We need to apply the same preprocessing steps to the test data that we applied to the training data; only then can we predict whether a passenger will survive or not. Hence, I highly encourage you to try all these steps on test.csv by yourself first.
# Importing test.csv
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
test_data.head(10)
test_data.info()
test_data.isnull().sum()
Output:
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
# Replacing missing values of the Age column with the same strategy used for training data
mean = test_data["Age"].mean()
std = test_data["Age"].std()
n_missing = test_data["Age"].isnull().sum()  # 86 in the test set
rand_age = np.random.randint(mean - std, mean + std, size=n_missing)
age_slice = test_data["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
test_data["Age"] = age_slice
# Replacing the single missing value of the Fare column with the mean fare
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
test_data.isnull().sum()
col_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
test_data.drop(col_to_drop, axis=1, inplace=True)
test_data.head(10)
genders = {"male":0, "female":1}
test_data["Sex"] = test_data["Sex"].map(genders)
ports = {"S":0, "C":1, "Q":2}
test_data["Embarked"] = test_data["Embarked"].map(ports)
test_data.head()
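Before predicting, it is worth a quick sanity check that the preprocessed test frame has exactly the same feature columns, in the same order, as the frame the model was trained on (a defensive assertion, not part of the original notebook):
# The model expects the same seven features it was trained on
assert list(test_data.columns) == list(df_train_x.columns), "feature mismatch between train and test"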
Machine Learning Project Submission
# Use the fully preprocessed test set as model input (this replaces the earlier validation x_test)
x_test = test_data
y_pred = clf1.predict(x_test)
# Reload the original file to recover the PassengerId column we dropped earlier
originaltest_data = pd.read_csv('/kaggle/input/titanic/test.csv')
submission = pd.DataFrame({
    "PassengerId": originaltest_data["PassengerId"],
    "Survived": y_pred
})
submission.head(20)
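Finally, Kaggle scores a CSV file, not a DataFrame, so write the predictions out before submitting (the filename is up to you; index=False keeps pandas’ row index out of the file):
# Write the submission file in the format Kaggle expects (PassengerId, Survived)
submission.to_csv('submission.csv', index=False)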
So, this was all about the Titanic survival prediction project. Don’t forget to submit your first notebook on Kaggle under this machine learning challenge. I would love to see your notebook on Kaggle and will surely upvote it. Moreover, don’t forget to check out this notebook, which has the complete code we’ve discussed, all in one place.
Thank you! If you liked this article, leave a comment saying “Nice article!” to motivate us. Keep learning, keep coding!
Try these articles for machine learning basics:-
- Machine Learning: A Gentle Introduction
- Machine Learning Course Description
- ML Environment Setup and Overview
- Jupyter Notebook: The Ultimate Guide
- Numpy For Machine Learning: A Complete Guide
- Python Pandas Tutorial: A Complete Introduction for Beginners
- Matplotlib Python: A Beginner’s Walkthrough
- Seaborn: Create Elegant Plots
- Set up Python Environment
- Linear Regression: Your 1st Step in Machine Learning