In Part 1, we covered how to get started with your first machine learning project on Kaggle. We saw how to import the necessary libraries, how to load the data, and how to do exploratory data analysis to understand it. Now we will move on to preprocessing the Titanic dataset. Data preprocessing is a crucial step in any machine learning project because the accuracy of a model depends heavily on the quality of the data it is trained on. It is therefore necessary to clean the data before building the model. In a nutshell, these are the basic steps of any machine learning project.
Steps of Machine Learning Project
- Importing libraries and loading the dataset (Part 1)
- Exploratory data analysis with visualizations (Part 1)
- Data Preprocessing
- Building machine learning models
- Prediction
Let’s start with data preprocessing now.
Data Preprocessing
First, let’s see how many null (missing) values there are in each column. For this, we’re going to chain the isnull() and sum() methods. Write the following line in a new code cell and press Shift + Enter to run it.
train_data.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
As we can see, there are 177 missing values in the Age column and 687 missing values in the Cabin column. We have to deal with these missing values in order to build a good machine learning model. Let us start with the Age column.
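As a quick illustration of how such a count is produced, here is a minimal sketch on a tiny made-up DataFrame. The `toy` frame and its values are hypothetical and are not the Titanic data. `isnull()` marks each missing cell as True, and `sum()` adds those flags up per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train_data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# isnull() gives a True/False frame; sum() counts the True values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Share of missing values per column, as a percentage
null_percent = toy.isnull().mean() * 100
print(null_percent)
```

Looking at the percentage as well as the raw count can help when deciding whether a column is worth repairing or is better dropped.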

Handling Missing Values of Age Column
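To handle the null values, we are going to fill them with random integers drawn from the range [mean - standard deviation, mean + standard deviation). np.random.randint samples uniformly and excludes the upper bound, so the filled values are not normally distributed. They do stay within one standard deviation of the mean, and they avoid piling every missing age onto a single value, as filling with the mean alone would.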
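mean = train_data["Age"].mean()
std = train_data["Age"].std()
# Count the missing ages instead of hardcoding 177
n_missing = train_data["Age"].isnull().sum()
# Pass integer bounds to randint explicitly; the upper bound is exclusive
rand_age = np.random.randint(int(mean - std), int(mean + std), size=n_missing)
age_slice = train_data["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
train_data["Age"] = age_slice
# Again checking for null values
train_data.isnull().sum()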
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
dtype: int64
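Since the same fill is needed again for test.csv later on, one option is to wrap it in a small function. The sketch below is only one way to do that: the helper name `fill_age_randomly`, the toy data, and the use of a seeded `np.random.default_rng` generator (so that reruns give the same values) are assumptions of this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_age_randomly(df, rng):
    """Return a copy of df whose missing Age values are replaced by random
    integers drawn uniformly from [mean - std, mean + std)."""
    out = df.copy()
    mean = out["Age"].mean()
    std = out["Age"].std()
    missing = out["Age"].isnull()
    # Integer bounds; the upper bound is exclusive
    low, high = int(mean - std), int(mean + std)
    out.loc[missing, "Age"] = rng.integers(low, high, size=missing.sum())
    return out

# Hypothetical toy data standing in for train_data
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 54.0, 2.0, 27.0]})
rng = np.random.default_rng(42)  # seeded so reruns give the same fill values
filled = fill_age_randomly(toy, rng)
print(filled["Age"].isnull().sum())
```

The function returns a new frame and leaves its input untouched, so it can be called once for the training data and once for the test data.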
The null check confirms that there are no missing values in the Age column now. But what about the 687 null values in the Cabin column? We don’t need to do anything about them, because we are not going to use that column. Features such as PassengerId, the ticket number, the passenger’s name, and the cabin number are mostly identifiers that contribute little to a simple model in their raw form, so we drop them before building our ML model.
Dropping 🗑️ Columns
col_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
train_data.drop(col_to_drop, axis=1, inplace=True)
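Because `inplace=True` changes train_data directly, running that cell a second time raises a KeyError: the columns are already gone. A small sketch of a variant that can be rerun safely, using made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for train_data
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["A", "B"],
    "Ticket": ["T1", "T2"],
    "Cabin": [None, "C85"],
    "Pclass": [3, 1],
})

col_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
# drop() returns a new frame; errors="ignore" skips columns that are
# already gone, so repeating the line does not raise a KeyError
trimmed = toy.drop(columns=col_to_drop, errors="ignore")
trimmed = trimmed.drop(columns=col_to_drop, errors="ignore")
print(trimmed.columns.tolist())
```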
Converting Categorical Variables to Numeric
We have two categorical variables left, Sex and Embarked. The scikit-learn models we use below expect numeric input rather than text, so we have to convert these columns to numbers.
genders = {"male":0, "female":1}
train_data["Sex"] = train_data["Sex"].map(genders)
ports = {"S":0, "C":1, "Q":2}
train_data["Embarked"] = train_data["Embarked"].map(ports)
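One thing to keep in mind with `map`: any value that is not a key in the dictionary, including a value that is already missing, comes out as NaN. The sketch below uses made-up data to show this. Filling missing ports with the most frequent one is an assumption of this sketch, not a step from the walkthrough above, so check the null counts of your own data first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing port
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

genders = {"male": 0, "female": 1}
ports = {"S": 0, "C": 1, "Q": 2}

# Without any handling, the missing port is still NaN after map()
unfilled = toy["Embarked"].map(ports)

# Assumption: replace missing ports with the most common one before mapping
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port).map(ports)
toy["Sex"] = toy["Sex"].map(genders)
print(toy)
```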
Building Machine Learning Model 🤖
That was all for data preprocessing. Our Titanic dataset is now ready, so let’s train our machine learning models.
# Feature columns
df_train_x = train_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
# Target variable column, selected as a 1-D Series to avoid scikit-learn's column-vector warning
df_train_y = train_data['Survived']
# Train/test split: hold out 20% of the rows for evaluation
x_train, x_test, y_train, y_test = train_test_split(df_train_x, df_train_y, test_size=0.20, random_state=42)
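To get a feel for what the split produces, here is a minimal sketch on stand-in data with 891 rows, the number of rows in the Titanic train.csv. The feature and target values are made up; only the row count matters here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as train.csv
n_rows = 891
X = pd.DataFrame({"feature": np.arange(n_rows)})
y = pd.Series(np.arange(n_rows) % 2, name="Survived")

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# The number of test rows is rounded up, the rest go to training
print(x_train.shape, x_test.shape)
```

Fixing `random_state` makes the split reproducible. If you want both parts to keep a similar ratio of survivors, `train_test_split` also accepts a `stratify` argument.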
Lastly, we are going to train five different classifiers, namely Random Forest Classifier, Logistic Regression, K-Neighbor Classifier, Decision Tree Classifier, and Support Vector Machine, and then compare them.
Random Forest Classifier
# Creating the classifier instance
clf1 = RandomForestClassifier()
# Fitting the model using training data
clf1.fit(x_train, y_train)
# Predicting on test data
rfc_y_pred = clf1.predict(x_test)
# Calculating Accuracy to compare all models
rfc_accuracy = accuracy_score(y_test,rfc_y_pred) * 100
Logistic Regression
clf2 = LogisticRegression()
clf2.fit(x_train, y_train)
lr_y_pred = clf2.predict(x_test)
lr_accuracy = accuracy_score(y_test,lr_y_pred)*100
K-Neighbor Classifier
clf3 = KNeighborsClassifier(n_neighbors=5)
clf3.fit(x_train, y_train)
knc_y_pred = clf3.predict(x_test)
knc_accuracy = accuracy_score(y_test,knc_y_pred)*100
Decision Tree Classifier
clf4 = tree.DecisionTreeClassifier()
clf4.fit(x_train, y_train)
dtc_y_pred = clf4.predict(x_test)
dtc_accuracy = accuracy_score(y_test,dtc_y_pred)*100
Support Vector Machine
clf5 = svm.SVC()
clf5.fit(x_train, y_train)
svm_y_pred = clf5.predict(x_test)
svm_accuracy = accuracy_score(y_test,svm_y_pred)*100
Accuracy Scores of All Classifiers
print("Accuracy of Random Forest Classifier =",rfc_accuracy)
print("Accuracy of Logistic Regressor =",lr_accuracy)
print("Accuracy of K-Neighbor Classifier =",knc_accuracy)
print("Accuracy of Decision Tree Classifier = ",dtc_accuracy)
print("Accuracy of Support Vector Machine Classifier = ",svm_accuracy)
Accuracy of Random Forest Classifier = 81.56424581005587
Accuracy of Logistic Regressor = 79.88826815642457
Accuracy of K-Neighbor Classifier = 70.39106145251397
Accuracy of Decision Tree Classifier = 80.44692737430168
Accuracy of Support Vector Machine Classifier = 65.36312849162012
We can now rank the models by accuracy to choose the best one for our problem. Random Forest scores highest, with Decision Tree close behind. We choose Random Forest because it averages many trees, which corrects for a single decision tree’s habit of overfitting to its training set.
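The five fit, predict, and score blocks above all follow the same pattern, so they can also be written as one loop. The sketch below is a possible arrangement, not the original notebook's code. It runs on synthetic data from `make_classification` rather than the Titanic set, and the `models` and `accuracies` dictionaries, the fixed `random_state` values, and `max_iter=1000` are assumptions of this example.

```python
from sklearn import svm, tree
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the Titanic set): 500 rows, 7 numeric features
X, y = make_classification(n_samples=500, n_features=7, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

models = {
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Neighbor Classifier": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree Classifier": tree.DecisionTreeClassifier(random_state=42),
    "Support Vector Machine": svm.SVC(),
}

# Fit each model on the training part and score it on the held-out part
accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(x_test)) * 100

# Print from best to worst
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"Accuracy of {name} = {acc:.2f}")
```

K-nearest neighbors and support vector machines both rely on distances between rows, so a feature with a large numeric range, such as Fare, can dominate the others. Scaling the features before fitting those two models is a common remedy worth trying.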
Final Prediction with Machine Learning Model
Now it’s time to use test.csv for making predictions. The test data needs the same preprocessing steps that we applied to the training data; only then can we predict whether a passenger survived. I highly encourage you to work through these steps for test.csv by yourself first.
# Importing test.csv
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
# Checking for null values
test_data.isnull().sum()
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
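# Replacing missing values of age column
mean = test_data["Age"].mean()
std = test_data["Age"].std()
# Count the missing ages instead of hardcoding 86
n_missing = test_data["Age"].isnull().sum()
rand_age = np.random.randint(int(mean - std), int(mean + std), size=n_missing)
age_slice = test_data["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
test_data["Age"] = age_slice
# Replacing missing value of Fare column (assign the result back rather than calling
# fillna(inplace=True) on a column selection, which newer pandas versions warn about)
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())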
col_to_drop = ["PassengerId", "Ticket", "Cabin", "Name"]
test_data.drop(col_to_drop, axis=1, inplace=True)
genders = {"male":0, "female":1}
test_data["Sex"] = test_data["Sex"].map(genders)
ports = {"S":0, "C":1, "Q":2}
test_data["Embarked"] = test_data["Embarked"].map(ports)
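Before predicting, it can be worth confirming that the preprocessed test frame has the same feature columns, in the same order, as the data the model was trained on, and that no NaN values remain. Recent scikit-learn versions check feature names when a model was fitted on a DataFrame. A minimal sketch on made-up rows (`toy_test` is hypothetical, and `feature_cols` repeats the feature list used for training):

```python
import pandas as pd

feature_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

# Hypothetical preprocessed test rows, with columns in a different order
toy_test = pd.DataFrame({
    "Fare": [7.83, 7.0],
    "Pclass": [3, 3],
    "Sex": [0, 1],
    "Age": [34.5, 47.0],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Embarked": [2, 0],
})

# Fail early if a feature is missing or a NaN slipped through
missing_cols = [c for c in feature_cols if c not in toy_test.columns]
assert not missing_cols, f"missing columns: {missing_cols}"
assert not toy_test[feature_cols].isnull().any().any(), "NaN values remain"

# Reorder the columns to match the training features
aligned = toy_test[feature_cols]
print(aligned.columns.tolist())
```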
Machine Learning Project Submission
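# Features for the final prediction (already preprocessed above)
x_test = test_data
y_pred = clf1.predict(x_test)
# Re-read the original file to recover the PassengerId column we dropped
originaltest_data = pd.read_csv('/kaggle/input/titanic/test.csv')
submission = pd.DataFrame({
    "PassengerId": originaltest_data["PassengerId"],
    "Survived": y_pred
})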
That was all for the Titanic survival prediction project. Don’t forget to submit your first notebook on Kaggle under this machine learning challenge. I would love to see your notebook on Kaggle and will surely upvote it. Also, check out this notebook, which has the complete code we’ve discussed in one place.
