Titanic Survival Prediction – Machine Learning Project (Part-1)

Titanic Survival Prediction - Machine Learning Project (Part-1)

Hey, everyone! Today we’re going to work on our first Machine Learning Project and that is “Titanic Survival Prediction”. It’s going to be a really cool project. You’re going to apply all your machine learning skills in this project. Titanic survival prediction is a classification problem where your target is to predict whether the passenger of titanic will survive or not. I hope you’re excited about this project.

So before starting, you need to have lots of things on your system like python, anaconda (jupyter notebook), and many python packages like numpy, pandas, sklearn, etc. But wait, what if I tell you, you don’t need any one of them to make this project. Yes, we are not going to download anything on our system. You just need to sign up on Kaggle. Kaggle is a great platform that holds machine learning competition and provides real-world datasets. It offers a no-setup, customizable, Jupyter Notebooks environment. Now, this article will be more focused on how to think about machine learning projects, rather than just implementation. I suggest you practice the project in Kaggle itself with me.

Table of Contents

  1. Understanding the titanic survival prediction project
  2. Setting up things for any machine learning project.
  3. Understanding the titanic dataset.
  4. Importing necessary libraries & loading dataset
  5. Exploratory Data Analysis
  6. Data Preprocessing
  7. Building machine learning model
  8. Making Predictions
  9. Submission

Titanic Survival Prediction – Kaggle Challenge

This project is based on the Titanic dataset given on Kaggle. It’s a legendary titanic machine learning competition to kickstart your ML journey.  Moreover, the competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

The Titanic Challenge:

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, the widely considered “unsinkable” Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.
Goal: Build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data like age, gender, class, etc.

So, we’re going to solve this challenge from scratch and also submit our notebook on Kaggle under this competition. We’re going to cover everything like how to deal with the data, how to analyze data through different plots and graphs, how to do data preprocessing, and finally how to predict things.

First Machine Learning Project

  1. Firstly, Signup on Kaggle.
  2. Now, Go to this titanic survival challenge.
  3. Click on Code & create New Notebook
assignment advertisement
titanic survival challenge kaggle homepage

After creating new notebook you’ll see the following screen.

kaggle titanic prediction new notebook

Now this is your notebook where we’re going to make our whole project. Don’t delete the given code. Just press shift + enter to run this cell & to start your session. After this you’ll see this.

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv

Congratulations! You have successfully settled up all the required things.

Understanding the Titanic Dataset

So first we will understand our titanic dataset. This is a dataset of Titanic ship passengers & here

  • Each row represents the data of 1 passenger.
  • Columns represent the features. We have 10 features/ variables in this dataset.
  1. Survival: This variable shows whether the person survived or not. This is our target variable & we have to predict its value. It’s a binary variable. 0 means not survived and 1 means survived.
  2. pclass: The ticket class of passengers. 1st  (upper class), 2nd (middle), or 3rd (lower).
  3. Sex: Gender of passenger
  4. Age: Age (in years) of a passenger
  5. sibsp: The no. of siblings/spouses of a particular passenger who were there on the ship.
  6. parch: The no. of parents/children of a particular passenger who were there on the ship.
  7. ticket: Ticket Number
  8. fare: Passenger fare (like 1st class ticket fare must be greater than 2nd pr 3rd class ticket right)
  9. cabin: Cabin Number
  10. embarked: Port of Embarkation; From where that passenger took the ship. ( C = Cherbourg, Q = Queenstown, S = Southampton)

Clearly, it’s a classification problem because we need to tell whether the person will survive or not. We will use 5 different classifiers and compare their accuracy. The 5 different classifiers are as follows:

Let the fun begin!

Importing Libraries & Loading Dataset

You’ll see two options in your notebook Code & Markdown. Markdown are for creating bookmarks/notes. Code is for actual coding.

titanic survival prediction ml project

Click on markdown & write the following line. Press shift + enter.

# **1. Importing Necessary Libraries**

Now we’re going to need the following libraries:

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree,svm
from sklearn.metrics import accuracy_score

After this, we need to load our dataset to start working. To load our dataset we’ll use the read_csv method of pandas library. Observe the first 10 rows of the dataset. (Press Shift + enter to run the cell)

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

# Printing first 10 rows of the dataset
train_data.head(10)
print('The shape of our training set: %s passengers and %s features'%(train_data.shape[0],train_data.shape[1]))
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

As you can see we have 891 entries in total but some of the columns have less than 891 entries so that means we have missing values in these columns namely Age, Cabin & Embarked. So we have to preprocess our data first before training our ml model.

# Checking Null Values
train_data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

There are 177 missing entries in Age column. 687 missing entries are in Cabin column and 2 missing are in Embarked.

Exploratory Data Analysis

Now we will analyze our data to see which variables are actually important to predict the value of the target variable. Hence, we are going to plot a heat map to see the correlation between the parameters and the target variable (Survived).

heatmap = sns.heatmap(train_data[["Survived", "SibSp", "Parch", "Age", "Fare"]].corr(), annot = True)
sns.set(rc={'figure.figsize':(12,10)})
Heat map of titanic dataset

Now see the decimal values in the above color 2D matrix. These values are the correlation values. Just compare the survived column with the rest of the columns. The lighter the color is the more correlated the value is. Let’s compare the Survived with Sibsp you’re getting the value -0.035. It means that SibSp is not correlated to Survived.  Then Parch has a 0.082 value which shows very little correlation. Then Age, again no correlation. In the end, we have Fare whose value of correlation with the Survived variable is 0.26 which shows that the more the fare is, the more are the chances of survival.

Conclusion: But it does not mean that the other features are useless. We’ll explore more about them later.

Moving on, now we will understand all the features one by one. We’ll visualize the impact of each feature on the target variable. Let us start with SibSp that is the no. of siblings or spouses a passenger has.

SibSp – Number of Siblings / Spouses aboard the Titanic

To visualize surviving probability with respect to SibSp we will plot a bar graph.

# Finding unique values
train_data['SibSp'].unique()
bargraph_sibsp = sns.catplot(x = "SibSp", y = "Survived", data = train_data, kind="bar", height = 8)
Titanic Survival Prediction - Machine Learning Project (Part-1) Hey, everyone! Today we’re going to work on our first Machine Learning Project and that is “Titanic Survival Prediction”. It’s going to be a really cool project. You’re going to apply all your machine learning skills in this project. Titanic survival prediction is a classification problem where your target is to predict whether the passenger of titanic will survive or not. I hope you're excited about this project.

Conclusion:

  • Passengers having 1 or 2 siblings have good chances of survival
  • More no. of siblings -> Fewer chances of survival

Age Column

We’ll plot a graph so as to see the distribution of age with respect to target variable.

ageplot = sns.FacetGrid(train_data, col="Survived", height = 7)
ageplot = ageplot.map(sns.distplot, "Age")
ageplot = ageplot.set_ylabels("Survival Probability")
Titanic Survival Prediction - Machine Learning Project (Part-1) Hey, everyone! Today we’re going to work on our first Machine Learning Project and that is “Titanic Survival Prediction”. It’s going to be a really cool project. You’re going to apply all your machine learning skills in this project. Titanic survival prediction is a classification problem where your target is to predict whether the passenger of titanic will survive or not. I hope you're excited about this project.

Conclusion More age -> less chances of survival!

Gender Column

For gender we are simply going to use seaborn and will plot a bar graph.

sexplot = sns.barplot(x="Sex", y="Survived", data=train_data)
Titanic Survival Prediction - Machine Learning Project (Part-1) Hey, everyone! Today we’re going to work on our first Machine Learning Project and that is “Titanic Survival Prediction”. It’s going to be a really cool project. You’re going to apply all your machine learning skills in this project. Titanic survival prediction is a classification problem where your target is to predict whether the passenger of titanic will survive or not. I hope you're excited about this project.

You can see from the above graph it’s quite obvious to say that man has less chances of survival over females. (Remember the Titanic scene when everyone was saying “Women and children first!” I want you to comment down “YES!” if you remember this scene.)

Pclass Column

Let us now see whether the class plays any role in survival probability or not.

pclassplot = sns.catplot(x = "Pclass", y="Survived", data = train_data, kind="bar", height = 6)
Titanic Survival Prediction - Machine Learning Project (Part-1) Hey, everyone! Today we’re going to work on our first Machine Learning Project and that is “Titanic Survival Prediction”. It’s going to be a really cool project. You’re going to apply all your machine learning skills in this project. Titanic survival prediction is a classification problem where your target is to predict whether the passenger of titanic will survive or not. I hope you're excited about this project.

So a first class passenger has more chances of survival over 2nd and 3rd class passengers & Similarly the 2nd class passengers have more chances of survival over 3rd class passengers.

That’s all about data exploratory analysis. Now we have a good idea about our data. And one more thing to notice here is there are some features which have nothing to do with survival probability like PassengerId, Ticket number, Cabin number and also the name of the passenger. So we can safely drop them before building our ml model. Moreover, we also need to handle missing values. So, all of these tasks come under Data Preprocessing.

We’re going to cover further topics in the next part till then keep learning & keep coding! Make sure you drop a comment “Nice article!” if you like this article.

Here’s the complete code we’ve discussed so far in this article: Kaggle Notebook.


Try these articles for machine learning basics:-


  • Machine Learning: A Gentle Introduction
    Introduction to Machine Learning Machine Learning is probably one of the most interesting and hyped branches of computer science. The thing that separates humans from machines is the fact that humans learn from their experiences. But is it possible to make a machine learn? And The answer is Yes! It is possible through Machine Learning….
  • Machine Learning Course Description
    Before you start, let me give you an overview of what this series has to offer you. Our machine learning course series comprises of the following sections:- ML Environment Setup and Overview Jupyter Notebook: The Ultimate Guide Numpy Pandas Matplotlib Seaborn Sklearn Linear Regression Logistic Regression Decision Tree Random Forest Support Vector Machine K Nearest…
  • ML Environment Setup and Overview
    Introduction to Machine Learning In this article, you will learn about the ML Environment Setup, Machine Learning terminology, its paradigms, and a tutorial to help you set up your machine so you can code what you learn. Before we start with our ML Environment Setup, read this article to get an overview of machine learning….
  • Jupyter Notebook: The Ultimate Guide
    Introduction to Jupyter Notebook Whenever one starts programming the first aim of that person is to find an IDE that suits his/her needs. In ML there are times when you’ll want to keep a check on your data after doing a change. But in code editors like Vim, Vscode, etc. you have to run your…
  • Numpy For Machine Learning: A Complete Guide
    Up until now you’ve learned about the general idea of what ML does, set up your environment, and got to know about the working of your coding environment i.e. Jupyter Notebook. In this section, you’ll learn about a very powerful library called Numpy. We’ll learn about Numpy Array(np array for short) and operations on them,…
  • Python Pandas Tutorial: A Complete Introduction for Beginners
    In the previous section, we learned about Numpy and how we can use it to load, save, and pre-process data easily by using Numpy Arrays. Now Numpy is a great library to do data preprocessing but I’d like to tell you all about another wonderful Python library called Pandas. At the end of this tutorial,…
  • Matplotlib Python: A Beginner’s Walkthrough
    We know how to analyze data by analyzing the statistics of the data and we’ve learned how to manipulate the data. But is statistics enough to analyze the data? Short answer, Visualization of data is necessary in order to find details that we missed that’s why Matplotlib Python is the best library to visualize data…
  • Seaborn: Create Elegant Plots
    In the previous tutorial, we learned why data visualization is important and how we can create plots using matplotlib. In this tutorial, we’ll learn about another data visualization library called Seaborn, which is built on top of matplotlib. But why do we need seaborn if we have matplotlib? Using seaborn you can make plots that…
  • Set up Python Environment
    Now, it’s time to install the tools that we will use to write programs. So, we will be learning to Set up Python Environment in this article. Let’s start. 1. Installing Python first. First, we need to go to the official site of python: https://www.python.org/ Now we need to go to the downloads page of…
  • Linear Regression: Your 1st Step in Machine Learning
    Hi guys! So until now, we’ve learned about how we can use libraries to play with data. We did data analysis on a real dataset and we also learned how to visualize data. But what was the purpose behind it? Why do so many things? What are we trying to achieve? I’ll tell you all…

Share:
Avatar of Umang Aggarwal

Author: Umang Aggarwal