Diabetes prediction using Machine Learning

Diabetes prediction using Machine Learning

In this article, we are going to build a project on Diabetes Prediction using Machine Learning. Machine Learning is very useful in the medical field to detect many diseases in their early stage. Diabetes prediction is one such Machine Learning model which helps to detect diabetes in humans. Also, we will see how to Deploy a Machine Learning model using Streamlit. Let’s get started!

About Diabetes Prediction using Machine Learning Project

Project Name:Diabetes prediction using Machine Learning
AbstractIt’s an ML-based Project which involves most of the ML steps like Collection of data, Exploring the data, Splitting of data, etc
Language/s Used:Python
IDEGoogle Colab(Recommended)
Python version (Recommended):Python 3.7
Database:Not required
Recommended forFinal Year Students

Steps involved in Diabetes Prediction Project

The workflow to build the end-to-end Machine Learning Project to predict Diabetes is as follows-

  1. Collection of data
  2. Exploring the data
  3. Splitting the data
  4. Training the model
  5. Evaluating the model
  6. Deploying the model

1. Data collection

The very first step is to choose the dataset for our model. We can get a lot of different datasets from Kaggle. You just need to sign in to Kaggle and search for any dataset you need for the project. The Diabetes dataset required for our model can be downloaded here.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict whether a patient has diabetes based on diagnostic measurements. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The data contains 9 columns which are as follows

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1)

2. Exploring the Data

Now we have to set the development environment to build our project. For this project, we are going to build this Diabetes prediction using Machine Learning in Google Colab. You can also use Jupyter Notebook.

After downloading the dataset, import the necessary libraries to build the model.

# Import the required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
import pickle

Load the data using the read_csv method in the pandas library. Then the head() method in the pandas library is used to print the rows up to the limit we specify. The default number of rows is five.

# Load the diabetes dataset to a pandas DataFrame
diabetes_dataset = pd.read_csv('diabetes.csv') 

# Print the first 5 rows of the dataset


Printing the first 5 rows of the dataset
# To get the number of rows and columns in the dataset
#prints (768, 9)

# To get the statistical measures of the data


statistical measures of the data in Diabetes prediction using Machine Learning

And, it is clear that the Outcome column is the output variable. So let us explore more details about that column.

# To get details of the outcome column


details of the outcome column

In the output, the value 1 means the person is having Diabetes, and 0 means the person is not having Diabetes. We can see the total count of people with and without Diabetes.

3. Splitting the data

The next step in the building of the Machine learning model is splitting the data into training and testing sets. The training and testing data should be split in a ratio of 3:1 for better prediction results.

# separating the data and labels
X = diabetes_dataset.drop(columns = 'Outcome', axis=1)
Y = diabetes_dataset['Outcome']

# To print the independent variables


print the independent variables
# To print the outcome variable


printing the outcome variable
#Split the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, stratify=Y, random_state=2)
print(X.shape, X_train.shape, X_test.shape)


(768, 8) (614, 8) (154, 8)

4. Training the model

The next step is to build and train our model. We are going to use a Support vector classifier algorithm to build our model.

# Build the model
classifier = svm.SVC(kernel='linear')

# Train the support vector Machine Classifier
classifier.fit(X_train, Y_train)

After building the model, the model has to predict output with test data. After the prediction of the outcome with test data, we can calculate the accuracy score of the prediction results by the model.

# Accuracy score on the training data
X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)

print('Accuracy score of the training data : ', training_data_accuracy)

# Accuracy score on the test data
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)

print('Accuracy score of the test data : ', test_data_accuracy)


Accuracy score of the training data: 0.7833876221498371
Accuracy score of the test data: 0.7727272727272727

5. Evaluating the model

input_data = (5,166,72,19,175,25.8,0.587,51)

# Change the input_data to numpy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the array for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

prediction = classifier.predict(input_data_reshaped)

if (prediction[0] == 0):
  print('The person is not diabetic')
  print('The person is diabetic')


The person is diabetic

Saving the file

# Save the trained model
filename = 'trained_model.sav'
pickle.dump(classifier, open(filename, 'wb'))

# Load the saved model
loaded_model = pickle.load(open('trained_model.sav', 'rb'))

Once you run this code a new file named trained_model.sav will be saved in the project folder.

Deploying the model

One of the most important and final steps in building a Machine Learning project is Model deployment. There are many frameworks available for deploying the Machine learning model on the web. Some of the most used Python frameworks are Django and Flask. But these frameworks require a little knowledge of languages such as HTML, CSS, and JavaScript.

So, a new framework known as Streamlit was introduced to deploy the Machine Learning model without the need to have the knowledge of Front End Languages. It is quite easy to deploy using Streamlit. So, we will use the Streamlit framework to deploy our model. Although Streamlit has many advantages over the other frameworks, lot more features are under development. If you are getting started in Machine Learning then this framework will be a perfect start to deploy your machine learning model on the web.

Python Code to Deploy ML model using Streamlit

To install Streamlit run the following command in the command prompt or terminal.

pip install streamlit

Open a new Python file and put the following code.


import numpy as np
import pickle
import streamlit as st

# Load the saved model
loaded_model = pickle.load(open('C:/Users/ELCOT/Downloads/trained_model.sav', 'rb'))

# Create a function for Prediction
def diabetes_prediction(input_data):

    # Change the input_data to numpy array
    input_data_as_numpy_array = np.asarray(input_data)

    # Reshape the array as we are predicting for one instance
    input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
    prediction = loaded_model.predict(input_data_reshaped)

    if (prediction[0] == 0):
      return 'The person is not diabetic'
      return 'The person is diabetic'
def main():
    # Give a title
    st.title('Diabetes Prediction Web App')
    # To get the input data from the user
    Pregnancies = st.text_input('Number of Pregnancies')
    Glucose = st.text_input('Glucose Level')
    BloodPressure = st.text_input('Blood Pressure value')
    SkinThickness = st.text_input('Skin Thickness value')
    Insulin = st.text_input('Insulin Level')
    BMI = st.text_input('BMI value')
    DiabetesPedigreeFunction = st.text_input('Diabetes Pedigree Function value')
    Age = st.text_input('Age of the Person')
    # Code for Prediction
    diagnosis = ''
    # Create a button for Prediction
    if st.button('Diabetes Test Result'):
        diagnosis = diabetes_prediction([Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age])
if __name__ == '__main__':

Save the file after pasting the code. And then to deploy using streamlit go to command prompt and run the following command.

streamlit run App.py
streamlit run filename.py
Command Prompt output for Python Code to Deploy ML model using Streamlit

After running the command the web app will open in the localhost webserver. Otherwise, go to your browser and type localhost:8501. The following output will be shown.


Browser output for Python Code to Deploy ML model using Streamlit

Sample Input data for a person does not have diabetes is {1, 85, 66, 29, 0, 26.6, 0.351, 31}. These data as input will generate the following output in the web app.

Output 1

Sample input data for a person who have diabetes is {6, 148, 72, 35, 0, 33.6, 0.627, 50}. These data as input will generate the following output in the web app.

Output 2


In this article, we learned how to build a project on Diabetes Prediction using Machine Learning(with all 5 proper ML steps) and deploy it using Streamlit. Hope you enjoyed doing this project. Happy Learning!

Also Read:


Author: Ayush Purawr