Customer Behaviour Analysis – Machine Learning and Python

A company succeeds because of its customers. Understanding customers’ needs and fulfilling them through its products is the aim of the company.
Most successful businesses reached their heights by knowing their customers’ needs and dynamically adapting their strategies and development processes.

Customer Behaviour Analysis is therefore essential for any company or firm. It is a qualitative and quantitative observation of how customers interact with the company. It helps ensure that customers’ needs and desires are met, which contributes to customer loyalty and retention.

What is Customer Analysis?

Customer analytics is critical for gaining a thorough understanding of customers’ purchasing habits, usage trends, demographic distribution, and profitability. Organizations must devote a significant amount of time and resources to learning about their consumers and analyzing the data created by their interactions with them.

How can customer behaviour analysis be carried out?

On a small scale, or where there is very little data, this can be done manually by inspecting the data.

In cases where the customer dataset is huge, and it is almost impossible to look at individual rows manually and find patterns in the data, Machine Learning comes to the rescue!

Machine Learning offers several ready-made algorithms and models that make the analysis much easier.

Data scientists have designed and used various frameworks to carry out this analysis. One of them is STP.


STP stands for Segmentation, Targeting, and Positioning. It is a three-step marketing framework. With the STP process, it is easy to segment the market, target the customers, and position the offering to each segment.

STP in Customer Behaviour Analysis


Segmentation is the process of dividing a population of potential or existing customers into groups that share similar characteristics. Customers within the same segment tend to have comparable purchase behaviours and are likely to respond similarly to marketing activities.


Targeting deals with evaluating the potential profit from each segment and deciding which segments to focus on. It also involves weighing the factors that determine whether to address a whole segment or only part of it.


After deciding which segments to target, positioning is next. Positioning in marketing is a strategic process that entails developing an identity for a brand or product in the minds of potential buyers.

Customer segmentation with the STP framework can be carried out in Python using PCA, Hierarchical Clustering, and the K-Means algorithm, with the help of libraries such as NumPy, Matplotlib, Seaborn, SciPy, and scikit-learn (which provides StandardScaler).
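As a rough illustration of how these pieces can fit together, here is a minimal sketch that standardizes the data, reduces it with PCA, and then clusters it with both hierarchical (Ward) clustering and K-Means. The customer table is synthetic and its three columns (age, income, spending score) are assumptions made for this example only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical customers: [age, annual income, spending score].
rng = np.random.default_rng(0)
customers = np.vstack([
    rng.normal(loc=[25, 30_000, 70], scale=[3, 4_000, 5], size=(50, 3)),
    rng.normal(loc=[55, 90_000, 30], scale=[3, 4_000, 5], size=(50, 3)),
])

# 1. Standardize so that income (large numbers) does not dominate the distances.
scaled = StandardScaler().fit_transform(customers)

# 2. Project the three features onto two principal components.
components = PCA(n_components=2).fit_transform(scaled)

# 3a. Hierarchical clustering with Ward linkage, cut into two flat clusters.
ward_labels = fcluster(linkage(components, method="ward"), t=2, criterion="maxclust")

# 3b. K-Means with the same number of clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(components)

print("Ward cluster sizes:   ", np.bincount(ward_labels)[1:])
print("K-Means cluster sizes:", np.bincount(kmeans_labels))
```

On real data the number of clusters is not known in advance, so a dendrogram or an elbow plot would normally be inspected before fixing it.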


RFM segmentation enables marketers to target specific groups of consumers with communications that are far more relevant to their individual behaviors. This practice results in much greater response rates and improved loyalty and customer lifetime value.

RFM segmentation is an effective tool to identify groups of consumers who should be treated differently. RFM stands for Recency, Frequency, and Monetary.

The advantage of RFM over other segmentation models is that it employs objective numerical scales to produce a high-level picture of consumers that is both succinct and instructive. Marketers can also use it without expensive tools. Most importantly, the output of the segmentation is simple to comprehend and analyze.
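To make the three RFM values concrete, the following sketch derives them from a tiny transaction log with pandas. The table, its column names (customer_id, order_date, amount), and the choice of measuring recency from the day after the last order are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical transaction log; the column names are assumptions.
transactions = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2023-11-20",
        "2024-02-10", "2024-02-25", "2024-03-10",
    ]),
    "amount": [120.0, 80.0, 40.0, 200.0, 150.0, 50.0],
})

# Reference date for recency: one day after the most recent order in the log.
snapshot = transactions["order_date"].max() + pd.Timedelta(days=1)

rfm = transactions.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),  # days since last order
    frequency=("order_date", "count"),                            # number of orders
    monetary=("amount", "sum"),                                   # total amount spent
)
print(rfm)
```

Customer C (bought yesterday, three orders, highest spend) and customer B (one small order months ago) end up at opposite ends of all three scales, which is the kind of contrast RFM segmentation is meant to surface.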


Clustering is an unsupervised learning method. It is a technique in which a set of data is partitioned into different clusters. There are many popular clustering algorithms, such as K-means, Mean Shift, BIRCH, OPTICS, and Spectral Clustering. Each algorithm offers a different approach to the challenge of discovering natural groups in data.

The clusters should differ significantly from each other, while the data elements inside the same cluster should be similar. Distance-based clustering groups the points into some number of clusters such that distances within a cluster are small while distances between clusters are large.

K-means, one of the most popular clustering algorithms and readily available in Python, can be used to categorize data into clusters.

For example, if you want to target a particular type of customer for a new product, you can cluster the data into the corresponding categories and look for patterns among them. You can then analyze the output of the various algorithms, or of PCA, to arrive at a solution.
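A small sketch of that idea, on synthetic data: K-means is run for several values of k, the silhouette score (which rewards small within-cluster and large between-cluster distances) picks the number of clusters, and the cluster with the highest average spend is selected as the target group. The features, the group layout, and the "highest spend" targeting rule are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic, hypothetical customers: [age, monthly spend], in three artificial groups.
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal([25, 200], [2, 15], size=(40, 2)),
    rng.normal([40, 600], [2, 15], size=(40, 2)),
    rng.normal([60, 1000], [2, 15], size=(40, 2)),
])

# Try several cluster counts and keep the one with the best silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)
best_k = max(scores, key=scores.get)

# Fit the final model and pick the cluster whose centre has the highest spend.
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(data)
centers = model.cluster_centers_
target_cluster = int(np.argmax(centers[:, 1]))
target_customers = data[model.labels_ == target_cluster]

print("silhouette by k:", {k: round(s, 3) for k, s in scores.items()})
print("chosen k:", best_k)
print("target cluster centre [age, spend]:", centers[target_cluster].round(1))
```

With real customer data the features would usually be standardized first, since K-means relies on raw distances.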

Logistic Regression

Logistic Regression is a linear classifier that predicts the outcome as a categorical value. In its basic form it predicts binary outcomes such as True or False, 1 or 0, Human or Animal, etc.

Using logistic regression, we can predict whether a customer is likely to be interested in a particular product.
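For intuition, logistic regression computes a linear score from the features and passes it through the logistic (sigmoid) function to obtain a probability, which is then thresholded to give the class. The sketch below shows that calculation by hand; the weights, bias, and input values are made up for illustration and are not taken from any trained model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for two standardized features.
weights = np.array([2.0, 1.0])
bias = -0.5
x = np.array([1.0, 0.5])  # one customer's standardized age and salary

score = weights @ x + bias            # linear part: w . x + b
probability = sigmoid(score)          # probability of class 1
prediction = int(probability >= 0.5)  # threshold at 0.5

print(f"score={score:.2f}, probability={probability:.3f}, prediction={prediction}")
```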

To understand this in a little more depth, we will take a problem statement and work through it to produce an outcome related to customer behaviour analysis.

Problem Statement:

A company wants to find out whether its existing or new customers are interested in buying a car. We have to predict this from the given dataset.

The dataset:

You can download the dataset from here:

The dataset contains UserID, Gender, Age, Salary, and a binary variable Purchased. UserID, Gender, Age, and Salary are the independent variables, whereas Purchased, which records whether the customer bought the product, is the dependent variable.


The marketing team can then target only those customers who are likely to buy the product.

Hence, we will predict 0 or 1 as outcomes representing whether the customer will buy a car or not.

Let’s begin:

Import the libraries:

To preprocess data we’ll need,

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Import the dataset:

We need age and estimated salary as our independent variables, so we’ll use only those two columns as X.

y is the dependent variable ‘Purchased’, which has to be predicted.

dataset = pd.read_csv('Social_Network_Ads.csv')

X = dataset.iloc[:, 2:4].values 
y = dataset.iloc[:, -1].values

Split the dataset into train and test:

We will use scikit-learn’s train_test_split function to split our dataset into training and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

We can inspect the training data:

[Figure: Output 1, X_train]

The training set contains a large number of rows.

Similarly, we can check X_test, y_train, and y_test.

[Figure: Output 2, X_test]
[Figure: Output 3, y_train]
[Figure: Output 4, y_test]
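A quick way to sanity-check a split like this is to print the shapes and a few rows. The sketch below uses a randomly generated stand-in for the dataset (400 hypothetical customers with age and salary), because the real CSV is not bundled here; only the train_test_split call mirrors the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data, NOT the real dataset: 400 hypothetical [age, salary] rows.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 61, size=400),
    rng.integers(15_000, 150_001, size=400),
])
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print("X_train:", X_train.shape, "X_test:", X_test.shape)
print("first training rows:\n", X_train[:5])
print("class counts (train):", np.bincount(y_train))
print("class counts (test): ", np.bincount(y_test))
```

With test_size = 0.25, a quarter of the rows go to the test set and the remaining three quarters to the training set.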

Feature Scaling:

What is feature scaling?

Feature Scaling is a data pre-processing step applied to the independent variables, or features, of the data. It brings the features onto a comparable scale. Sometimes, it also helps speed up the calculations in an algorithm.

Here, scaling is not strictly required, but it helps improve training performance.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Here, we are standardizing the train and test data. The scaler is fitted on the training data only, and the same transformation is then applied to the test data.

[Figure: X_train after normalization]
[Figure: X_test after normalization]
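To see what StandardScaler actually does, the sketch below applies it to three made-up training rows and reproduces the result by hand: each column has its training mean subtracted and is divided by its training standard deviation. The numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [age, salary] rows, for illustration only.
X_train = np.array([[20.0, 20_000.0], [30.0, 40_000.0], [40.0, 60_000.0]])
X_test = np.array([[30.0, 80_000.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean/std from the training data
X_test_scaled = sc.transform(X_test)        # reuses the training mean/std

# The same transformation written out by hand.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
assert np.allclose(manual, X_train_scaled)

print("scaled train means:", X_train_scaled.mean(axis=0))  # approximately 0
print("scaled train stds: ", X_train_scaled.std(axis=0))   # 1 for each column
print("scaled test row:   ", X_test_scaled)
```

Note that the test row is scaled with the statistics of the training set, so its values are not guaranteed to have zero mean or unit variance.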

Build Model with Logistic Regression:

Python’s scikit-learn library ships with powerful models built in, so there is no need to create anything from scratch. Just import the model class and fit it to the training data.

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
# Train the model on the scaled training data
classifier.fit(X_train, y_train)
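As an alternative way to organize the same steps (not the approach used in this article), scaling and the classifier can be bundled into a single scikit-learn pipeline, so that new inputs are scaled automatically before prediction. The sketch below trains on synthetic data; the labelling rule and the helper purchase_probability are hypothetical and exist only to make the example runnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 400 hypothetical customers.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
# Made-up rule for the labels: older, better-paid customers buy more often.
signal = (age - 18) / 42 + (salary - 15_000) / 135_000 + rng.normal(0, 0.2, size=400)
purchased = (signal > 1.0).astype(int)
X = np.column_stack([age, salary])

# Scaling and classification in one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
model.fit(X, purchased)

def purchase_probability(age, salary):
    """Return the estimated probability of class 1 (purchase) for one customer."""
    return model.predict_proba([[age, salary]])[0, 1]

print(f"age 58, salary 140000: {purchase_probability(58, 140_000):.2f}")
print(f"age 20, salary 20000:  {purchase_probability(20, 20_000):.2f}")
```

predict_proba returns a probability rather than a hard 0 or 1, which can be useful when ranking customers by how likely they are to buy.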

Test result prediction:

Let’s predict the result for the test data.

We will compare this prediction with y_test, i.e. the original results, to see how our model is doing.

y_pred = classifier.predict(X_test)
[Figure: model prediction]

The predictions look quite good. But how good?

Accuracy score with Confusion Matrix:

We’ll use the confusion matrix utility to calculate the accuracy of this model.

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
[Figure: confusion matrix]

Here, on the test data, 65 customers of class 0 (not buy) were predicted correctly and 3 were incorrectly predicted as class 1 (buy), while 8 customers of class 1 were incorrectly predicted as class 0 and 24 were predicted correctly as class 1.

Let’s calculate the accuracy score:

accuracy_score(y_test, y_pred)

Hurray! Our model is 89% accurate on the test data.
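The accuracy can be checked by hand from the four confusion-matrix counts quoted above (65, 3, 8, and 24). The sketch below assumes they are laid out in scikit-learn's convention, with rows as true classes and columns as predicted classes, and also derives precision and recall for the "buy" class as an extra illustration.

```python
import numpy as np

# Counts quoted in the article, arranged as [[TN, FP], [FN, TP]]
# (rows = true class, columns = predicted class).
cm = np.array([[65, 3],
               [8, 24]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()  # correct predictions / all predictions
precision = tp / (tp + fp)       # of the predicted buyers, how many really bought
recall = tp / (tp + fn)          # of the real buyers, how many were found

print(f"accuracy={accuracy:.2f}, precision={precision:.3f}, recall={recall:.2f}")
```

So 89 of the 100 test customers are classified correctly, while 8 of the 32 actual buyers are missed, which accuracy alone does not reveal.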

Predict for new data:

Let’s take input values and see what our model predicts:

Predicting new results:

age = int(input("Enter the age: "))
salary = int(input("Enter the estimated salary: "))
# Scale the new input with the scaler fitted on the training data, then predict
result = classifier.predict(sc.transform([[age, salary]]))
[Figure: predicting new results]

Let’s make the result more readable:

if result[0] == 1:
  print("Yay! This customer can buy a car!")
else:
  print("Sorry! It seems this customer won't buy a car")

Let’s print the result again for the previous inputs.

[Figure: readable output]

Let’s run it for one more input:

[Figure: model check]

Seems great!
For more detailed info, you can check

This is how we can use Machine Learning with Python to build such an analysis, decide which customers to target, and retain them.

Author: Ayush Purawr