Profile Analysis

October, 2023 by Leon Schedlin Czarlinski
Data
Python
Clustering

In this project we cover clustering, an unsupervised learning technique that groups similar data points together based on their characteristics. The goal is to find structure in a dataset so that similar data points end up in the same group while dissimilar ones stay separate.
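To make the idea concrete, here is a minimal sketch using scikit-learn's `KMeans`. The two groups of points are synthetic and purely hypothetical (they are not taken from the project's dataset); they only illustrate how similar points end up with the same cluster label.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical groups of customers: (annual income in k$, spending score).
# These values are synthetic and only serve as an illustration.
rng = np.random.default_rng(seed=0)
low_spenders = rng.normal(loc=[30, 20], scale=3, size=(50, 2))
high_spenders = rng.normal(loc=[90, 80], scale=3, size=(50, 2))
points = np.vstack([low_spenders, high_spenders])

# Ask K-Means for two clusters and get one label per point.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster centers:")
print(kmeans.cluster_centers_)
```

Because the two groups are far apart, every point in a group should receive the same label, and the two groups should receive different labels.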

Think of this project from a business perspective: based on customer profiles, a business can identify distinct clusters and tailor its experience, offers, services, and products to each one.

Check the analysis

Below, I briefly explain the data analytics methodology, the tasks involved in the project, the data and code sources, and the conclusions.

Data Analytics Methodology

Transforming data into insights involves the six steps of data analytics: ask, prepare, process, analyze, share, and act.

Figure: visualization of the six-step methodology for transforming data into insights

This project does not cover every step above; they are listed up front to show what a complete data analysis process looks like.

Tasks in this project

  1. Understand the problem statement
  2. Import libraries and data sets
  3. Perform exploratory analysis and data visualization
  4. Clustering with K-Means and DBSCAN
  5. Conclusions
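As a rough preview of task 4, the sketch below runs both algorithms on the same hypothetical points. K-Means assigns every point to one of a fixed number of clusters, while DBSCAN groups points by density and, in scikit-learn, labels points that belong to no dense region as `-1` (noise). The data and the `eps` and `min_samples` values are assumptions chosen for this toy example, not the settings used in the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Hypothetical 2-D data: two dense 5x5 grids of points plus one isolated point.
offsets = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
group_a = offsets                     # dense group starting at (0, 0)
group_b = offsets + [20.0, 20.0]      # dense group starting at (20, 20)
outlier = np.array([[10.0, 40.0]])    # far away from both groups
points = np.vstack([group_a, group_b, outlier])

# K-Means: the number of clusters is fixed up front, every point gets a cluster.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# DBSCAN: clusters are dense regions; points outside any dense region are noise.
dbscan_labels = DBSCAN(eps=1.5, min_samples=4).fit_predict(points)

print("K-Means label of the isolated point:", kmeans_labels[-1])
print("DBSCAN label of the isolated point:", dbscan_labels[-1])
```

The practical difference is that K-Means needs the number of clusters in advance, whereas DBSCAN needs a neighborhood radius and a minimum neighbor count and can leave outliers unassigned.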

Data source

For this project we will use a dataset called "Mall Customer Segmentation Data", available on Kaggle. Click here to access.

This dataset was created only for learning customer segmentation concepts (its description also calls this market basket analysis, a term more commonly used for association-rule mining on transaction data). Five features are available:

Feature             Type         Description
Customer ID         Integer      Unique ID assigned to the customer
Gender              Categorical  Gender of the customer
Age                 Integer      Age of the customer
Annual Income (k$)  Integer      Annual income of the customer, in thousands of dollars
Spending Score      Integer      Score assigned by the mall based on customer behavior and spending nature (1-100)
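The following sketch shows how a table with this schema might be represented and sanity-checked in pandas. The rows are made up for illustration and are not records from the dataset, and the column names follow the table above, so they may differ slightly from the headers in the actual CSV file.

```python
import pandas as pd

# Hypothetical rows that only mimic the schema described above;
# they are not real records from the dataset.
sample = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 52, 27],
    "Annual Income (k$)": [15, 60, 88, 42],
    "Spending Score": [39, 81, 12, 55],
})

# Treat Gender as a categorical column, as the table describes it.
sample["Gender"] = sample["Gender"].astype("category")

# The ID only identifies a row, so it is left out of the numeric features.
features = sample.drop(columns=["Customer ID", "Gender"])

# Sanity check: spending scores should lie in the documented 1-100 range.
scores_in_range = sample["Spending Score"].between(1, 100).all()

print(features.describe().loc[["min", "max"]])
print("Spending scores within 1-100:", scores_in_range)
```

Dropping the ID before clustering is a design choice made here for illustration: it carries no information about customer behavior, so including it would only distort distances between customers.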

Code source

To develop the analysis, I used the Colab notebook available on Kaggle. Click here to access the file and see the results.


  #Data
  import pandas as pd
  import numpy as np
  
  #Data Visualization
  import matplotlib.pyplot as plt
  import seaborn as sns
  from matplotlib import style
  
  #Clustering Models
  from sklearn.cluster import DBSCAN, KMeans
  
  #Ignore Warnings
  import warnings
  warnings.filterwarnings('ignore')
  
  #List the input files available in the Kaggle environment
  import os
  for dirname, _, filenames in os.walk('/kaggle/input'):
      for filename in filenames:
          print(os.path.join(dirname, filename))
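The loop above only prints the available files. As a possible next step, the sketch below collects the CSV paths instead, so that one of them can be passed to `pd.read_csv`. The helper name `find_csv_files` is hypothetical and not part of the original notebook.

```python
import os

def find_csv_files(root):
    """Return the sorted paths of all .csv files found under root."""
    matches = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(".csv"):
                matches.append(os.path.join(dirname, filename))
    return sorted(matches)

# On Kaggle this lists the attached dataset files; elsewhere the directory
# does not exist and the result is simply an empty list.
csv_paths = find_csv_files("/kaggle/input")
print(csv_paths)
```

If exactly one CSV file is found, it could then be loaded with `pd.read_csv(csv_paths[0])`.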