Customer Segmentation using K-Means

A Practical Guide in Python for Data Science and Machine Learning

Introduction to Customer Segmentation

Customer segmentation is a fundamental marketing strategy that divides a company's existing customers into groups based on shared characteristics. This allows businesses to tailor marketing campaigns, product development, and customer service efforts to specific segments, thereby increasing efficiency and effectiveness.

K-Means clustering is a popular unsupervised machine learning algorithm used for segmentation. It works by partitioning data points into a predefined number of clusters (k), where each data point belongs to the cluster with the nearest mean (cluster centroid).
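To make the "nearest mean" idea concrete, here is a minimal NumPy sketch of the assign-then-update loop commonly known as Lloyd's algorithm. The function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for illustration; this is not how Scikit-learn implements K-Means internally.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Illustrative K-Means loop: assign points, update centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return labels, centroids

# Two toy blobs (hypothetical data, for illustration only).
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10, 10], scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans_sketch(toy, k=2)
print(centroids)
```

The result depends on the random starting centroids, which is why library implementations typically restart several times and keep the best run.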

The Dataset

For this demonstration, we'll use a hypothetical customer dataset that includes the following features: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100).

Our goal is to segment customers into distinct groups based on their income and spending habits.

Implementation with Python

We'll use Pandas for data manipulation, NumPy for generating sample data, Matplotlib and Seaborn for visualization, and Scikit-learn for scaling and the K-Means implementation.

Step 1: Import Libraries and Load Data


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
            

Let's assume you have a CSV file named customers.csv with 'Annual Income (k$)' and 'Spending Score (1-100)' columns. For demonstration, we'll create a sample DataFrame with the same column names.


# Sample Data Creation (replace with pd.read_csv('customers.csv'))
np.random.seed(42)  # fix the seed so the sample data is reproducible
data = {
    'CustomerID': range(1, 201),
    'Gender': np.random.choice(['Male', 'Female'], 200),
    'Age': np.random.randint(18, 70, 200),
    'Annual Income (k$)': np.random.randint(15, 140, 200),
    'Spending Score (1-100)': np.random.randint(1, 101, 200)  # upper bound is exclusive
}
df = pd.DataFrame(data)

# For K-Means, we'll focus on 'Annual Income (k$)' and 'Spending Score (1-100)'
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
            

Step 2: Data Preprocessing (Scaling)

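K-Means assigns points by Euclidean distance, so a feature with a larger numeric range can dominate the clustering. Here the two ranges are comparable (roughly 15 to 140 and 1 to 100), so scaling changes little, but it is good practice and becomes crucial when features are measured on very different scales.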


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
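
To see what the scaling step does, the sketch below compares StandardScaler with a hand-written z-score on a few made-up rows. The array `X_demo` and its values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical income (k$) and spending-score rows, for illustration only.
X_demo = np.array([[15.0, 39.0], [40.0, 81.0], [75.0, 6.0], [137.0, 77.0]])

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0).
X_scaled_demo = StandardScaler().fit_transform(X_demo)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Both approaches should agree, and each scaled column should have
# mean 0 and standard deviation 1 (up to floating-point error).
print(np.allclose(X_scaled_demo, X_manual))
print(X_scaled_demo.mean(axis=0), X_scaled_demo.std(axis=0))
```

After scaling, a one-unit difference means "one standard deviation" in either feature, so neither column outweighs the other in the distance calculation.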
            

Step 3: Determine the Optimal Number of Clusters (k) using the Elbow Method

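The Elbow method involves running K-Means for a range of k values and plotting the within-cluster sum of squares (inertia) against k. Inertia generally keeps decreasing as k grows, so the goal is not the minimum but the "elbow": the point where adding another cluster yields only a small further reduction. That point suggests a reasonable k.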


inertia = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
            

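Run the code and look for the point on the curve where inertia stops dropping sharply. Because the sample data here is uniformly random, it has no real cluster structure and the bend may be faint; a real customer dataset will usually show a clearer elbow. For this example, let's assume k=5 is chosen.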
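To make the inertia values on the elbow plot less abstract, the following sketch recomputes inertia by hand and compares it with the `inertia_` attribute of a fitted model. The `points` array is hypothetical random data standing in for already-scaled features, and k=3 is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))  # hypothetical, already-scaled features

model = KMeans(n_clusters=3, random_state=42, n_init=10).fit(points)

# Inertia is the sum of squared distances from each point
# to the centroid of the cluster it was assigned to.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = np.sum((points - assigned_centroids) ** 2)

print(manual_inertia, model.inertia_)
```

The two numbers should match up to floating-point error, which confirms that the elbow plot is tracking how tightly points sit around their centroids.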

Step 4: Apply K-Means Clustering


k = 5
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(X_scaled)
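
Because the model was fitted on scaled features, its centroids are expressed in standard-deviation units. A minimal sketch of mapping them back to k$ and score units with `inverse_transform` follows; the `demo` DataFrame is hypothetical random data that mirrors the sample columns, so the numbers it prints carry no business meaning.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the sample DataFrame used in this guide.
rng = np.random.default_rng(42)
features = ['Annual Income (k$)', 'Spending Score (1-100)']
demo = pd.DataFrame({
    features[0]: rng.integers(15, 140, 200),
    features[1]: rng.integers(1, 101, 200),
})

scaler = StandardScaler()
scaled = scaler.fit_transform(demo[features].to_numpy())
model = KMeans(n_clusters=5, random_state=42, n_init=10).fit(scaled)

# Centroids live in scaled space; map them back to the original units.
centroids = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=features,
).round(1)
# Count how many customers fall into each cluster.
centroids['Customers'] = np.bincount(model.labels_, minlength=5)
print(centroids)
```

Reading centroids in original units makes it easier to describe a cluster in plain terms, such as its typical income and spending score.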
            

Step 5: Visualize the Clusters

Now, let's visualize the segmented customers on a scatter plot, colored by their assigned cluster.


plt.figure(figsize=(12, 8))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', palette='viridis', s=100, alpha=0.7)
plt.title('Customer Segments based on Income and Spending Score')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend(title='Cluster')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
            

Interpreting the Segments

Once the clusters are visualized, we can analyze the characteristics of each segment to develop targeted strategies. For example, with 5 clusters:

Note: The exact interpretation depends on the actual data distribution and business context.
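One way to start that analysis is a per-cluster summary table. The sketch below builds one with a Pandas groupby; the `demo` DataFrame and its randomly assigned 'Cluster' labels are hypothetical stand-ins for the labelled `df` from Step 4, and the summary column names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical labelled customers; random labels stand in for K-Means output.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, 200),
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 101, 200),
    'Cluster': rng.integers(0, 5, 200),
})

# Summarize each segment: its size and the average of each feature.
profile = (
    demo.groupby('Cluster')
    .agg(
        customers=('Age', 'size'),
        mean_age=('Age', 'mean'),
        mean_income=('Annual Income (k$)', 'mean'),
        mean_spending=('Spending Score (1-100)', 'mean'),
    )
    .round(1)
)
print(profile)
```

With real data, comparing rows of this table (for instance high income with low spending versus low income with high spending) is a common starting point for naming the segments.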

Further Enhancements

This is a basic example. For more robust segmentation, consider: