Customer Segmentation with K-Means Clustering: A Practical Guide in Python for Data Science and Machine Learning
Customer segmentation is a fundamental marketing strategy that divides a company's existing customers into groups based on shared characteristics. This allows businesses to tailor marketing campaigns, product development, and customer service efforts to specific segments, thereby increasing efficiency and effectiveness.
K-Means clustering is a popular unsupervised machine learning algorithm used for segmentation. It works by partitioning data points into a predefined number of clusters (k), where each data point belongs to the cluster with the nearest mean (cluster centroid).
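To make the mechanics concrete, here is a minimal sketch of the two alternating K-Means steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) on tiny hypothetical 1-D data with k=2:

```python
import numpy as np

# Toy 1-D data: two obvious groups, around 0-2 and 9-11 (hypothetical values)
points = np.array([0.0, 1.0, 2.0, 9.0, 10.0, 11.0])
centroids = np.array([0.0, 10.0])  # initial guesses for k=2

for _ in range(10):
    # Assignment step: each point joins the cluster with the nearest centroid
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([points[labels == j].mean() for j in range(2)])

print(centroids)  # converges to the two group means: [1.0, 10.0]
```

Scikit-learn's `KMeans` does the same thing in any number of dimensions, with smarter initialization (k-means++) and multiple restarts.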
For this demonstration, we'll use a hypothetical customer dataset that includes features such as CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100).
Our goal is to segment customers into distinct groups based on their income and spending habits.
We'll leverage libraries like Pandas for data manipulation, Matplotlib and Seaborn for visualization, and Scikit-learn for the K-Means implementation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
Let's assume you have a CSV file named customers.csv with 'Annual Income' and 'Spending Score' columns. For demonstration, we'll create a sample DataFrame.
# Sample data creation (replace with pd.read_csv('customers.csv'))
np.random.seed(42)  # fix the seed so the sample data is reproducible
data = {
'CustomerID': range(1, 201),
'Gender': np.random.choice(['Male', 'Female'], 200),
'Age': np.random.randint(18, 70, 200),
'Annual Income (k$)': np.random.randint(15, 140, 200),
'Spending Score (1-100)': np.random.randint(1, 100, 200)
}
df = pd.DataFrame(data)
# For K-Means, we'll focus on 'Annual Income (k$)' and 'Spending Score (1-100)'
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
It's good practice to scale features that have different ranges. Here the two ranges are roughly comparable (15-140 vs. 1-100), but for datasets with features on very different scales, standardization is essential: K-Means relies on Euclidean distance, so an unscaled feature with a large range would dominate the clustering.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
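As a quick sanity check, after `StandardScaler` each column should have mean approximately 0 and standard deviation approximately 1. A small self-contained sketch (using hypothetical stand-in data for `X`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X: two features on different scales
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 100, 200),
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardization, each column has mean ~0 and std ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```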
The Elbow method runs K-Means over a range of k values and plots the inertia (the within-cluster sum of squared distances from each sample to its closest cluster center). Inertia always decreases as k grows, so we look for the "elbow": the point where the rate of decrease sharply flattens, which indicates a reasonable choice of k.
inertia = []
k_range = range(1, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_scaled)
inertia.append(kmeans.inertia_)
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Running the code above produces the elbow curve, from which we can read off a reasonable number of clusters. For this example, let's assume the elbow appears at k=5.
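The elbow is sometimes ambiguous, so it can help to cross-check k with the silhouette score from `sklearn.metrics`, which rewards tight, well-separated clusters (values closer to 1 are better). A sketch on hypothetical synthetic data standing in for `X_scaled`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for X_scaled
rng = np.random.default_rng(42)
X_scaled = rng.standard_normal((200, 2))

# Silhouette is only defined for k >= 2
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

If the silhouette-preferred k agrees with the elbow, that strengthens the choice; if not, inspect both candidates visually.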
k = 5
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(X_scaled)
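Because the model was fit on standardized data, `kmeans.cluster_centers_` live in scaled space; mapping them back with `scaler.inverse_transform` gives centroids in the original units, which are much easier to interpret. A self-contained sketch (rebuilding hypothetical data like the tutorial's):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data matching the tutorial's two features
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 100, 200),
})
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Centroids back in original units (k$ and score points)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=X.columns)
print(centers_df.round(1))
```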
Now, let's visualize the segmented customers on a scatter plot, colored by their assigned cluster.
This scatter plot displays customers based on their Annual Income and Spending Score. Each color represents a different customer segment identified by the K-Means algorithm.
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', palette='viridis', s=100, alpha=0.7)
plt.title('Customer Segments based on Income and Spending Score')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend(title='Cluster')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Once the clusters are visualized and interpreted, we can analyze the characteristics of each segment to develop targeted strategies. With 5 clusters, an income-versus-spending segmentation typically yields profiles such as: high income / high spending (prime candidates for premium offers and loyalty programs), high income / low spending (potential to convert with personalized outreach), low income / high spending (price-sensitive but engaged), low income / low spending (low marketing priority), and average income / average spending (the mainstream core).
Note: The exact interpretation depends on the actual data distribution and business context.
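A simple way to profile the segments numerically is a `groupby` on the cluster label, averaging each feature and counting segment sizes. A self-contained sketch on hypothetical data shaped like the tutorial's `df`:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rebuild a hypothetical segmented frame like the tutorial's df
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'Age': rng.integers(18, 70, 200),
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 100, 200),
})
X_scaled = StandardScaler().fit_transform(df[['Annual Income (k$)', 'Spending Score (1-100)']])
df['Cluster'] = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X_scaled)

# Average profile and size of each segment
profile = df.groupby('Cluster').agg(
    customers=('Age', 'size'),
    avg_age=('Age', 'mean'),
    avg_income=('Annual Income (k$)', 'mean'),
    avg_spending=('Spending Score (1-100)', 'mean'),
).round(1)
print(profile)
```

The resulting table is a compact summary you can hand to a marketing team alongside the scatter plot.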
This is a basic example. For more robust segmentation, consider: including more features (e.g., age, purchase frequency, recency); validating k with the silhouette score in addition to the elbow method; trying algorithms such as hierarchical clustering or DBSCAN, which do not require specifying k up front; and applying dimensionality reduction (e.g., PCA) before clustering high-dimensional data.