Customer Segmentation with K-Means Clustering: A Practical Guide in Python for Data Science and Machine Learning
Customer segmentation is a fundamental marketing strategy that divides a company's existing customers into groups based on shared characteristics. This allows businesses to tailor marketing campaigns, product development, and customer service efforts to specific segments, thereby increasing efficiency and effectiveness.
K-Means clustering is a popular unsupervised machine learning algorithm used for segmentation. It works by partitioning data points into a predefined number of clusters (k), where each data point belongs to the cluster with the nearest mean (cluster centroid).
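To make the mechanics concrete, here is a minimal sketch of the two alternating K-Means steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) on tiny hypothetical 1-D data with k=2:

```python
import numpy as np

# Toy 1-D data: two obvious groups, around 0-2 and 9-11 (hypothetical values)
points = np.array([0.0, 1.0, 2.0, 9.0, 10.0, 11.0])
centroids = np.array([0.0, 10.0])  # initial guesses for k=2

for _ in range(10):
    # Assignment step: each point joins the cluster with the nearest centroid
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([points[labels == j].mean() for j in range(2)])

print(centroids)  # converges to the two group means: [1.0, 10.0]
```

Scikit-learn's `KMeans` does the same thing in any number of dimensions, with smarter initialization (k-means++) and multiple restarts.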
For this demonstration, we'll use a hypothetical customer dataset that includes features such as CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100).
Our goal is to segment customers into distinct groups based on their income and spending habits.
We'll leverage libraries like Pandas for data manipulation, Matplotlib and Seaborn for visualization, and Scikit-learn for the K-Means implementation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
Let's assume you have a CSV file named customers.csv with 'Annual Income' and 'Spending Score' columns. For demonstration, we'll create a sample DataFrame.
# Sample data creation (replace with pd.read_csv('customers.csv'))
np.random.seed(42)  # fix the seed so the sample data is reproducible
data = {
'CustomerID': range(1, 201),
'Gender': np.random.choice(['Male', 'Female'], 200),
'Age': np.random.randint(18, 70, 200),
'Annual Income (k$)': np.random.randint(15, 140, 200),
'Spending Score (1-100)': np.random.randint(1, 100, 200)
}
df = pd.DataFrame(data)
# For K-Means, we'll focus on 'Annual Income (k$)' and 'Spending Score (1-100)'
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
It's good practice to scale features that have different ranges. Here the two ranges are roughly comparable (15-140 vs. 1-100), but for datasets with features on very different scales, standardization is essential: K-Means relies on Euclidean distance, so an unscaled feature with a large range would dominate the clustering.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
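As a quick sanity check, after `StandardScaler` each column should have mean approximately 0 and standard deviation approximately 1. A small self-contained sketch (using hypothetical stand-in data for `X`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X: two features on different scales
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 100, 200),
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardization, each column has mean ~0 and std ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```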
The Elbow method runs K-Means over a range of k values and plots the inertia (the within-cluster sum of squared distances from each sample to its closest cluster center). Inertia always decreases as k grows, so we look for the "elbow": the point where the rate of decrease sharply flattens, which indicates a reasonable choice of k.
inertia = []
k_range = range(1, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_scaled)
inertia.append(kmeans.inertia_)
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Running the code above produces the elbow curve, from which we can read off a reasonable number of clusters. For this example, let's assume the elbow appears at k=5.
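The elbow is sometimes ambiguous, so it can help to cross-check k with the silhouette score from `sklearn.metrics`, which rewards tight, well-separated clusters (values closer to 1 are better). A sketch on hypothetical synthetic data standing in for `X_scaled`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for X_scaled
rng = np.random.default_rng(42)
X_scaled = rng.standard_normal((200, 2))

# Silhouette is only defined for k >= 2
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

If the silhouette-preferred k agrees with the elbow, that strengthens the choice; if not, inspect both candidates visually.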
k = 5
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(X_scaled)
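Because the model was fit on standardized data, `kmeans.cluster_centers_` live in scaled space; mapping them back with `scaler.inverse_transform` gives centroids in the original units, which are much easier to interpret. A self-contained sketch (rebuilding hypothetical data like the tutorial's):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data matching the tutorial's two features
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 100, 200),
})
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Centroids back in original units (k$ and score points)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=X.columns)
print(centers_df.round(1))
```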
Now, let's visualize the segmented customers on a scatter plot, colored by their assigned cluster.
This scatter plot displays customers based on their Annual Income and Spending Score. Each color represents a different customer segment identified by the K-Means algorithm.
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', palette='viridis', s=100, alpha=0.7)
plt.title('Customer Segments based on Income and Spending Score')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend(title='Cluster')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Once the clusters are visualized and interpreted, we can analyze the characteristics of each segment to develop targeted strategies. With 5 clusters, an income-versus-spending segmentation typically yields profiles such as: high income / high spending (prime candidates for premium offers and loyalty programs), high income / low spending (potential to convert with personalized outreach), low income / high spending (price-sensitive but engaged), low income / low spending (low marketing priority), and average income / average spending (the mainstream core).
Note: The exact interpretation depends on the actual data distribution and business context.
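A simple way to profile the segments numerically is a `groupby` on the cluster label, averaging each feature and counting segment sizes. A self-contained sketch on hypothetical data shaped like the tutorial's `df`:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rebuild a hypothetical segmented frame like the tutorial's df
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'Age': rng.integers(18, 70, 200),
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 100, 200),
})
X_scaled = StandardScaler().fit_transform(df[['Annual Income (k$)', 'Spending Score (1-100)']])
df['Cluster'] = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(X_scaled)

# Average profile and size of each segment
profile = df.groupby('Cluster').agg(
    customers=('Age', 'size'),
    avg_age=('Age', 'mean'),
    avg_income=('Annual Income (k$)', 'mean'),
    avg_spending=('Spending Score (1-100)', 'mean'),
).round(1)
print(profile)
```

The resulting table is a compact summary you can hand to a marketing team alongside the scatter plot.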
This is a basic example. For more robust segmentation, consider: including more features (e.g., age, purchase frequency, recency); validating k with the silhouette score in addition to the elbow method; trying algorithms such as hierarchical clustering or DBSCAN, which do not require specifying k up front; and applying dimensionality reduction (e.g., PCA) before clustering high-dimensional data.