Clustering in Python for Data Science and ML
Clustering is a fundamental technique in unsupervised machine learning used to group data points into distinct clusters based on their similarity. This process helps in discovering hidden patterns, segmenting data, and gaining insights without prior knowledge of the groupings.
Key Concepts in Clustering
- Similarity/Dissimilarity: The core idea is to measure how alike or different data points are. Common metrics include Euclidean distance, Manhattan distance, and Cosine similarity.
- Cluster Centroids: In many algorithms, clusters are represented by a central point (centroid) that summarizes the data points within that cluster.
- Number of Clusters (k): Determining the optimal number of clusters is a common challenge and often requires domain knowledge or evaluation metrics.
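The distance metrics mentioned above are easy to try out directly. Here is a quick sketch using SciPy's distance functions (the vectors `a` and `b` are made-up illustrative values; note that SciPy's `cosine` returns a *distance*, so cosine similarity is `1 - cosine(a, b)`):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(euclidean(a, b))   # straight-line distance: sqrt((1-4)^2 + (2-0)^2 + 0^2)
print(cityblock(a, b))   # Manhattan distance: |1-4| + |2-0| + |3-3| = 5
print(1 - cosine(a, b))  # cosine similarity (1 minus SciPy's cosine distance)
```

Which metric is appropriate depends on the data: Euclidean distance suits dense numeric features on comparable scales, while cosine similarity is common for high-dimensional sparse data such as text vectors.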
Popular Clustering Algorithms
1. K-Means Clustering
K-Means is one of the most widely used clustering algorithms. It partitions data into k predefined clusters, aiming to minimize the variance within each cluster. The algorithm iteratively assigns data points to the nearest centroid and updates the centroids based on the mean of the assigned points.
Let's see how to implement K-Means using scikit-learn:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Initialize and fit K-Means model
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)  # n_init=10: run 10 centroid initializations and keep the best
kmeans.fit(X)
# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k', s=50, alpha=0.7)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
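As noted earlier, choosing k is a common challenge. One standard heuristic is the elbow method: plot the within-cluster sum of squares (scikit-learn's `inertia_` attribute) for a range of k values and look for the "elbow" where improvements level off. A minimal sketch, reusing the same synthetic blob data as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Same synthetic data as in the K-Means example above
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit K-Means for k = 1..9 and record the inertia (within-cluster sum of squares)
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
```

For this data, the curve should bend sharply around k=4, matching the four generated centers. The elbow is a heuristic, not a rule; metrics such as the silhouette score (discussed later) give a more quantitative alternative.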
2. Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters. It can be either agglomerative (bottom-up) or divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and iteratively merges the closest clusters until a single cluster remains or a stopping criterion is met.
Using Agglomerative Clustering from scikit-learn:
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Generate sample data
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=0)
# Perform hierarchical clustering
# 'ward' minimizes the variance of the clusters being merged
linked = linkage(X, 'ward')
# Plotting the dendrogram
plt.figure(figsize=(12, 7))
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
# To get clusters, we can cut the dendrogram at a certain height
# or specify the number of clusters
# Example: Cut to get 3 clusters
ac = AgglomerativeClustering(n_clusters=3)
labels_hier = ac.fit_predict(X)
# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels_hier, cmap='plasma', marker='o', edgecolor='k', s=50, alpha=0.7)
plt.title('Hierarchical Clustering (k=3)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
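The "cut the dendrogram at a certain height" idea mentioned above can be done programmatically with SciPy's `fcluster`, which extracts flat cluster labels from the same linkage matrix. A brief sketch (the distance threshold `t=10` is an illustrative value chosen for this synthetic data, not a general recommendation):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=0)
linked = linkage(X, 'ward')

# Cut the tree at a distance threshold: merges above t=10 are not performed
labels_cut = fcluster(linked, t=10, criterion='distance')
print('clusters at distance 10:', len(set(labels_cut)))

# Or request an exact number of flat clusters instead
labels_k = fcluster(linked, t=3, criterion='maxclust')
print('clusters with maxclust=3:', len(set(labels_k)))
```

Reading cut heights off the dendrogram plot is a practical way to pick the threshold: a long vertical gap between merge heights suggests a natural place to cut.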
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that groups points that are closely packed, marking points in low-density regions as outliers (labeled -1 in scikit-learn). It is effective at finding arbitrarily shaped clusters and does not require specifying the number of clusters beforehand.
Implementing DBSCAN with scikit-learn:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_circles
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data with different shapes
X_moons, _ = make_moons(n_samples=200, noise=.05, random_state=0)
X_circles, _ = make_circles(n_samples=200, factor=.5, noise=.05, random_state=0)
# DBSCAN for moons
dbscan_moons = DBSCAN(eps=0.3, min_samples=5)
labels_moons = dbscan_moons.fit_predict(X_moons)
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap='Set1', marker='o', edgecolor='k', s=50, alpha=0.7)
plt.title('DBSCAN on Moons')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, linestyle='--', alpha=0.6)
# DBSCAN for circles
dbscan_circles = DBSCAN(eps=0.4, min_samples=5)
labels_circles = dbscan_circles.fit_predict(X_circles)
plt.subplot(1, 2, 2)
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=labels_circles, cmap='Set2', marker='o', edgecolor='k', s=50, alpha=0.7)
plt.title('DBSCAN on Circles')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
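Because DBSCAN labels noise points as -1, counting clusters requires excluding that label. A short sketch on the moons data from above (parameter values mirror the earlier example):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# -1 marks noise, so it must be excluded when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f'clusters: {n_clusters}, noise points: {n_noise}')
```

On this low-noise data DBSCAN should recover the two moons; increasing the `noise` parameter of `make_moons`, or shrinking `eps`, will start producing noise points.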
Choosing the Right Algorithm
The choice of clustering algorithm depends on your data characteristics and goals:
- K-Means: Good for spherical clusters and efficient on large datasets, but sensitive to initial centroids and requires specifying k.
- Hierarchical Clustering: Provides a dendrogram visualization useful for understanding cluster relationships and doesn't require pre-specifying k, but can be computationally expensive for large datasets.
- DBSCAN: Excellent for arbitrarily shaped clusters and identifying outliers, and does not require pre-specifying k, but is sensitive to parameter tuning (eps and min_samples).
Evaluating Clustering Performance
Since clustering is unsupervised, evaluation is more complex. Common metrics include:
- Silhouette Score: Measures how similar a data point is to its own cluster compared to other clusters. Ranges from -1 to 1; higher values indicate better-separated clusters.
- Davies-Bouldin Index: Measures the average similarity ratio of each cluster with its most similar cluster. Lower values indicate better clustering.
- Calinski-Harabasz Index (Variance Ratio Criterion): Measures the ratio of between-cluster variance to within-cluster variance. Higher values indicate better clustering.
For datasets with ground truth labels (though uncommon in pure unsupervised settings), metrics like Adjusted Rand Index (ARI) or Fowlkes-Mallows Index can be used.
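All of these metrics are available in scikit-learn's `sklearn.metrics` module. A minimal sketch that computes them on the K-Means result from earlier (reusing the same synthetic blob data; since `make_blobs` returns the generating labels, the Adjusted Rand Index can be shown here too):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score)

# Same synthetic data as in the K-Means example; make_blobs also returns
# the ground-truth labels, which we keep here for the ARI
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

sil = silhouette_score(X, labels)          # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)       # lower is better
ch = calinski_harabasz_score(X, labels)    # higher is better
ari = adjusted_rand_score(y_true, labels)  # requires ground truth; 1.0 = perfect

print(f'Silhouette:        {sil:.3f}')
print(f'Davies-Bouldin:    {db:.3f}')
print(f'Calinski-Harabasz: {ch:.1f}')
print(f'Adjusted Rand:     {ari:.3f}')
```

On well-separated blobs like these, expect a high silhouette score, a low Davies-Bouldin index, and an ARI near 1; on real data the internal metrics (the first three) are usually all you have.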