Clustering in Python for Data Science and ML

Clustering is a fundamental technique in unsupervised machine learning used to group data points into distinct clusters based on their similarity. This process helps in discovering hidden patterns, segmenting data, and gaining insights without prior knowledge of the groupings.

Key Concepts in Clustering

Before diving into specific algorithms, a few terms recur throughout:

- Cluster: a group of data points that are more similar to each other than to points in other groups.
- Centroid: the center of a cluster, typically the mean of its points.
- Distance metric: the measure of similarity between points (e.g., Euclidean distance).
- Density: how closely packed the points in a region are; some algorithms define clusters by density rather than by distance to a center.

Popular Clustering Algorithms

1. K-Means Clustering

K-Means is one of the most widely used clustering algorithms. It partitions data into k predefined clusters, aiming to minimize the variance within each cluster. The algorithm iteratively assigns data points to the nearest centroid and updates the centroids based on the mean of the assigned points.
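The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the optimized implementation scikit-learn uses:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Note that this sketch omits details a production implementation handles, such as empty clusters and smarter initialization (k-means++).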

K-Means Example

Let's see how to implement K-Means using scikit-learn:


from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Initialize and fit K-Means model
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10) # set n_init explicitly to avoid a FutureWarning in recent scikit-learn versions
kmeans.fit(X)

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k', s=50, alpha=0.7)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
                    
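Because K-Means requires k to be chosen in advance, a common heuristic is the elbow method: fit the model for a range of k values, plot the within-cluster sum of squared distances (exposed as inertia_), and look for the point where improvement levels off. A quick sketch on the same data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Same sample data as above
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit K-Means for k = 1..9 and record the inertia of each fit
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```

For this data the curve should bend sharply around k=4, matching the number of generated centers.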

2. Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters. It can be either agglomerative (bottom-up) or divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and iteratively merges the closest clusters until a single cluster remains or a stopping criterion is met.

Hierarchical Clustering Example

Using Agglomerative Clustering from scikit-learn:


from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate sample data (a smaller set with 3 centers)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=0)

# Perform hierarchical clustering
# 'ward' minimizes the variance of the clusters being merged
linked = linkage(X, 'ward')

# Plotting the dendrogram
plt.figure(figsize=(12, 7))
dendrogram(linked,
            orientation='top',
            distance_sort='descending',
            show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# To get clusters, we can cut the dendrogram at a certain height
# or specify the number of clusters
# Example: Cut to get 3 clusters
ac = AgglomerativeClustering(n_clusters=3)
labels_hier = ac.fit_predict(X)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels_hier, cmap='plasma', marker='o', edgecolor='k', s=50, alpha=0.7)
plt.title('Hierarchical Clustering (k=3)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
                    
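As the comments in the example note, the dendrogram can also be cut at a distance threshold instead of a fixed cluster count. SciPy's fcluster supports both, reusing the same linkage matrix (the threshold of 10 below is an illustrative value, not a tuned one):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
import numpy as np

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=0)
linked = linkage(X, 'ward')

# Cut the tree wherever merges exceed a chosen distance
labels_by_distance = fcluster(linked, t=10, criterion='distance')

# Or ask for a fixed number of flat clusters
labels_by_count = fcluster(linked, t=3, criterion='maxclust')

print(np.unique(labels_by_count))  # note: scipy cluster ids start at 1
```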

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm that groups closely packed points together and marks points in low-density regions as outliers. It is effective at finding arbitrarily shaped clusters and does not require specifying the number of clusters beforehand.

DBSCAN Example

Implementing DBSCAN with scikit-learn:


from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons, make_circles
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data with different shapes
X_moons, _ = make_moons(n_samples=200, noise=.05, random_state=0)
X_circles, _ = make_circles(n_samples=200, factor=.5, noise=.05, random_state=0)

# DBSCAN for moons
dbscan_moons = DBSCAN(eps=0.3, min_samples=5)
labels_moons = dbscan_moons.fit_predict(X_moons)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap='Set1', marker='o', edgecolor='k', s=50, alpha=0.7)
plt.title('DBSCAN on Moons')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, linestyle='--', alpha=0.6)

# DBSCAN for circles
dbscan_circles = DBSCAN(eps=0.4, min_samples=5)
labels_circles = dbscan_circles.fit_predict(X_circles)

plt.subplot(1, 2, 2)
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=labels_circles, cmap='Set2', marker='o', edgecolor='k', s=50, alpha=0.7)
plt.title('DBSCAN on Circles')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, linestyle='--', alpha=0.6)

plt.tight_layout()
plt.show()
                    
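DBSCAN assigns the label -1 to points it considers noise, so outliers can be inspected directly. A brief sketch using the moons data from the example above:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import numpy as np

X_moons, _ = make_moons(n_samples=200, noise=.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

# -1 marks noise, so exclude it when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```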

Choosing the Right Algorithm

The choice of clustering algorithm depends on your data characteristics and goals:

- K-Means: fast and simple; works best when clusters are roughly spherical, similar in size, and the number of clusters is known.
- Hierarchical clustering: useful when you want a full hierarchy of groupings (a dendrogram) or don't know the number of clusters in advance; it scales poorly to very large datasets.
- DBSCAN: handles arbitrarily shaped clusters and noise without a predefined cluster count; it requires tuning eps and min_samples and struggles when cluster densities vary widely.

Evaluating Clustering Performance

Since clustering is unsupervised, evaluation is more complex than for supervised learning. Common internal metrics include:

- Silhouette score: how similar each point is to its own cluster compared to the nearest other cluster (ranges from -1 to 1; higher is better).
- Davies-Bouldin index: the average similarity between each cluster and its most similar one (lower is better).
- Calinski-Harabasz index: the ratio of between-cluster to within-cluster dispersion (higher is better).

For datasets with ground-truth labels (uncommon in purely unsupervised settings), metrics such as the Adjusted Rand Index (ARI) or the Fowlkes-Mallows index can be used.
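When no labels are available, internal metrics such as the silhouette score can be computed directly from the data and the cluster assignments. A short sketch using the K-Means setup from earlier:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

# Values near 1 indicate compact, well-separated clusters
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.3f}")
```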
