Scikit-learn Clustering

Explore unsupervised learning techniques with Scikit-learn's robust clustering algorithms. Discover how to group similar data points without prior knowledge of the group labels.

What is Clustering?

Clustering is an unsupervised machine learning technique that groups data points into clusters based on their similarity. Unlike classification, clustering does not require labeled data, making it ideal for discovering hidden patterns and structures within datasets. It's widely used in customer segmentation, anomaly detection, image segmentation, and document analysis.

Key Clustering Algorithms in Scikit-learn

K-Means Clustering

K-Means is one of the simplest and most popular clustering algorithms. It partitions a dataset into a pre-defined number of clusters (k). The algorithm iteratively assigns data points to the nearest centroid and updates the centroids based on the mean of the assigned points.

When to Use:

  • When you have a good idea of the number of clusters (k) you expect.
  • For large datasets where computational efficiency is important.
  • When clusters are expected to be spherical and roughly equal in size.
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Initialize KMeans with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
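When the number of clusters is not obvious, a common heuristic is the "elbow method": fit K-Means for a range of k values and look for the point where the inertia (within-cluster sum of squares, exposed as `inertia_`) stops dropping sharply. A minimal sketch on the same kind of synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same style of sample data as above: 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Inertia for a range of candidate k values; the "elbow" suggests a good k
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=0, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia always decreases as k grows, so you look for the k after which further gains are marginal — here the sharp drop should flatten around k=4, matching the generated data.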

Agglomerative Clustering

Agglomerative clustering is a hierarchical clustering method. It starts with each data point as its own cluster and iteratively merges the closest pairs of clusters until a stopping criterion (e.g., number of clusters) is met. This creates a hierarchy of clusters represented by a dendrogram.

When to Use:

  • When you want to explore the hierarchical structure of your data.
  • When the number of clusters is not known beforehand, and you want to visualize the merging process.
  • When clusters may have different shapes and sizes.
```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate sample data
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=42)

# Perform hierarchical clustering
linked = linkage(X, 'ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Apply Agglomerative Clustering (e.g., with 3 clusters)
agg_clustering = AgglomerativeClustering(n_clusters=3)
labels = agg_clustering.fit_predict(X)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='plasma')
plt.title('Agglomerative Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
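If you would rather let the dendrogram's structure determine the number of clusters, Scikit-learn's `AgglomerativeClustering` also accepts a `distance_threshold` instead of `n_clusters`: merging stops once the next merge would exceed that distance. A minimal sketch (the threshold value of 10.0 here is an illustrative choice for this synthetic data, not a general recommendation):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Same style of sample data as above: 3 blobs
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=42)

# Cut the hierarchy at a distance threshold instead of fixing n_clusters.
# n_clusters must be None when distance_threshold is used.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0,
                              linkage='ward')
labels = agg.fit_predict(X)

# The number of clusters is then discovered from the data
print("Clusters found:", agg.n_clusters_)
```

This mirrors the visual exercise of drawing a horizontal line across the dendrogram: every branch the line crosses becomes a cluster.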

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups points that are closely packed together (points with many nearby neighbors) and marks points that lie alone in low-density regions as outliers (noise). It does not require specifying the number of clusters beforehand.

When to Use:

  • When clusters are of arbitrary shape and size.
  • When you expect the presence of noise or outliers in your data.
  • When you don't know the number of clusters in advance.
```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data with non-convex shapes
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

# Initialize and fit DBSCAN
# eps: the maximum distance between two samples for one to be considered
#      as in the neighborhood of the other
# min_samples: the number of samples in a neighborhood for a point to be
#              considered a core point
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# Visualize the clusters
plt.figure(figsize=(8, 6))
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('DBSCAN Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
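Because DBSCAN is driven entirely by Euclidean distances, features on very different scales can distort the neighborhoods and ruin the clustering. A common remedy is to standardize the features first with `StandardScaler`. A minimal sketch, where the skewed second feature is a deliberately constructed example:

```python
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import numpy as np

# Construct data where the second feature dominates raw distances
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=42)
X_skewed = X * np.array([1.0, 100.0])

# Standardize so both features contribute comparably to distances
X_scaled = StandardScaler().fit_transform(X_skewed)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# DBSCAN labels noise as -1; exclude it from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters: {n_clusters}, noise points: {n_noise}")
```

Without the scaling step, an `eps` tuned for one feature's scale is meaningless for the other; standardizing first makes a single `eps` value sensible for the whole space.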

Choosing the Right Algorithm

The best clustering algorithm depends on your data and the goals of your analysis. Consider the following:

  • Do you know the number of clusters in advance? K-Means and agglomerative clustering require it; DBSCAN does not.
  • What shape are the clusters likely to be? K-Means assumes roughly spherical, equally sized clusters, while DBSCAN handles arbitrary shapes.
  • Does the data contain noise or outliers? DBSCAN explicitly labels outliers as noise; K-Means assigns every point to a cluster.
  • How large is the dataset? K-Means scales well to large datasets; agglomerative clustering is considerably more expensive.
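The trade-offs above can be made concrete by running all three algorithms on the same data and comparing what they find. A minimal sketch (the `eps` value for DBSCAN is an illustrative choice for this synthetic data):

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

models = {
    "KMeans": KMeans(n_clusters=4, random_state=0, n_init=10),
    "Agglomerative": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN labels noise as -1; exclude it from the cluster count
    n_clusters = len(set(labels)) - (1 if -1 in set(labels) else 0)
    results[name] = n_clusters
    print(f"{name}: {n_clusters} clusters")
```

Note that K-Means and agglomerative clustering report exactly the number of clusters you asked for, whereas DBSCAN's count depends on the density parameters; on messier data the three can disagree substantially, which is itself a useful diagnostic.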

Evaluating Clustering Performance

Since clustering is unsupervised, evaluation is more challenging than for supervised learning. Common metrics include:

  • Silhouette score: how similar each point is to its own cluster versus the nearest other cluster; ranges from -1 to 1, higher is better.
  • Davies-Bouldin index: average similarity between each cluster and its most similar one; lower is better.
  • Calinski-Harabasz index: ratio of between-cluster to within-cluster dispersion; higher is better.
  • Adjusted Rand index (ARI) and normalized mutual information (NMI): compare predicted clusters against ground-truth labels, when such labels happen to be available.
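The label-free metrics are all available in `sklearn.metrics` and take only the data and the predicted labels. A minimal sketch scoring a K-Means result:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Well-separated blobs, so all three metrics should look healthy
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better, in [-1, 1]
db = davies_bouldin_score(X, labels)     # lower is better
ch = calinski_harabasz_score(X, labels)  # higher is better
print(f"Silhouette: {sil:.3f}, Davies-Bouldin: {db:.3f}, "
      f"Calinski-Harabasz: {ch:.1f}")
```

Because the metrics disagree on direction (silhouette and Calinski-Harabasz reward high values, Davies-Bouldin rewards low ones), it is worth checking more than one before trusting a clustering.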

For more details, refer to the official Scikit-learn clustering documentation.