Scikit-learn Clustering

Explore unsupervised learning techniques with Scikit-learn's robust clustering algorithms. Discover how to group similar data points without prior knowledge of the group labels.

What is Clustering?

Clustering is an unsupervised machine learning technique that groups data points into clusters based on their similarity. Unlike classification, clustering does not require labeled data, making it ideal for discovering hidden patterns and structures within datasets. It's widely used in customer segmentation, anomaly detection, image segmentation, and document analysis.

Key Clustering Algorithms in Scikit-learn

K-Means Clustering

K-Means is one of the simplest and most popular clustering algorithms. It partitions a dataset into a pre-defined number of clusters (k). The algorithm iteratively assigns data points to the nearest centroid and updates the centroids based on the mean of the assigned points.

When to Use:

  • When you have a good idea of the number of clusters (k) you expect.
  • For large datasets where computational efficiency is important.
  • When clusters are expected to be spherical and roughly equal in size.
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Initialize KMeans with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
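When the number of clusters is not obvious, a common heuristic is the "elbow method": fit K-Means for a range of k values and look for the point where the inertia (within-cluster sum of squares, exposed as `inertia_`) stops dropping sharply. A minimal sketch on the same kind of synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same style of sample data as above: 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Inertia for a range of candidate k values; the "elbow" suggests a good k
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=0, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia always decreases as k grows, so you look for the k after which further gains are marginal — here the sharp drop should flatten around k=4, matching the generated data.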

Agglomerative Clustering

Agglomerative clustering is a hierarchical clustering method. It starts with each data point as its own cluster and iteratively merges the closest pairs of clusters until a stopping criterion (e.g., number of clusters) is met. This creates a hierarchy of clusters represented by a dendrogram.

When to Use:

  • When you want to explore the hierarchical structure of your data.
  • When the number of clusters is not known beforehand, and you want to visualize the merging process.
  • When clusters may have different shapes and sizes.
```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate sample data
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=42)

# Perform hierarchical clustering
linked = linkage(X, 'ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Apply Agglomerative Clustering (e.g., with 3 clusters)
agg_clustering = AgglomerativeClustering(n_clusters=3)
labels = agg_clustering.fit_predict(X)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='plasma')
plt.title('Agglomerative Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
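If you would rather let the dendrogram's structure determine the number of clusters, Scikit-learn's `AgglomerativeClustering` also accepts a `distance_threshold` instead of `n_clusters`: merging stops once the next merge would exceed that distance. A minimal sketch (the threshold value of 10.0 here is an illustrative choice for this synthetic data, not a general recommendation):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Same style of sample data as above: 3 blobs
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=42)

# Cut the hierarchy at a distance threshold instead of fixing n_clusters.
# n_clusters must be None when distance_threshold is used.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0,
                              linkage='ward')
labels = agg.fit_predict(X)

# The number of clusters is then discovered from the data
print("Clusters found:", agg.n_clusters_)
```

This mirrors the visual exercise of drawing a horizontal line across the dendrogram: every branch the line crosses becomes a cluster.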

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups points that are closely packed together (points with many nearby neighbors) and marks points that lie alone in low-density regions as outliers (noise). It does not require specifying the number of clusters beforehand.

When to Use:

  • When clusters are of arbitrary shape and size.
  • When you expect the presence of noise or outliers in your data.
  • When you don't know the number of clusters in advance.
```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data with non-convex shapes
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

# Initialize and fit DBSCAN
# eps: the maximum distance between two samples for one to be considered
#      as in the neighborhood of the other
# min_samples: the number of samples in a neighborhood for a point to be
#              considered a core point
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# Visualize the clusters
plt.figure(figsize=(8, 6))
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('DBSCAN Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
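Because DBSCAN is driven entirely by Euclidean distances, features on very different scales can distort the neighborhoods and ruin the clustering. A common remedy is to standardize the features first with `StandardScaler`. A minimal sketch, where the skewed second feature is a deliberately constructed example:

```python
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import numpy as np

# Construct data where the second feature dominates raw distances
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=42)
X_skewed = X * np.array([1.0, 100.0])

# Standardize so both features contribute comparably to distances
X_scaled = StandardScaler().fit_transform(X_skewed)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# DBSCAN labels noise as -1; exclude it from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters: {n_clusters}, noise points: {n_noise}")
```

Without the scaling step, an `eps` tuned for one feature's scale is meaningless for the other; standardizing first makes a single `eps` value sensible for the whole space.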

Choosing the Right Algorithm

The best clustering algorithm depends on your data and the goals of your analysis. Consider the following:

  • Do you know the number of clusters in advance? K-Means and agglomerative clustering require it; DBSCAN does not.
  • What shape are the clusters likely to be? K-Means assumes roughly spherical, equally sized clusters, while DBSCAN handles arbitrary shapes.
  • Does the data contain noise or outliers? DBSCAN explicitly labels outliers as noise; K-Means assigns every point to a cluster.
  • How large is the dataset? K-Means scales well to large datasets; agglomerative clustering is considerably more expensive.
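The trade-offs above can be made concrete by running all three algorithms on the same data and comparing what they find. A minimal sketch (the `eps` value for DBSCAN is an illustrative choice for this synthetic data):

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

models = {
    "KMeans": KMeans(n_clusters=4, random_state=0, n_init=10),
    "Agglomerative": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN labels noise as -1; exclude it from the cluster count
    n_clusters = len(set(labels)) - (1 if -1 in set(labels) else 0)
    results[name] = n_clusters
    print(f"{name}: {n_clusters} clusters")
```

Note that K-Means and agglomerative clustering report exactly the number of clusters you asked for, whereas DBSCAN's count depends on the density parameters; on messier data the three can disagree substantially, which is itself a useful diagnostic.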

Evaluating Clustering Performance

Since clustering is unsupervised, evaluation is more challenging than for supervised learning. Common metrics include:

  • Silhouette score: how similar each point is to its own cluster versus the nearest other cluster; ranges from -1 to 1, higher is better.
  • Davies-Bouldin index: average similarity between each cluster and its most similar one; lower is better.
  • Calinski-Harabasz index: ratio of between-cluster to within-cluster dispersion; higher is better.
  • Adjusted Rand index (ARI) and normalized mutual information (NMI): compare predicted clusters against ground-truth labels, when such labels happen to be available.
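The label-free metrics are all available in `sklearn.metrics` and take only the data and the predicted labels. A minimal sketch scoring a K-Means result:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Well-separated blobs, so all three metrics should look healthy
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better, in [-1, 1]
db = davies_bouldin_score(X, labels)     # lower is better
ch = calinski_harabasz_score(X, labels)  # higher is better
print(f"Silhouette: {sil:.3f}, Davies-Bouldin: {db:.3f}, "
      f"Calinski-Harabasz: {ch:.1f}")
```

Because the metrics disagree on direction (silhouette and Calinski-Harabasz reward high values, Davies-Bouldin rewards low ones), it is worth checking more than one before trusting a clustering.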

For more details, refer to the official Scikit-learn clustering documentation.