Explore unsupervised learning techniques with Scikit-learn's robust clustering algorithms. Discover how to group similar data points without prior knowledge of the group labels.
Clustering is a type of unsupervised machine learning algorithm used to group data points into clusters based on their similarity. Unlike classification, clustering does not require labeled data, making it ideal for discovering hidden patterns and structures within datasets. It's widely used in customer segmentation, anomaly detection, image segmentation, and document analysis.
K-Means is one of the simplest and most popular clustering algorithms. It partitions a dataset into a pre-defined number of clusters (k). The algorithm iteratively assigns data points to the nearest centroid and updates the centroids based on the mean of the assigned points.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Initialize KMeans with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_
# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
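Because K-Means needs k chosen up front, a common heuristic (not shown above) is the elbow method: fit the model for a range of k values and inspect the inertia (within-cluster sum of squared distances), looking for the point where further increases in k stop paying off. A minimal sketch reusing the same synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as above: 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit K-Means for k = 1..8 and record the inertia for each fit.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=0, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; the "elbow" where the curve
# flattens (here it should appear near k=4) marks diminishing returns.
for k, inertia in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against k makes the elbow easier to spot visually than reading the raw numbers.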
Agglomerative clustering is a hierarchical clustering method. It starts with each data point as its own cluster and iteratively merges the closest pairs of clusters until a stopping criterion (e.g., number of clusters) is met. This creates a hierarchy of clusters represented by a dendrogram.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Generate sample data
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=42)
# Perform hierarchical clustering
linked = linkage(X, 'ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
# Apply Agglomerative Clustering (e.g., with 3 clusters)
agg_clustering = AgglomerativeClustering(n_clusters=3)
labels = agg_clustering.fit_predict(X)
# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='plasma')
plt.title('Agglomerative Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
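AgglomerativeClustering defaults to Ward linkage, which matches the `linkage(X, 'ward')` call used for the dendrogram, but the merge criterion is configurable and can noticeably change the result. A short sketch comparing the available linkage options on the same data (the specific parameter values are illustrative, not a recommendation):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.70, random_state=42)

# 'ward' minimizes within-cluster variance; 'complete', 'average', and
# 'single' merge based on the max, mean, and min pairwise distance
# between clusters, respectively.
results = {}
for linkage_method in ['ward', 'complete', 'average', 'single']:
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage_method)
    labels = model.fit_predict(X)
    sizes = sorted(int((labels == c).sum()) for c in set(labels))
    results[linkage_method] = sizes
    print(f"{linkage_method:>8}: cluster sizes {sizes}")
```

Single linkage is prone to "chaining" elongated clusters, while Ward tends to produce compact, similarly sized groups; comparing the size distributions above makes that tendency visible.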
DBSCAN groups together points that are closely packed together (points with many nearby neighbors), marking points that lie alone in low-density regions as outliers (noise). It does not require specifying the number of clusters beforehand.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data with non-convex shapes
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
# Initialize and fit DBSCAN
# eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
# min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)
# Visualize the clusters
plt.figure(figsize=(8, 6))
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black is used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o',
             markerfacecolor=tuple(col),
             markeredgecolor='k',
             markersize=6)
plt.title('DBSCAN Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
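Because DBSCAN marks noise with the label -1, the number of clusters has to be derived from the fitted labels rather than read from a parameter. A small sketch of that bookkeeping on the same two-moons data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import numpy as np

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Noise points carry the label -1, so exclude it when counting clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")
```

Shrinking `eps` or raising `min_samples` makes the density requirement stricter, so more points end up labeled as noise.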
The best clustering algorithm depends on your data and the goals of your analysis. Consider the following: the expected shape of the clusters (K-Means assumes roughly spherical, similarly sized clusters, while DBSCAN handles arbitrary shapes such as the two moons above), whether the number of clusters is known in advance (required by K-Means and Agglomerative Clustering, but not by DBSCAN), how much noise the data contains (DBSCAN explicitly labels outliers), and the size of the dataset (K-Means scales well to large data, whereas hierarchical clustering can become expensive).
Since clustering is unsupervised, evaluation is more challenging. Common metrics include the silhouette coefficient (how similar each point is to its own cluster versus the nearest other cluster; higher is better), the Davies-Bouldin index (lower is better), and the Calinski-Harabasz index (higher is better), all available in sklearn.metrics. When ground-truth labels happen to exist, the Adjusted Rand Index can compare the clustering against them.
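As a concrete illustration of such a metric, the silhouette score can be computed directly from the data and the predicted labels, with no ground truth required; here it is applied to the K-Means result from earlier:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data and model as the K-Means example above
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

# Silhouette compares each point's mean intra-cluster distance with its
# mean distance to the nearest other cluster; the score ranges from -1
# to 1, with values near 1 indicating well-separated clusters.
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.3f}")
```

Running the same comparison across several candidate values of k is another common way to choose k when the elbow in the inertia curve is ambiguous.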
For more details, refer to the official Scikit-learn clustering documentation.