Introduction to Scikit-learn

Scikit-learn (often abbreviated as sklearn) is a free, open-source machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Scikit-learn is built upon NumPy, SciPy, and Matplotlib. It provides a clean and consistent API that makes it easy to use for both beginners and experienced practitioners.

Key Features

  • Simple and Efficient Tools: Built on NumPy, SciPy, and Matplotlib, offering ease of use and performance.
  • Wide Range of Algorithms: Comprehensive set of supervised and unsupervised learning algorithms.
  • Cross-validation: Tools for model evaluation and parameter selection.
  • Preprocessing: Tools for feature extraction and normalization.
  • Model Selection: Utilities for comparing models and finding optimal parameters.
  • Integrated Documentation: Well-documented API and examples.

Getting Started with Scikit-learn

Installation

You can install scikit-learn using pip:

pip install scikit-learn

Ensure you have NumPy and SciPy installed, as they are dependencies.
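To confirm the installation succeeded, you can import the package and print its version (the `__version__` attribute is available on all scikit-learn releases):

```python
# Quick installation check: import scikit-learn and report its version
import sklearn

print(f"scikit-learn version: {sklearn.__version__}")
```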

Basic Usage

Here's a simple example of using scikit-learn for a linear regression task:

Linear Regression Example


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"Predicted values: {y_pred}")

This code snippet demonstrates the fundamental steps: data preparation, model instantiation, training, prediction, and evaluation.

Core Concepts

  • Estimator API: All scikit-learn objects (models, transformers, etc.) share a consistent interface. Key methods are fit(), predict(), and transform().
  • Data Representation: Data is typically represented as NumPy arrays. Features are columns, and samples are rows.
  • Pipelines: Sequential application of data transformations and model training.
  • Model Selection: Techniques like cross-validation (cross_val_score) and grid search (GridSearchCV) for robust model evaluation and hyperparameter tuning.
  • Data Preprocessing: Tools like StandardScaler for feature scaling and OneHotEncoder for categorical feature encoding.

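The concepts above can be combined in one short sketch: a Pipeline that chains StandardScaler with a classifier, evaluated with cross_val_score and tuned with GridSearchCV. The logistic-regression model and the parameter grid here are illustrative choices, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain preprocessing and model so scaling is re-fit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation of the whole pipeline
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy per fold:", scores)

# Grid search over the classifier's regularization strength;
# "clf__C" addresses the C parameter of the "clf" pipeline step
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print("Best C:", grid.best_params_["clf__C"])
```

Because the scaler lives inside the pipeline, it is re-fit on each cross-validation training fold, avoiding data leakage from the test folds.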
Common Use Cases and Examples

Classification

Scikit-learn offers various classification algorithms like Logistic Regression, Support Vector Machines (SVM), and Random Forests.

Figure: Visualizing decision boundaries for classification.

Example for a Support Vector Classifier:


from sklearn.svm import SVC
from sklearn.datasets import load_iris

# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

# Fit a linear-kernel support vector classifier
model_svc = SVC(kernel='linear')
model_svc.fit(X, y)

# Predict on the training data (for illustration only; use held-out
# data to estimate real-world performance)
predictions = model_svc.predict(X)
print("Sample predictions for Iris dataset:", predictions[:5])
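
The snippet above predicts on the same data it was trained on, which overstates performance. A fairer estimate comes from a held-out split; a minimal sketch using train_test_split and accuracy_score:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = SVC(kernel='linear')
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```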

Clustering

Unsupervised learning algorithms like K-Means are available for grouping data.

Figure: K-Means clustering visualization.

Example for K-Means Clustering:


from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data: 100 points around 3 centers
X_blobs, _ = make_blobs(n_samples=100, centers=3, cluster_std=1.0, random_state=42)

# Fit K-Means with 3 clusters; n_init=10 runs the algorithm 10 times
# from different centroid seeds and keeps the best result
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_blobs)
labels = kmeans.labels_
print("Sample cluster labels:", labels[:10])
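
Because clustering has no ground-truth labels, internal metrics such as silhouette_score (from sklearn.metrics) are commonly used to judge the result; values closer to 1 indicate compact, well-separated clusters. A brief sketch continuing the example above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Re-create the blob data and fit K-Means as above
X_blobs, _ = make_blobs(n_samples=100, centers=3, cluster_std=1.0, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_blobs)

# Silhouette score ranges from -1 to 1; higher is better
score = silhouette_score(X_blobs, kmeans.labels_)
print(f"Silhouette score: {score:.3f}")
```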