# Scaling and Normalization
## Why Scaling Matters
Many machine‑learning algorithms assume that all features are on a comparable scale. Without proper scaling, models such as k‑nearest neighbors, logistic regression, support vector machines, and neural networks can converge slowly or let large‑magnitude features dominate the result, since distances and gradients are driven by the features with the largest raw values.
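To see the effect concretely, here is a minimal sketch (with made‑up income/age values) showing how an unscaled large‑magnitude feature dominates Euclidean distance, and how standardization restores balance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: income (tens of thousands) vs. age (tens) -- made-up values
X = np.array([[50000.0, 25.0],
              [52000.0, 60.0],
              [51000.0, 30.0]])

# Unscaled: the income column dominates the distance; the 35-year age gap
# between rows 0 and 1 barely registers
d_raw = np.linalg.norm(X[0] - X[1])

# After standardization, both features contribute on the same scale
X_std = StandardScaler().fit_transform(X)
d_std = np.linalg.norm(X_std[0] - X_std[1])

print(round(d_raw, 1), round(d_std, 2))
```

The raw distance is essentially just the income gap (about 2000), while the standardized distance reflects both features.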
## Common Scaling Methods
| Method | When to Use | Characteristics |
|---|---|---|
| Standardization (Z‑score) | Normally distributed data | Centers data to mean 0, variance 1 |
| Min‑Max Scaling | Bounded features | Rescales to a fixed range, usually [0,1] |
| Robust Scaling | Data with outliers | Uses median & IQR; less sensitive to outliers |
| MaxAbs Scaling | Sparse data | Preserves sparsity, scales to [-1,1] |
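As a quick illustration of the characteristics above, here is a sketch applying all four scalers to the same toy column, which includes a deliberate outlier:

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

# Toy column with a deliberate outlier (100.0) -- made-up values
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scalers = {
    "standard": StandardScaler(),  # mean 0, variance 1
    "minmax": MinMaxScaler(),      # squeezed into [0, 1]
    "robust": RobustScaler(),      # median maps to 0, IQR to 1
    "maxabs": MaxAbsScaler(),      # divided by max |x|, stays in [-1, 1]
}
scaled = {name: s.fit_transform(x).ravel() for name, s in scalers.items()}

for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 2)}")
```

Note how the outlier compresses the min‑max output (the first four values land near 0), while robust scaling keeps them spread around the median.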
### Standardization (Z‑score)
Transforms each feature \(x\) to \((x-\mu)/\sigma\), where \(\mu\) is the mean and \(\sigma\) the standard deviation.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
### Min‑Max Scaling
Rescales each feature to a given range \([a, b]\), commonly \([0, 1]\).
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
```
### Robust Scaling
Uses the median and the inter‑quartile range (IQR) to scale the data, making it robust to outliers.
```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```
### MaxAbs Scaling
Scales each feature by its maximum absolute value, preserving zero entries and sparsity.
```python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
```
## Practical Tips
- Fit scalers on the training set only; apply the same transformation to validation/test sets.
- When using pipelines, place scaling steps before model training.
- Inspect feature distributions after scaling to ensure expected behavior.
- For tree‑based models (e.g., Random Forest, XGBoost), scaling is often unnecessary, because tree splits depend only on feature orderings, not magnitudes.
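The first tip above — fit on the training set only — can be sketched as follows, using synthetic (made‑up) data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data to illustrate the fit-on-train-only rule
rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# The training set is exactly centered; the test set is close but not exact,
# which is expected: it was scaled with the training set's parameters.
print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

Calling `fit_transform` on the test set instead would leak test statistics into preprocessing and make evaluation optimistic.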
## Full Example (Python)

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Build pipeline: scaling runs inside the pipeline, so it is
# fit on the training data only
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

# Train
pipeline.fit(X_train, y_train)

# Predict & evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
## References

- scikit‑learn Documentation – Preprocessing
- Hands‑On Machine Learning with Scikit‑Learn, Keras & TensorFlow – Aurélien Géron
- Feature Engineering for Machine Learning – Alice Zheng & Amanda Casari