# Scaling and Normalization
## Why Scaling Matters
Many machine‑learning algorithms assume that all features are on a comparable scale. Without proper scaling, models such as k‑nearest neighbors, logistic regression, support vector machines, and neural networks can converge slowly or let large‑magnitude features dominate the result, since distances and gradients are driven by the features with the largest raw values.
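To see the effect concretely, here is a minimal sketch (with made‑up income/age values) showing how an unscaled large‑magnitude feature dominates Euclidean distance, and how standardization restores balance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: income (tens of thousands) vs. age (tens) -- made-up values
X = np.array([[50000.0, 25.0],
              [52000.0, 60.0],
              [51000.0, 30.0]])

# Unscaled: the income column dominates the distance; the 35-year age gap
# between rows 0 and 1 barely registers
d_raw = np.linalg.norm(X[0] - X[1])

# After standardization, both features contribute on the same scale
X_std = StandardScaler().fit_transform(X)
d_std = np.linalg.norm(X_std[0] - X_std[1])

print(round(d_raw, 1), round(d_std, 2))
```

The raw distance is essentially just the income gap (about 2000), while the standardized distance reflects both features.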
## Common Scaling Methods
| Method | When to Use | Characteristics |
|---|---|---|
| Standardization (Z‑score) | Normally distributed data | Centers data to mean 0, variance 1 |
| Min‑Max Scaling | Bounded features | Rescales to a fixed range, usually [0,1] |
| Robust Scaling | Data with outliers | Uses median & IQR; less sensitive to outliers |
| MaxAbs Scaling | Sparse data | Preserves sparsity, scales to [-1,1] |
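As a quick illustration of the characteristics above, here is a sketch applying all four scalers to the same toy column, which includes a deliberate outlier:

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

# Toy column with a deliberate outlier (100.0) -- made-up values
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scalers = {
    "standard": StandardScaler(),  # mean 0, variance 1
    "minmax": MinMaxScaler(),      # squeezed into [0, 1]
    "robust": RobustScaler(),      # median maps to 0, IQR to 1
    "maxabs": MaxAbsScaler(),      # divided by max |x|, stays in [-1, 1]
}
scaled = {name: s.fit_transform(x).ravel() for name, s in scalers.items()}

for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 2)}")
```

Note how the outlier compresses the min‑max output (the first four values land near 0), while robust scaling keeps them spread around the median.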
### Standardization (Z‑score)
Transforms each feature \(x\) to \((x-\mu)/\sigma\), where \(\mu\) is the mean and \(\sigma\) the standard deviation.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
### Min‑Max Scaling
Rescales each feature to a given range \([a, b]\), commonly \([0, 1]\).
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
```
### Robust Scaling
Uses the median and the inter‑quartile range (IQR) to scale the data, making it robust to outliers.
```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```
### MaxAbs Scaling
Scales each feature by its maximum absolute value, preserving zero entries and sparsity.
```python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
```
## Practical Tips
- Fit scalers on the training set only; apply the same transformation to validation/test sets.
- When using pipelines, place scaling steps before model training.
- Inspect feature distributions after scaling to ensure expected behavior.
- For tree‑based models (e.g., Random Forest, XGBoost), scaling is often unnecessary, because tree splits depend only on feature orderings, not magnitudes.
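The first tip above — fit on the training set only — can be sketched as follows, using synthetic (made‑up) data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data to illustrate the fit-on-train-only rule
rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# The training set is exactly centered; the test set is close but not exact,
# which is expected: it was scaled with the training set's parameters.
print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

Calling `fit_transform` on the test set instead would leak test statistics into preprocessing and make evaluation optimistic.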
## Full Example (Python)

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Build pipeline: scaling runs inside the pipeline, so it is
# fit on the training data only
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

# Train
pipeline.fit(X_train, y_train)

# Predict & evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
## References

- scikit‑learn Documentation – Preprocessing
- Hands‑On Machine Learning with Scikit‑Learn, Keras & TensorFlow – Aurélien Géron
- Feature Engineering for Machine Learning – Alice Zheng & Amanda Casari