Feature Engineering Guide

Scaling and Normalization

Table of Contents

- Why Scaling Matters
- Common Scaling Methods
- Standardization (Z-score)
- Min-Max Scaling
- Robust Scaling
- MaxAbs Scaling
- Practical Tips
- Full Example (Python)

Why Scaling Matters

Many machine‑learning algorithms assume that all features are on a comparable scale. Without proper scaling, models such as k‑nearest neighbors, logistic regression, support vector machines, and neural networks can converge slowly or be dominated by the features with the largest numeric ranges.
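A small numeric sketch of the problem, using two hypothetical features (income in the tens of thousands, age in the tens): before scaling, Euclidean distance is driven almost entirely by the income axis; after z-scoring with illustrative per-feature statistics, both dimensions contribute comparably.

```python
import numpy as np

# Two hypothetical samples: [income, age]. Values are illustrative only.
a = np.array([50_000.0, 25.0])
b = np.array([51_000.0, 60.0])

# Raw Euclidean distance is dominated by the 1,000-unit income gap;
# the 35-year age gap barely registers.
raw_dist = np.linalg.norm(a - b)

# After z-scoring each feature (illustrative per-feature mean and std),
# both dimensions contribute on the same order of magnitude.
mean = np.array([50_500.0, 42.5])
std = np.array([500.0, 17.5])
scaled_dist = np.linalg.norm((a - mean) / std - (b - mean) / std)

print(raw_dist)     # ~1000.6, almost entirely the income axis
print(scaled_dist)  # ~2.83, both axes contribute equally
```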

Common Scaling Methods

| Method | When to Use | Characteristics |
|---|---|---|
| Standardization (Z-score) | Normally distributed data | Centers data to mean 0, variance 1 |
| Min-Max Scaling | Bounded features | Rescales to a fixed range, usually [0, 1] |
| Robust Scaling | Data with outliers | Uses median & IQR; less sensitive to outliers |
| MaxAbs Scaling | Sparse data | Preserves sparsity; scales to [-1, 1] |

Standardization (Z‑score)

Transforms each feature \(x\) to \((x-\mu)/\sigma\), where \(\mu\) is the mean and \(\sigma\) the standard deviation.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
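The transform can be verified by hand with NumPy: applying \((x-\mu)/\sigma\) column-wise (note that `StandardScaler` uses the population standard deviation, `ddof=0`) yields columns with mean 0 and standard deviation 1. The toy matrix below is for illustration only.

```python
import numpy as np

# Manual z-score on a toy matrix, matching StandardScaler's defaults.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)          # population std (ddof=0), as in StandardScaler
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))   # ~[0, 0]
print(X_scaled.std(axis=0))    # [1, 1]
```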

Min‑Max Scaling

Rescales each feature to a given range \([a, b]\), commonly \([0, 1]\).

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
X_scaled = scaler.fit_transform(X)
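For the default range \([0, 1]\), the underlying formula is \((x - x_{\min}) / (x_{\max} - x_{\min})\), which maps the column minimum to 0 and the maximum to 1. A minimal sketch with made-up values:

```python
import numpy as np

# Manual min-max scaling on a single toy column.
X = np.array([[10.0], [15.0], [20.0]])
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - x_min) / (x_max - x_min)   # min -> 0, max -> 1

print(X_scaled.ravel())  # [0.  0.5 1. ]
```

Note that because the transform depends on the observed min and max, a single extreme value compresses all other samples into a narrow band.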

Robust Scaling

Uses the median and the inter‑quartile range (IQR) to scale the data, making it robust to outliers.

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
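The robustness claim can be sketched numerically: in the toy vector below, one extreme outlier barely moves the median and IQR, so the four inlier values land on a sensible scale while the outlier stays visibly extreme. The 25th/75th percentiles used here match `RobustScaler`'s default quantile range.

```python
import numpy as np

# Toy data with one extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)                 # 3.0 (unaffected by the outlier)
q1, q3 = np.percentile(x, [25, 75])   # 2.0, 4.0
x_scaled = (x - median) / (q3 - q1)

print(x_scaled)  # [-1.  -0.5  0.   0.5  48.5]
```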

MaxAbs Scaling

Scales each feature by its maximum absolute value, preserving zero entries and sparsity.

from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
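Because each column is simply divided by its maximum absolute value, zero entries remain exactly zero, which is why the transform preserves sparsity. A dense toy illustration (the same logic applies to SciPy sparse matrices):

```python
import numpy as np

# MaxAbs: divide each column by its max absolute value.
# Zeros map to zeros, so a sparse matrix keeps its sparsity pattern.
X = np.array([[ 0.0, -4.0],
              [ 2.0,  0.0],
              [-1.0,  8.0]])
X_scaled = X / np.abs(X).max(axis=0)

print(X_scaled)  # all values in [-1, 1], zeros unchanged
```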

Practical Tips

- Fit the scaler on the training data only, then apply the same fitted transform to validation and test data; fitting on the full dataset leaks test-set statistics into training.
- Wrap the scaler and model in a scikit-learn Pipeline so that cross-validation refits the scaler inside each fold automatically.
- Tree-based models (decision trees, random forests, gradient boosting) are invariant to monotonic feature scaling and generally do not require it.
- Choose the scaler to match the data: robust scaling for outlier-heavy features, MaxAbs for sparse inputs.

Full Example (Python)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Train‑test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

# Train
pipeline.fit(X_train, y_train)

# Predict & evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
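Because the scaler lives inside the pipeline, it is easy to compare scaling choices fairly under cross-validation. The sketch below uses `make_classification` as a stand-in for the hypothetical `data.csv`; the scores themselves are illustrative, not a claim about which scaler is best in general.

```python
# Comparing scalers via 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

for name, scaler in [('standard', StandardScaler()),
                     ('minmax', MinMaxScaler()),
                     ('robust', RobustScaler())]:
    pipe = Pipeline([('scaler', scaler),
                     ('clf', LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f'{name}: mean accuracy {scores.mean():.3f}')
```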
