Feature Engineering in Python for Data Science & Machine Learning
Introduction
Feature engineering is a crucial step in the machine learning pipeline. It involves using domain knowledge of the data to create new features or transform existing ones to improve the performance of machine learning models. This process can significantly impact a model's accuracy, interpretability, and generalization capabilities.
In this guide, we'll explore various techniques for feature engineering using Python's powerful data science libraries, including Pandas, NumPy, and Scikit-learn.
Table of Contents
1. Handling Missing Data
2. Categorical Encoding
3. Numerical Transformations
4. Feature Creation
5. Feature Scaling
6. Advanced Techniques
1. Handling Missing Data
Missing values cause errors or degraded performance in many machine learning algorithms. A common strategy for handling them is imputation.
Imputation
Replacing missing values with a statistical estimate, such as the mean, median, or mode.
import pandas as pd
from sklearn.impute import SimpleImputer
# Load your data
df = pd.read_csv('your_data.csv')
# Impute missing numerical values with the mean
imputer_mean = SimpleImputer(strategy='mean')
df[['numerical_column']] = imputer_mean.fit_transform(df[['numerical_column']])
# Impute missing categorical values with the most frequent
imputer_mode = SimpleImputer(strategy='most_frequent')
df[['categorical_column']] = imputer_mode.fit_transform(df[['categorical_column']])
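The two imputers above can also be bundled into a single preprocessing step with scikit-learn's ColumnTransformer. A minimal sketch on a toy DataFrame (column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data: one numeric and one categorical column, each with a missing entry
df = pd.DataFrame({
    'age': [25.0, np.nan, 35.0, 40.0],
    'city': ['NY', 'LA', np.nan, 'NY'],
})

# Apply a different imputation strategy to each column in one transformer
preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), ['age']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['city']),
])

imputed = preprocessor.fit_transform(df)
print(imputed)
```

This keeps the whole imputation step in one object, which can later be dropped into a Pipeline alongside a model.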
2. Categorical Encoding
Machine learning models typically require numerical input. Categorical features need to be converted into a numerical format.
One-Hot Encoding
Creates binary columns for each category.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Assuming df['categorical_column'] contains strings
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # sparse_output requires scikit-learn >= 1.2 (older versions use sparse=)
encoded_features = encoder.fit_transform(df[['categorical_column']])
encoded_df = pd.DataFrame(encoded_features,
                          columns=encoder.get_feature_names_out(['categorical_column']),
                          index=df.index)  # keep the original index so concat aligns rows
df = pd.concat([df.drop('categorical_column', axis=1), encoded_df], axis=1)
Label Encoding
Assigns a unique integer to each category (in alphabetical order, not by any inherent ranking). Suitable for tree-based models; note that scikit-learn's LabelEncoder is primarily intended for target labels rather than input features.
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Assuming df['ordinal_column'] contains ordered categories
le = LabelEncoder()
df['ordinal_column_encoded'] = le.fit_transform(df['ordinal_column'])
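Because LabelEncoder orders categories alphabetically, it can scramble a genuine ranking. For truly ordinal features, OrdinalEncoder with an explicit category order preserves the intended ordering; a sketch with a made-up size column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Spell out the order so 'small' < 'medium' < 'large' maps to 0 < 1 < 2
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = encoder.fit_transform(df[['size']])
print(df)
```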
3. Numerical Transformations
Transforming numerical features can help models by making distributions more Gaussian-like or reducing skewness.
Log Transformation
Useful for highly skewed data.
import numpy as np
import pandas as pd
# Assuming df['skewed_column'] is non-negative
df['log_skewed'] = np.log1p(df['skewed_column'])  # log1p(x) = log(1 + x), so zeros are handled safely
Box-Cox Transformation
A more general power transformation that includes log transformation as a special case.
from scipy.stats import boxcox
import pandas as pd
# Assuming df['highly_skewed'] has strictly positive values (Box-Cox requires x > 0)
df['boxcox_transformed'], lambda_val = boxcox(df['highly_skewed'])
print(f"Optimal lambda for Box-Cox: {lambda_val}")
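A quick self-contained check on synthetic lognormal data (simulated purely for illustration) shows the transformation pulling skewness toward zero:

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
# Lognormal samples are strictly positive and strongly right-skewed
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox returns the transformed data and the fitted lambda
transformed, lambda_val = boxcox(data)
print(f"skew before: {skew(data):.2f}, after: {skew(transformed):.2f}, lambda: {lambda_val:.2f}")
```

For a true lognormal, the optimal lambda is near 0, where Box-Cox reduces to a plain log transform.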
4. Feature Creation
Creating new features from existing ones can capture more complex relationships.
Polynomial Features
Creates interaction terms and polynomial terms.
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
# Assuming df['feature1'] and df['feature2']
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])
poly_feature_names = poly.get_feature_names_out(['feature1', 'feature2'])
poly_df = pd.DataFrame(poly_features, columns=poly_feature_names, index=df.index)
# The expansion includes the original features, so drop them before concatenating
df = pd.concat([df.drop(columns=['feature1', 'feature2']), poly_df], axis=1)
Interaction Features
Combining features multiplicatively or additively.
import pandas as pd
df['feature1_x_feature2'] = df['feature1'] * df['feature2']
df['feature1_plus_feature2'] = df['feature1'] + df['feature2']
Domain-Specific Features
Creating features based on understanding the data.
Example: For a 'timestamp' column, extract day of week, month, hour, etc.
import pandas as pd
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
df['hour'] = df['timestamp'].dt.hour
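Run on two made-up timestamps, the extraction above yields concrete calendar features:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['2024-01-15 09:30:00', '2024-06-01 18:05:00']})
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0 ... Sunday=6
df['month'] = df['timestamp'].dt.month
df['hour'] = df['timestamp'].dt.hour
print(df)
```

Features like day_of_week often expose weekly seasonality that a raw timestamp hides from the model.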
5. Feature Scaling
Scaling features to a similar range is important for algorithms sensitive to feature magnitudes, like SVMs, k-NN, and gradient descent-based models.
Standardization (Z-score scaling)
Scales data to have a mean of 0 and a standard deviation of 1.
from sklearn.preprocessing import StandardScaler
import pandas as pd
scaler = StandardScaler()
df[['numerical_feature']] = scaler.fit_transform(df[['numerical_feature']])
Min-Max Scaling
Scales data to a fixed range, usually 0 to 1.
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
scaler = MinMaxScaler()
df[['numerical_feature']] = scaler.fit_transform(df[['numerical_feature']])
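The effect of the two scalers can be compared side by side on a tiny made-up column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

standardized = StandardScaler().fit_transform(x)  # mean 0, std 1
minmax = MinMaxScaler().fit_transform(x)          # spans [0, 1]

print(standardized.ravel())
print(minmax.ravel())
```

Standardization is the usual default; min-max scaling is preferable when an algorithm expects bounded inputs, but it is more sensitive to outliers since a single extreme value stretches the range.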
6. Advanced Techniques
Feature Selection
Choosing the most relevant features to reduce dimensionality and improve model performance.
- Filter Methods (e.g., correlation, mutual information)
- Wrapper Methods (e.g., Recursive Feature Elimination - RFE)
- Embedded Methods (e.g., L1 regularization, tree-based feature importances)
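As a concrete illustration of a filter method, SelectKBest can rank features by mutual information with the target. The dataset here is synthetic, generated only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic task: 6 features, of which only 3 are informative
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=42)

# Keep the 3 features with the highest mutual information with y
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```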
Dimensionality Reduction
Techniques like PCA (Principal Component Analysis) can create new, fewer features that capture most of the variance in the data.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
# PCA is sensitive to feature scale, so standardize first (assumes df contains only numerical columns)
df_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=5)  # Keep 5 principal components
reduced_data = pca.fit_transform(df_scaled)
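A self-contained sketch on synthetic data (sizes and the injected correlation are arbitrary) shows how the explained-variance ratio guides the choice of component count:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] = X[:, 1] + 0.1 * rng.normal(size=200)  # make two features strongly correlated

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance retained by the 5 components
print("variance captured:", pca.explained_variance_ratio_.sum())
```

In practice, plotting the cumulative explained_variance_ratio_ and picking the elbow (or a threshold such as 95%) is a common way to choose n_components.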