Feature Engineering in Python for Data Science & Machine Learning
Introduction
Feature engineering is a crucial step in the machine learning pipeline. It involves using domain knowledge of the data to create new features or transform existing ones to improve the performance of machine learning models. This process can significantly impact a model's accuracy, interpretability, and generalization capabilities.
In this guide, we'll explore various techniques for feature engineering using Python's powerful data science libraries, including Pandas, NumPy, and Scikit-learn.
Table of Contents
1. Handling Missing Data
2. Categorical Encoding
3. Numerical Transformations
4. Feature Creation
5. Feature Scaling
6. Advanced Techniques
1. Handling Missing Data
Missing values cause errors or degraded performance in many machine learning algorithms. A common strategy for handling them is imputation.
Imputation
Replacing missing values with a statistical estimate, such as the mean, median, or mode.
import pandas as pd
from sklearn.impute import SimpleImputer
# Load your data
df = pd.read_csv('your_data.csv')
# Impute missing numerical values with the mean
imputer_mean = SimpleImputer(strategy='mean')
df[['numerical_column']] = imputer_mean.fit_transform(df[['numerical_column']])
# Impute missing categorical values with the most frequent
imputer_mode = SimpleImputer(strategy='most_frequent')
df[['categorical_column']] = imputer_mode.fit_transform(df[['categorical_column']])
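The two imputers above can also be bundled into a single preprocessing step with scikit-learn's ColumnTransformer. A minimal sketch on a toy DataFrame (column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data: one numeric and one categorical column, each with a missing entry
df = pd.DataFrame({
    'age': [25.0, np.nan, 35.0, 40.0],
    'city': ['NY', 'LA', np.nan, 'NY'],
})

# Apply a different imputation strategy to each column in one transformer
preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), ['age']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['city']),
])

imputed = preprocessor.fit_transform(df)
print(imputed)
```

This keeps the whole imputation step in one object, which can later be dropped into a Pipeline alongside a model.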
2. Categorical Encoding
Machine learning models typically require numerical input. Categorical features need to be converted into a numerical format.
One-Hot Encoding
Creates binary columns for each category.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Assuming df['categorical_column'] contains strings
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # sparse_output requires scikit-learn >= 1.2 (older versions use sparse=)
encoded_features = encoder.fit_transform(df[['categorical_column']])
encoded_df = pd.DataFrame(encoded_features,
                          columns=encoder.get_feature_names_out(['categorical_column']),
                          index=df.index)  # keep the original index so concat aligns rows
df = pd.concat([df.drop('categorical_column', axis=1), encoded_df], axis=1)
Label Encoding
Assigns a unique integer to each category (in alphabetical order, not by any inherent ranking). Suitable for tree-based models; note that scikit-learn's LabelEncoder is primarily intended for target labels rather than input features.
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Assuming df['ordinal_column'] contains ordered categories
le = LabelEncoder()
df['ordinal_column_encoded'] = le.fit_transform(df['ordinal_column'])
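Because LabelEncoder orders categories alphabetically, it can scramble a genuine ranking. For truly ordinal features, OrdinalEncoder with an explicit category order preserves the intended ordering; a sketch with a made-up size column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Spell out the order so 'small' < 'medium' < 'large' maps to 0 < 1 < 2
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = encoder.fit_transform(df[['size']])
print(df)
```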
3. Numerical Transformations
Transforming numerical features can help models by making distributions more Gaussian-like or reducing skewness.
Log Transformation
Useful for highly skewed data.
import numpy as np
import pandas as pd
# Assuming df['skewed_column'] is non-negative
df['log_skewed'] = np.log1p(df['skewed_column'])  # log1p(x) = log(1 + x), so zeros are handled safely
Box-Cox Transformation
A more general power transformation that includes log transformation as a special case.
from scipy.stats import boxcox
import pandas as pd
# Assuming df['highly_skewed'] has strictly positive values (Box-Cox requires x > 0)
df['boxcox_transformed'], lambda_val = boxcox(df['highly_skewed'])
print(f"Optimal lambda for Box-Cox: {lambda_val}")
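A quick self-contained check on synthetic lognormal data (simulated purely for illustration) shows the transformation pulling skewness toward zero:

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
# Lognormal samples are strictly positive and strongly right-skewed
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox returns the transformed data and the fitted lambda
transformed, lambda_val = boxcox(data)
print(f"skew before: {skew(data):.2f}, after: {skew(transformed):.2f}, lambda: {lambda_val:.2f}")
```

For a true lognormal, the optimal lambda is near 0, where Box-Cox reduces to a plain log transform.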
4. Feature Creation
Creating new features from existing ones can capture more complex relationships.
Polynomial Features
Creates interaction terms and polynomial terms.
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
# Assuming df['feature1'] and df['feature2']
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])
poly_feature_names = poly.get_feature_names_out(['feature1', 'feature2'])
poly_df = pd.DataFrame(poly_features, columns=poly_feature_names, index=df.index)
# The expansion includes the original features, so drop them before concatenating
df = pd.concat([df.drop(columns=['feature1', 'feature2']), poly_df], axis=1)
Interaction Features
Combining features multiplicatively or additively.
import pandas as pd
df['feature1_x_feature2'] = df['feature1'] * df['feature2']
df['feature1_plus_feature2'] = df['feature1'] + df['feature2']
Domain-Specific Features
Creating features based on understanding the data.
Example: For a 'timestamp' column, extract day of week, month, hour, etc.
import pandas as pd
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
df['hour'] = df['timestamp'].dt.hour
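Run on two made-up timestamps, the extraction above yields concrete calendar features:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['2024-01-15 09:30:00', '2024-06-01 18:05:00']})
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0 ... Sunday=6
df['month'] = df['timestamp'].dt.month
df['hour'] = df['timestamp'].dt.hour
print(df)
```

Features like day_of_week often expose weekly seasonality that a raw timestamp hides from the model.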
5. Feature Scaling
Scaling features to a similar range is important for algorithms sensitive to feature magnitudes, like SVMs, k-NN, and gradient descent-based models.
Standardization (Z-score scaling)
Scales data to have a mean of 0 and a standard deviation of 1.
from sklearn.preprocessing import StandardScaler
import pandas as pd
scaler = StandardScaler()
df[['numerical_feature']] = scaler.fit_transform(df[['numerical_feature']])
Min-Max Scaling
Scales data to a fixed range, usually 0 to 1.
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
scaler = MinMaxScaler()
df[['numerical_feature']] = scaler.fit_transform(df[['numerical_feature']])
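The effect of the two scalers can be compared side by side on a tiny made-up column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

standardized = StandardScaler().fit_transform(x)  # mean 0, std 1
minmax = MinMaxScaler().fit_transform(x)          # spans [0, 1]

print(standardized.ravel())
print(minmax.ravel())
```

Standardization is the usual default; min-max scaling is preferable when an algorithm expects bounded inputs, but it is more sensitive to outliers since a single extreme value stretches the range.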
6. Advanced Techniques
Feature Selection
Choosing the most relevant features to reduce dimensionality and improve model performance.
- Filter Methods (e.g., correlation, mutual information)
- Wrapper Methods (e.g., Recursive Feature Elimination - RFE)
- Embedded Methods (e.g., L1 regularization, tree-based feature importances)
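As a concrete illustration of a filter method, SelectKBest can rank features by mutual information with the target. The dataset here is synthetic, generated only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic task: 6 features, of which only 3 are informative
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=42)

# Keep the 3 features with the highest mutual information with y
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```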
Dimensionality Reduction
Techniques like PCA (Principal Component Analysis) can create new, fewer features that capture most of the variance in the data.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
# PCA is sensitive to feature scale, so standardize first (assumes df contains only numerical columns)
df_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=5)  # Keep 5 principal components
reduced_data = pca.fit_transform(df_scaled)
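A self-contained sketch on synthetic data (sizes and the injected correlation are arbitrary) shows how the explained-variance ratio guides the choice of component count:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] = X[:, 1] + 0.1 * rng.normal(size=200)  # make two features strongly correlated

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance retained by the 5 components
print("variance captured:", pca.explained_variance_ratio_.sum())
```

In practice, plotting the cumulative explained_variance_ratio_ and picking the elbow (or a threshold such as 95%) is a common way to choose n_components.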