Data Preprocessing - Missing Value Imputation

Introduction

Missing values are a common problem in datasets, and handling them appropriately is crucial for accurate analysis and modeling. This guide describes the types of missing data, walks through several common imputation techniques, and lists what to consider when choosing among them.

Types of Missing Data

It's essential to understand why data is missing before choosing an imputation method.

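Missing Completely at Random (MCAR): Missingness is unrelated to any observed or unobserved variables.
Missing at Random (MAR): Missingness depends only on observed variables, not on the missing values themselves.
Missing Not at Random (MNAR): Missingness depends on the unobserved values themselves or on other unobserved variables. This is the most difficult type to handle, because the missingness mechanism cannot be verified from the observed data alone.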
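
To make the three mechanisms concrete, here is a minimal simulation sketch. The column names (age, income), the sample size, and the missingness probabilities are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: 'income' receives missing values, 'age' stays fully observed.
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of being missing depends only on the observed 'age' column.
mar_mask = rng.random(n) < np.where(df["age"] > 50, 0.4, 0.1)

# MNAR: the chance of being missing depends on the income value itself.
mnar_mask = rng.random(n) < np.where(df["income"] > 60_000, 0.5, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = df.loc[~mask, "income"].mean()
    print(f"{name}: {mask.mean():.1%} missing, observed mean income = {observed_mean:.0f}")
```

In this setup the MNAR mask removes high incomes more often, so the mean of the remaining observed incomes should fall below the mean of the full column.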

Imputation Techniques

Here are several common techniques for imputing missing values:

1. Mean/Median Imputation


import numpy as np

data = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Impute each NaN with the mean of its column
col_means = np.nanmean(data, axis=0)
mean_imputed = np.where(np.isnan(data), col_means, data)

# Impute each NaN with the median of its column
col_medians = np.nanmedian(data, axis=0)
median_imputed = np.where(np.isnan(data), col_medians, data)
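
As an alternative sketch, the same column-wise mean and median imputation can be done with scikit-learn's SimpleImputer. This assumes scikit-learn is installed. The variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Replace each NaN with the mean of its column.
mean_imputer = SimpleImputer(strategy="mean")
mean_filled = mean_imputer.fit_transform(data)

# Replace each NaN with the median of its column.
median_imputer = SimpleImputer(strategy="median")
median_filled = median_imputer.fit_transform(data)

print(mean_filled)
print(mean_imputer.statistics_)  # per-column fill values learned by fit
```

A fitted imputer keeps its per-column fill values in statistics_, so the same values can be reused on new data by calling transform.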

2. Constant Value Imputation


import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})

data_filled = data.fillna(0)  # Impute with 0

print(data_filled)

3. Regression Imputation


import pandas as pd
from sklearn.linear_model import LinearRegression

# Assuming you have 'df' with column 'A' (contains missing values)
# and column 'B' (fully observed, used as the predictor)
missing_indices = df['A'].isnull()

# Fit only on rows where 'A' is observed; the target must not be a feature
model = LinearRegression()
model.fit(df.loc[~missing_indices, ['B']], df.loc[~missing_indices, 'A'])

# Impute missing values in 'A' using the model
if missing_indices.any():
    predicted_values = model.predict(df.loc[missing_indices, ['B']])
    df.loc[missing_indices, 'A'] = predicted_values

print(df)
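
Since the snippet above assumes an existing df, here is a self-contained sketch of regression imputation on a small hypothetical data frame. The values are made up for illustration, with A roughly equal to 2 * B.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'B' is fully observed, 'A' has two gaps.
df = pd.DataFrame({
    "A": [2.0, 4.1, np.nan, 8.2, np.nan, 11.9],
    "B": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

missing = df["A"].isna()

# Fit on the rows where 'A' is observed, using 'B' as the only predictor.
model = LinearRegression()
model.fit(df.loc[~missing, ["B"]], df.loc[~missing, "A"])

# Fill the gaps in 'A' with the model's predictions.
df.loc[missing, "A"] = model.predict(df.loc[missing, ["B"]])

print(df)
```

Note that the imputed values lie exactly on the fitted line, so this approach tends to understate the variability of the imputed column.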

Considerations

When choosing an imputation method, consider the following:

The amount of missing data.
The type of missing data (MCAR, MAR, MNAR).
The potential impact of the imputation on your analysis.
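
A quick way to check the first and last points is to measure how much data is missing per column and compare summary statistics before and after imputation. The sketch below uses a small hypothetical data frame and mean imputation as the example method.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with gaps in both columns.
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 100.0],
    "B": [4.0, np.nan, 6.0, np.nan, 8.0],
})

# Fraction of missing values per column.
missing_fraction = df.isna().mean()
print(missing_fraction)

# Compare the standard deviation before and after mean imputation.
mean_filled = df.fillna(df.mean())
comparison = pd.DataFrame({
    "std_before": df.std(),
    "std_after": mean_filled.std(),
})
print(comparison)
```

Mean imputation leaves the column mean unchanged but shrinks the standard deviation, and the shrinkage grows with the fraction of missing values.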