What is Regression?
Regression is a supervised machine learning technique used to predict a continuous numerical value. Unlike classification, which predicts discrete categories, regression aims to find the relationship between one or more independent variables (features) and a dependent variable (the target value). This relationship is typically modeled using a line or curve.
The goal is to build a model that can accurately estimate the target variable for new, unseen data based on the patterns learned from historical data.
Key Concepts
- Independent Variables (Features): The input variables used to make predictions.
- Dependent Variable (Target): The continuous numerical output we want to predict.
- Model: The mathematical equation or algorithm that describes the relationship between independent and dependent variables.
- Coefficients: Parameters within the model that determine the strength and direction of the relationship between features and the target.
- Loss Function (Cost Function): Measures how well the model's predictions match the actual values. Common examples include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
- Optimization Algorithm: Used to minimize the loss function, thereby finding the best model parameters (see the sketch below).
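To make the last two concepts concrete, here is a minimal sketch that fits a simple linear model by gradient descent on the MSE loss. The data, learning rate, and iteration count are all illustrative assumptions, not recommendations.
import numpy as np
# Toy data lying exactly on y = 1 + 2x
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
b0, b1 = 0.0, 0.0  # initial coefficients
lr = 0.01          # learning rate (step size)
for _ in range(5000):
    error = (b0 + b1 * X) - y
    # Gradients of MSE with respect to b0 and b1
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * X).mean()
print(b0, b1)  # converges toward the true values 1.0 and 2.0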
Types of Regression
Linear Regression
Assumes a linear relationship between independent and dependent variables.
- Simple Linear Regression (one feature)
- Multiple Linear Regression (multiple features)
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
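As a brief illustration, the coefficients of a multiple linear regression can be estimated in closed form by ordinary least squares. The sketch below uses NumPy's np.linalg.lstsq on made-up two-feature data.
import numpy as np
# Hypothetical data generated from y = 1 + 2·x₁ + 3·x₂ (no noise)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0], [5.0, 4.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
# Prepend a column of ones so the first fitted coefficient acts as β₀
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # ≈ [1. 2. 3.]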
Polynomial Regression
Models the relationship as an n-th degree polynomial.
- Useful for capturing non-linear patterns.
y = β₀ + β₁x + β₂x² + ... + βₙxⁿ + ε
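In practice this is often implemented by expanding the input into polynomial features and fitting an ordinary linear model on them. A sketch using scikit-learn's PolynomialFeatures on made-up quadratic data:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Hypothetical data generated from y = 2 + 3x + 0.5x²
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 2 + 3 * X.ravel() + 0.5 * X.ravel() ** 2
# Expand x into [x, x²], then fit a plain linear model on the expanded features
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # ≈ 2.0 [3.  0.5]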
Ridge Regression
A regularized form of linear regression that adds an L2 penalty term.
- Helps prevent overfitting by shrinking coefficients.
Minimize [ Σ(yᵢ - ŷᵢ)² + λ Σβⱼ² ]
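A short sketch of the shrinkage effect using scikit-learn's Ridge, where the alpha parameter plays the role of λ above; the data is synthetic and purely illustrative:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
# Hypothetical noisy data with five features, only three of which matter
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=30)
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)
print(np.linalg.norm(ols.coef_))    # coefficient norm without the penalty
print(np.linalg.norm(ridge.coef_))  # smaller: the L2 penalty shrinks coefficients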
Lasso Regression
A regularized form of linear regression that adds an L1 penalty term.
- Can perform feature selection by driving some coefficients to zero.
Minimize [ Σ(yᵢ - ŷᵢ)² + λ Σ|βⱼ| ]
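A sketch of the feature-selection effect using scikit-learn's Lasso (alpha again plays the role of λ), on made-up data where only the first of three features matters:
import numpy as np
from sklearn.linear_model import Lasso
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=50)  # features 2 and 3 are irrelevant
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # coefficients for the irrelevant features are driven to (near) zero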
Support Vector Regression (SVR)
Extends Support Vector Machines to regression tasks.
- Aims to fit a function so that prediction errors fall within a margin (epsilon); errors inside that tube incur no penalty.
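A minimal SVR sketch with scikit-learn; the kernel, epsilon, and C values are illustrative choices, not tuned recommendations:
import numpy as np
from sklearn.svm import SVR
# Hypothetical data on the line y = 2x + 1
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + 1
# Errors within ±0.5 of the fitted function incur no penalty
svr = SVR(kernel='linear', epsilon=0.5, C=10.0).fit(X, y)
print(svr.predict([[5.0]]))  # ≈ 11, within the epsilon tube around the true line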
Decision Tree Regression
Uses a tree-like structure to make predictions.
- Splits data based on feature values.
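A short sketch with scikit-learn's DecisionTreeRegressor on made-up sine-curve data; the depth limit is an arbitrary illustrative choice:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X.ravel())
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # limit depth to curb overfitting
print(tree.predict([[np.pi / 2]]))  # piecewise-constant estimate near the sine peak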
Random Forest Regression
An ensemble method that builds multiple decision trees.
- Averages predictions from individual trees to improve accuracy and reduce variance.
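The same hypothetical sine data fitted with scikit-learn's RandomForestRegressor, to show the averaging idea:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X.ravel())
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[np.pi / 2]]))  # ≈ 1.0, averaged across 100 randomized trees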
Applications of Regression
- Predicting house prices: Based on features like size, location, and number of bedrooms.
- Forecasting sales: Using historical sales data, marketing spend, and economic indicators.
- Estimating stock prices: Analyzing market trends, company performance, and economic news.
- Weather forecasting: Predicting temperature, rainfall, and wind speed.
- Medical diagnosis: Estimating patient risk factors or disease progression.
Evaluation Metrics
Evaluating regression models is crucial for understanding how well they perform. Common metrics include:
- Mean Squared Error (MSE): Average of the squared differences between predicted and actual values. Punishes larger errors more heavily.
- Root Mean Squared Error (RMSE): The square root of MSE. It's in the same units as the target variable, making it easier to interpret.
- Mean Absolute Error (MAE): Average of the absolute differences between predicted and actual values. Less sensitive to outliers than MSE.
- R-squared (Coefficient of Determination): Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. Typically ranges from 0 to 1, where 1 indicates a perfect fit; it can be negative when a model fits worse than simply predicting the mean.
MSE = (1/n) Σ(yᵢ - ŷᵢ)²
RMSE = √MSE
MAE = (1/n) Σ|yᵢ - ŷᵢ|
R² = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²)
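To see these formulas in action, here is a small sketch computing each metric with scikit-learn and NumPy on made-up predictions (RMSE is taken as the square root of MSE):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # illustrative actual values
y_pred = np.array([2.5, 5.0, 7.5, 10.0])  # illustrative predictions
mse = mean_squared_error(y_true, y_pred)
print(mse)                                  # 0.375
print(np.sqrt(mse))                         # RMSE ≈ 0.612
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(r2_score(y_true, y_pred))             # 0.925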
Example: Simple Linear Regression in Python (Conceptual)
Here's a conceptual look at how you might implement and use a simple linear regression model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Sample Data (replace with your actual data)
# X: Independent variable (e.g., years of experience)
# y: Dependent variable (e.g., salary)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Linear Regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Intercept (β₀): {model.intercept_}")
print(f"Coefficient (β₁): {model.coef_[0]}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")
# Predict a new value
new_experience = np.array([[11]])
predicted_salary = model.predict(new_experience)
print(f"Predicted salary for 11 years experience: ${predicted_salary[0]:,.2f}")
# Expected Output (the sample data lies exactly on a line, so the fit is perfect; real data would give a nonzero MSE and an R² below 1):
# Intercept (β₀): 25000.0
# Coefficient (β₁): 5000.0
# Mean Squared Error (MSE): 0.0
# R-squared (R²): 1.0
# Predicted salary for 11 years experience: $80,000.00