Machine Learning Fundamentals

Regression Analysis

What is Regression?

Regression is a supervised machine learning technique used to predict a continuous numerical value. Unlike classification, which predicts discrete categories, regression aims to find the relationship between one or more independent variables (features) and a dependent variable (the target value). This relationship is typically modeled using a line or curve.

The goal is to build a model that can accurately estimate the target variable for new, unseen data based on the patterns learned from historical data.

Key Concepts

  • Independent Variables (Features): The input variables used to make predictions.
  • Dependent Variable (Target): The continuous numerical output we want to predict.
  • Model: The mathematical equation or algorithm that describes the relationship between independent and dependent variables.
  • Coefficients: Parameters within the model that determine the strength and direction of the relationship between features and the target.
  • Loss Function (Cost Function): Measures how well the model's predictions match the actual values. Common examples include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
  • Optimization Algorithm: Used to minimize the loss function, thereby finding the best model parameters (a minimal sketch follows this list).
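
To make the loss function and optimization step concrete, here is a minimal sketch of gradient descent minimizing an MSE loss for a one-feature linear model. The toy data, learning rate, and iteration count are illustrative assumptions, not prescriptions.

import numpy as np

# Toy data: y is roughly 2x + 1 (illustrative assumption)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Model parameters (coefficients) to be learned
b0, b1 = 0.0, 0.0   # intercept and slope
lr = 0.01           # learning rate (assumed)

for _ in range(5000):
    y_hat = b0 + b1 * x   # model predictions
    error = y_hat - y
    # Gradients of MSE = mean(error²) with respect to each parameter
    grad_b0 = 2 * np.mean(error)
    grad_b1 = 2 * np.mean(error * x)
    # Optimization step: move the parameters against the gradient
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(f"Learned intercept: {b0:.2f}, slope: {b1:.2f}")
print(f"Final MSE: {np.mean((b0 + b1 * x - y) ** 2):.4f}")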

Types of Regression

Linear Regression

Assumes a linear relationship between independent and dependent variables.

  • Simple Linear Regression (one feature)
  • Multiple Linear Regression (multiple features)

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
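
As a minimal sketch, the coefficients of this model can be estimated with ordinary least squares using NumPy; the two-feature data below, which roughly follows y = 3 + 2x₁ - x₂, is an illustrative assumption.

import numpy as np

# Two features per row (x₁, x₂)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.0, 6.1, 5.0, 7.9, 8.1])

# Prepend a column of ones so the intercept β₀ is estimated alongside β₁, β₂
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||X_design·β - y||²
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(f"β₀ = {beta[0]:.2f}, β₁ = {beta[1]:.2f}, β₂ = {beta[2]:.2f}")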

Polynomial Regression

Models the relationship as an n-th degree polynomial.

  • Useful for capturing non-linear patterns.

y = β₀ + β₁x + β₂x² + ... + βₙxⁿ + ε
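
A minimal sketch using scikit-learn, which handles polynomial regression by expanding the features and fitting an ordinary linear model on them; the noisy parabola data and the degree are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of y = x² (assumed), a pattern a straight line cannot capture
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.default_rng(0).normal(0, 0.5, 20)

# Expand x into [x, x²], then fit a linear model on the expanded features
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print(model.predict(np.array([[2.0]])))  # close to 4.0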

Ridge Regression

A regularized form of linear regression that adds an L2 penalty term.

  • Helps prevent overfitting by shrinking coefficients.

Minimize [ Σ(yᵢ - ŷᵢ)² + λ Σβⱼ² ]
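
A minimal sketch using scikit-learn's Ridge, whose alpha parameter plays the role of λ above; the near-collinear data is an illustrative assumption.

import numpy as np
from sklearn.linear_model import Ridge

# Two nearly identical features: without regularization, ordinary least
# squares can assign them huge coefficients of opposite sign
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + rng.normal(0, 0.01, 50)])
y = 3 * x1 + rng.normal(0, 0.1, 50)

# The L2 penalty (alpha = λ) shrinks the coefficients toward zero
model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_)  # two moderate coefficients that together sum to roughly 3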

Lasso Regression

A regularized form of linear regression that adds an L1 penalty term.

  • Can perform feature selection by driving some coefficients to zero.

Minimize [ Σ(yᵢ - ŷᵢ)² + λ Σ|βⱼ| ]
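
A minimal sketch using scikit-learn's Lasso, again with alpha in the role of λ; the data, in which only the first of five features is informative, is an illustrative assumption.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # five candidate features
y = 4 * X[:, 0] + rng.normal(0, 0.1, 100)  # only the first one matters

# The L1 penalty (alpha = λ) drives uninformative coefficients to zero
model = Lasso(alpha=0.1)
model.fit(X, y)
print(model.coef_)  # the four uninformative coefficients come out as 0.0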

Support Vector Regression (SVR)

Extends Support Vector Machines to regression tasks.

  • Fits a function so that most points lie within an epsilon-wide tube around it; errors smaller than epsilon are ignored.
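
A minimal sketch using scikit-learn's SVR; the epsilon value, the RBF kernel choice, and the sine-wave data are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 5, 40).reshape(-1, 1)
y = np.sin(X).ravel()

# Errors smaller than epsilon are ignored; the RBF kernel handles the curve
model = SVR(kernel="rbf", epsilon=0.1)
model.fit(X, y)
print(model.predict(np.array([[1.57]])))  # close to sin(1.57) ≈ 1.0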

Decision Tree Regression

Uses a tree-like structure to make predictions.

  • Splits the data on feature values; each leaf predicts the mean target value of its training samples.
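
A minimal sketch using scikit-learn's DecisionTreeRegressor; the step-function data and the max_depth setting are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A step function, which feature-value splits capture naturally
X = np.arange(10).reshape(-1, 1)
y = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0])

# The tree learns the split x <= 4.5, and each leaf predicts its mean target
model = DecisionTreeRegressor(max_depth=2)
model.fit(X, y)
print(model.predict(np.array([[2], [7]])))  # -> [1.0, 5.0]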

Random Forest Regression

An ensemble method that builds multiple decision trees.

  • Averages predictions from individual trees to improve accuracy and reduce variance.
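
A minimal sketch using scikit-learn's RandomForestRegressor; the number of trees and the noisy data are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.ravel() + rng.normal(0, 1.0, 200)  # noisy linear target (assumed)

# 100 trees are fit on bootstrap samples; their predictions are averaged
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)
print(model.predict(np.array([[5.0]])))  # roughly 10.0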

Applications of Regression

  • Predicting house prices: Based on features like size, location, and number of bedrooms.
  • Forecasting sales: Using historical sales data, marketing spend, and economic indicators.
  • Estimating stock prices: Analyzing market trends, company performance, and economic news.
  • Weather forecasting: Predicting temperature, rainfall, and wind speed.
  • Medical diagnosis: Estimating patient risk factors or disease progression.

Evaluation Metrics

Evaluating regression models is crucial for understanding their performance. Common metrics include:

  • Mean Squared Error (MSE): Average of the squared differences between predicted and actual values. Punishes larger errors more heavily.
  • MSE = (1/n) Σ(yᵢ - ŷᵢ)²
  • Root Mean Squared Error (RMSE): The square root of MSE. It's in the same units as the target variable, making it easier to interpret.
  • RMSE = √MSE
  • Mean Absolute Error (MAE): Average of the absolute differences between predicted and actual values. Less sensitive to outliers than MSE.
  • MAE = (1/n) Σ|yᵢ - ŷᵢ|
  • R-squared (Coefficient of Determination): The proportion of the variance in the dependent variable that is explained by the model. Its maximum is 1 (a perfect fit); a value of 0 means the model does no better than always predicting the mean, and it can even be negative for a model that fits worse than that.
  • R² = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²)
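
These four metrics are straightforward to compute by hand; here is a minimal NumPy sketch, with the prediction and target arrays as illustrative assumptions.

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 8.0])

mse = np.mean((y_true - y_pred) ** 2)             # Mean Squared Error
rmse = np.sqrt(mse)                               # Root Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))            # Mean Absolute Error
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                          # R-squared

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")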

Example: Simple Linear Regression in Python

Here's a complete example of how you might implement, evaluate, and use a simple linear regression model with scikit-learn.


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Sample Data (replace with your actual data)
# X: Independent variable (e.g., years of experience)
# y: Dependent variable (e.g., salary)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Intercept (β₀): {model.intercept_}")
print(f"Coefficient (β₁): {model.coef_[0]}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")

# Predict a new value
new_experience = np.array([[11]])
predicted_salary = model.predict(new_experience)
print(f"Predicted salary for 11 years experience: ${predicted_salary[0]:,.2f}")

# Expected Output (the sample data is perfectly linear, so the fit is exact;
# tiny floating-point noise may appear in the MSE):
# Intercept (β₀): 25000.0
# Coefficient (β₁): 5000.0
# Mean Squared Error (MSE): 0.0
# R-squared (R²): 1.0
# Predicted salary for 11 years experience: $80,000.00